by Roman Podoliaka, PyCon Ukraine 2016, Lviv, 23/04/2016
@rpodoliaka
Jupyter notebook: http://bit.ly/1qYFIJ2
I'm a development manager at Mirantis - I don't write much code, mostly debug (performance) issues
OpenStack, as a cloud platform, must work fast and be able to scale - performance / scale issues come up in every single service
when working on a performance issue, even experienced engineers tend to guess (wrongly) instead of using proven methods
They often say:
modern CPUs are "fast enough"
Python is one of slowest programming languages out there - if we wanted performance, we would choose C/C++/Rust/Java/Go/etc
... that's true but:
Python can be fast, if used properly
writing efficient code saves money in modern "cloud economy" (e.g. you need fewer VM instances or the same number of VMs for a shorter period of time, or just less expensive VMs)
even if you don't use clouds, a particular problem domain can have strict performance requirements (e.g. when you have to process a chunk of data in time before the next chunk arrives)
case study: Huffman coding, a basic data compression algorithm
the underlying idea: more common symbols are represented using fewer bits than less common ones
building the frequency table and the tree is a CPU-bound task
def count_freqs(f, size=1):
    """Returns a (symb -> freq) mapping for a given bytes stream"""
    freqs = {}
    for symb in iter(lambda: f.read(size), b''):
        if symb in freqs:
            freqs[symb] += 1
        else:
            freqs[symb] = 1
    return freqs
import functools

@functools.total_ordering
class Node(object):
    """Represents a node in a Huffman tree. If the symbol is None, this is an internal node."""

    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq = freq
        self.symb = symb
        self.left = left
        self.right = right

    def __eq__(self, rhs):
        return self.freq == rhs.freq

    def __lt__(self, rhs):
        return self.freq < rhs.freq
def build_tree(freqs):
    """Build a Huffman tree given a symbol frequencies map."""
    queue = [Node(freq=v, symb=k) for k, v in freqs.items()]
    while len(queue) > 1:
        queue.sort(reverse=True)
        right = queue.pop()
        left = queue.pop()
        queue.append(Node(freq=left.freq + right.freq, left=left, right=right))
    return queue.pop()
def huffman_prepare(f):
    return build_tree(count_freqs(f))
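For context (not needed for the benchmarks below): once the tree is built, the variable-length codes come from walking it. A sketch, with a minimal stand-in for the Node class above:

```python
class Node(object):
    # minimal stand-in for the Node class above
    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq, self.symb, self.left, self.right = freq, symb, left, right

def assign_codes(node, prefix='', codes=None):
    """Walk the tree: '0' for left edges, '1' for right edges."""
    if codes is None:
        codes = {}
    if node.symb is not None:  # leaf node carries a symbol
        codes[node.symb] = prefix or '0'
    else:
        assign_codes(node.left, prefix + '0', codes)
        assign_codes(node.right, prefix + '1', codes)
    return codes

# 'a' is twice as frequent as 'b' or 'c', so it gets the shorter code
tree = Node(freq=4,
            left=Node(freq=2, symb=b'a'),
            right=Node(freq=2,
                       left=Node(freq=1, symb=b'b'),
                       right=Node(freq=1, symb=b'c')))
print(assign_codes(tree))  # -> {b'a': '0', b'b': '10', b'c': '11'}
```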
# a piece of art
war_and_peace = open('data/War_and_Peace.txt', 'rb').read()
print(len(war_and_peace))
3226757
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.85 s per loop
~2 seconds for a 3 MB text file - that's not really fast.
How to make it run faster?
Start trying things at random!..
@functools.total_ordering
class Node(object):
    # __slots__ must be faster!
    __slots__ = ('freq', 'symb', 'left', 'right')

    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq = freq
        self.symb = symb
        self.left = left
        self.right = right

    def __eq__(self, rhs):
        return self.freq == rhs.freq

    def __lt__(self, rhs):
        return self.freq < rhs.freq
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.82 s per loop
import collections

def count_freqs(f, size=1):
    # defaultdict is definitely better!
    freqs = collections.defaultdict(int)
    for symb in iter(lambda: f.read(size), b''):
        freqs[symb] += 1
    return freqs
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.72 s per loop
import operator

def build_tree(freqs):
    queue = [Node(freq=v, symb=k) for k, v in freqs.items()]
    while len(queue) > 1:
        # key + operator.attrgetter make sorting fast!!!
        queue.sort(key=operator.attrgetter('freq'), reverse=True)
        right = queue.pop()
        left = queue.pop()
        queue.append(Node(freq=left.freq + right.freq, left=left, right=right))
    return queue.pop()
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.71 s per loop
... and developers keep walking in circles
Stop trying things at random!
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
...
In the face of ambiguity, refuse the temptation to guess.
make observations: benchmark your code and understand how much time each part of it takes to execute
choose the part to improve (ideally, you should be able to predict the performance effect of the improvement in advance)
analyze and improve the code part in question, based on the observations made
repeat until satisfied with the results (e.g. execution time)
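Outside Jupyter, the observation step can be as simple as using the stdlib timeit module (the workload below is just an assumed toy example):

```python
import timeit

def work():
    # assumed toy workload standing in for the code under test
    return sum(i * i for i in range(10000))

# repeat() gives several independent timings; the minimum is the most
# stable estimate (least disturbed by other processes)
timings = timeit.repeat(work, number=100, repeat=3)
print(min(timings) / 100)  # seconds per call
```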
Now you essentially need to get a profile of your code, i.e. to understand how it runs and where it spends time
user code is instrumented to get precise timings for function calls and returns (e.g. at compile time for languages like C or C++)
for interpreted / VM-powered languages, support for deterministic profiling is usually provided by the runtime
gives you not only precise timings for calls, but also the exact number of calls (the latter has an interesting use case: e.g. SQLAlchemy tests check for unexpected changes in function call counts)
the passed function will be called on each function call / return / exception raised
a profile function takes three arguments: frame, event, and arg
thread-specific (i.e. the hook must be set for each thread being profiled)
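In CPython this hook is installed with sys.setprofile(); a minimal sketch of a counting profile function (the fib workload is just an assumed example):

```python
import sys

calls = []

def tracer(frame, event, arg):
    # invoked by the interpreter on 'call', 'return', 'c_call', 'c_return', ...
    if event == 'call':  # count only Python-level function calls
        calls.append(frame.f_code.co_name)

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setprofile(tracer)
fib(5)
sys.setprofile(None)  # uninstall the hook

print(len(calls))  # number of Python function calls observed
```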
a pure-Python module which implements profiling of Python code
needs calibration to produce correct timings (https://docs.python.org/2.7/library/profile.html#calibration)
# https://github.com/python/cpython/blob/2.7/Lib/profile.py
def trace_dispatch(self, frame, event, arg):
    timer = self.timer
    t = timer()
    t = t[0] + t[1] - self.t - self.bias

    if event == "c_call":
        self.c_func_name = arg.__name__

    if self.dispatch[event](self, frame, t):
        t = timer()
        self.t = t[0] + t[1]
    else:
        r = timer()
        self.t = r[0] + r[1] - t  # put back unrecorded delta
same as profile, but implemented in C to reduce the profiling overhead
default choice when you need to profile Python code
python -m cProfile -s cumulative my_script.py
import cProfile
import re
cProfile.run('re.compile("foo|bar")', 'restats')
import cProfile
pr = cProfile.Profile()
pr.enable()
# ... do something ...
pr.disable()
back to the case study...
# profiling in Jupyter/iPython
%prun -s cumulative benchmark()
6454139 function calls in 3.258 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.258 3.258 benchmark
1 0.000 0.000 3.257 3.257 huffman_prepare
1 1.457 1.457 3.256 3.256 count_freqs
3226758 1.119 0.000 1.799 0.000 <lambda>
3226758 0.680 0.000 0.680 0.000 {method 'read' of '_io.BufferedReader' objects}
1 0.000 0.000 0.001 0.001 <ipython-input-160-4e0db1ec3d41>:4(build_tree)
87 0.001 0.000 0.001 0.000 {method 'sort' of 'list' objects}
# profiling in Jupyter/iPython
%prun -D data/cprofile.result benchmark()
*** Profile stats marshalled to file 'data/cprofile.result'.
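Marshalled stats can be inspected later with the stdlib pstats module; pstats.Stats accepts either a dump file path or a Profile object directly. A self-contained sketch with an assumed toy workload:

```python
import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
sorted(range(100000), reverse=True)  # assumed toy workload
pr.disable()

# a Stats object can be built from a dump file (e.g. the one saved with -D)
# or, as here, straight from a Profile instance
stats = pstats.Stats(pr)
stats.sort_stats('cumulative').print_stats(5)  # show the top 5 entries
```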
# now that we know where to look for improvement, let's reduce the number of read() calls
def count_freqs(f, size=65536):
    """Returns a (symb -> freq) mapping for a given bytes stream"""
    freqs = {}
    for chunk in iter(lambda: f.read(size), b''):
        for symb in chunk:
            if symb in freqs:
                freqs[symb] += 1
            else:
                freqs[symb] = 1
    return freqs
# now we are talking!
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 604 ms per loop
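A further variant, not timed in the talk: collections.Counter moves the per-byte counting loop into C as well. A sketch, assuming Python 3, where iterating over bytes yields ints:

```python
import collections
import io

def count_freqs(f, size=65536):
    """Count symbol frequencies; Counter.update() does the per-byte loop in C."""
    freqs = collections.Counter()
    for chunk in iter(lambda: f.read(size), b''):
        freqs.update(chunk)  # iterating bytes yields ints on Python 3
    return freqs

freqs = count_freqs(io.BytesIO(b'abracadabra'))
print(freqs[ord('a')])  # -> 5
```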
let's check how the cProfile report changed...
# profiling in Jupyter/iPython
%prun -s cumulative benchmark()
725 function calls in 0.684 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.680 0.680 0.682 0.682 count_freqs
51 0.000 0.000 0.002 0.000 <lambda>
51 0.002 0.000 0.002 0.000 {method 'read' of '_io.BufferedReader' objects}
1 0.000 0.000 0.001 0.001 build_tree
87 0.001 0.000 0.001 0.000 {method 'sort' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {built-in method iter}
a thread's call stack is examined at regular intervals
the sampling period can be tuned to achieve an acceptable execution overhead
function calls shorter than the sampling period will be missed
/Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(61)<module>()
-> benchmark()
/Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(57)benchmark()
-> return huffman_prepare(f)
/Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(52)huffman_prepare()
-> return build_tree(count_freqs(f))
> /Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(10)count_freqs()
-> if symb in freqs:
import collections
import signal

class Sampler(object):
    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def _sample(self, signum, frame):
        stack = []
        while frame is not None:
            formatted_frame = '{}({})'.format(frame.f_code.co_name,
                                              frame.f_globals.get('__name__'))
            stack.append(formatted_frame)
            frame = frame.f_back

        formatted_stack = ';'.join(reversed(stack))
        self.stack_counts[formatted_stack] += 1
        # re-arm the one-shot timer for the next sample
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)
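The class above never disarms the timer. A compact, self-contained variant with a stop() method and a usage sketch (Unix-only, since it relies on SIGVTALRM; the busy loop is an assumed toy workload):

```python
import collections
import signal
import time

class Sampler(object):
    """Same idea as the sampler above, plus a stop() method."""

    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def _sample(self, signum, frame):
        stack = []
        while frame is not None:
            stack.append('{}({})'.format(frame.f_code.co_name,
                                         frame.f_globals.get('__name__')))
            frame = frame.f_back
        # FlameGraph-style key: root-to-leaf frames joined with ';'
        self.stack_counts[';'.join(reversed(stack))] += 1
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def stop(self):
        signal.setitimer(signal.ITIMER_VIRTUAL, 0, 0)  # disarm the timer
        signal.signal(signal.SIGVTALRM, signal.SIG_IGN)

sampler = Sampler()
sampler.start()
deadline = time.process_time() + 0.1
while time.process_time() < deadline:  # burn ~100 ms of CPU time
    sum(i * i for i in range(1000))
sampler.stop()
print(sum(sampler.stack_counts.values()))  # number of samples taken
```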
a stack-sampling profiler written in Python
supports FlameGraph visualization (https://github.com/brendangregg/FlameGraph)
usage:
python -m plop.collector <program.py>
import plop.collector as plop
collector = plop.Collector(interval=0.01, mode="prof")
collector.start()
benchmark()
collector.stop()
formatter = plop.PlopFormatter()
formatter.store(collector, 'data/plop.result')
line-by-line profiling of Python code: https://github.com/rkern/line_profiler
statistical profiler for PyPy: https://vmprof.readthedocs.org/en/latest/
profiling of applications with perf_events: http://www.slideshare.net/brendangregg/scale2015-linux-perfprofiling
don't try to guess when unsure - use the existing methodologies
the performance bottleneck is usually not where you would expect it to be
pick the right tool for the job (exact numbers vs execution overhead)
Python can be fast. Let's profile our software before pointing a finger!
questions?
@rpodoliaka