by Roman Podoliaka, PyCon Ukraine 2016, Lviv, 23/04/2016
@rpodoliaka
Jupyter notebook: http://bit.ly/1qYFIJ2
I'm a development manager at Mirantis - I don't write much code, mostly debug (performance) issues
OpenStack, as a cloud platform, must work fast and be able to scale - performance / scale issues come up in every single service
when working on a performance issue, even experienced engineers tend to guess (wrongly) instead of using proven methods
They often say:
modern CPUs are "fast enough"
Python is one of slowest programming languages out there - if we wanted performance, we would choose C/C++/Rust/Java/Go/etc
... that's true but:
Python can be fast, if used properly
writing efficient code saves money in modern "cloud economy" (e.g. you need fewer VM instances or the same number of VMs for a shorter period of time, or just less expensive VMs)
even if you don't use clouds, a particular problem domain can have strict performance requirements (e.g. when you have to process a chunk of data in time before the next chunk arrives)
case study: Huffman coding, a basic data compression algorithm
the underlying idea: more common symbols are represented using fewer bits than less common ones
building the frequency table and the tree is a CPU-bound task
def count_freqs(f, size=1):
    """Returns a (symb -> freq) mapping for a given bytes stream"""
    freqs = {}
    for symb in iter(lambda: f.read(size), b''):
        if symb in freqs:
            freqs[symb] += 1
        else:
            freqs[symb] = 1
    return freqs
import functools

@functools.total_ordering
class Node(object):
    """Represents a node in a Huffman tree. If the symbol is None, this is an internal node."""

    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq = freq
        self.symb = symb
        self.left = left
        self.right = right

    def __eq__(self, rhs):
        return self.freq == rhs.freq

    def __lt__(self, rhs):
        return self.freq < rhs.freq
def build_tree(freqs):
    """Build a Huffman tree given a symbol frequencies map."""
    queue = [Node(freq=v, symb=k) for k, v in freqs.items()]
    while len(queue) > 1:
        queue.sort(reverse=True)
        right = queue.pop()
        left = queue.pop()
        queue.append(Node(freq=left.freq + right.freq, left=left, right=right))
    return queue.pop()
def huffman_prepare(f):
    return build_tree(count_freqs(f))
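For context (not needed for the benchmarks below): once the tree is built, the variable-length codes come from walking it. A sketch, with a minimal stand-in for the Node class above:

```python
class Node(object):
    # minimal stand-in for the Node class above
    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq, self.symb, self.left, self.right = freq, symb, left, right

def assign_codes(node, prefix='', codes=None):
    """Walk the tree: '0' for left edges, '1' for right edges."""
    if codes is None:
        codes = {}
    if node.symb is not None:  # leaf node carries a symbol
        codes[node.symb] = prefix or '0'
    else:
        assign_codes(node.left, prefix + '0', codes)
        assign_codes(node.right, prefix + '1', codes)
    return codes

# 'a' is twice as frequent as 'b' or 'c', so it gets the shorter code
tree = Node(freq=4,
            left=Node(freq=2, symb=b'a'),
            right=Node(freq=2,
                       left=Node(freq=1, symb=b'b'),
                       right=Node(freq=1, symb=b'c')))
print(assign_codes(tree))  # -> {b'a': '0', b'b': '10', b'c': '11'}
```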
# a piece of art
war_and_peace = open('data/War_and_Peace.txt', 'rb').read()
print(len(war_and_peace))
3226757
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.85 s per loop
~2 seconds for a 3 MB text file - that's not really fast.
How to make it run faster?
Start trying things at random!..
@functools.total_ordering
class Node(object):
    # __slots__ must be faster!
    __slots__ = ('freq', 'symb', 'left', 'right')

    def __init__(self, freq=0, symb=None, left=None, right=None):
        self.freq = freq
        self.symb = symb
        self.left = left
        self.right = right

    def __eq__(self, rhs):
        return self.freq == rhs.freq

    def __lt__(self, rhs):
        return self.freq < rhs.freq
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.82 s per loop
import collections

def count_freqs(f, size=1):
    # defaultdict is definitely better!
    freqs = collections.defaultdict(int)
    for symb in iter(lambda: f.read(size), b''):
        freqs[symb] += 1
    return freqs
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.72 s per loop
import operator

def build_tree(freqs):
    queue = [Node(freq=v, symb=k) for k, v in freqs.items()]
    while len(queue) > 1:
        # key + operator.attrgetter make sorting fast!!!
        queue.sort(key=operator.attrgetter('freq'), reverse=True)
        right = queue.pop()
        left = queue.pop()
        queue.append(Node(freq=left.freq + right.freq, left=left, right=right))
    return queue.pop()
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 1.71 s per loop
... and developers keep walking in circles
Stop trying things at random!
import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
...
In the face of ambiguity, refuse the temptation to guess.
make observations: benchmark your code and understand how much time each part of it takes to execute
choose the part to improve (ideally, you should be able to predict the performance effect of the improvement in advance)
analyze and improve the code part in question, based on the observations made
repeat until satisfied with the results (e.g. execution time)
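Outside Jupyter, the observation step can be as simple as using the stdlib timeit module (the workload below is just an assumed toy example):

```python
import timeit

def work():
    # assumed toy workload standing in for the code under test
    return sum(i * i for i in range(10000))

# repeat() gives several independent timings; the minimum is the most
# stable estimate (least disturbed by other processes)
timings = timeit.repeat(work, number=100, repeat=3)
print(min(timings) / 100)  # seconds per call
```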
Now you essentially need to get a profile of your code, i.e. to understand how it runs and where it spends time
user code is instrumented to get precise timings for function calls and returns (e.g. at compile time for languages like C or C++)
for interpreted / VM-powered languages, support for deterministic profiling is usually provided by the runtime
gives you not only precise timings for calls, but also the exact number of calls (the latter has an interesting use case: e.g. SQLAlchemy tests check for unexpected changes in function call counts)
the passed function will be called on each function call / return / exception raised
a profile function takes three arguments: frame, event, and arg
thread-specific (i.e. the hook must be set for each thread being profiled)
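In CPython this hook is installed with sys.setprofile(); a minimal sketch of a counting profile function (the fib workload is just an assumed example):

```python
import sys

calls = []

def tracer(frame, event, arg):
    # invoked by the interpreter on 'call', 'return', 'c_call', 'c_return', ...
    if event == 'call':  # count only Python-level function calls
        calls.append(frame.f_code.co_name)

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.setprofile(tracer)
fib(5)
sys.setprofile(None)  # uninstall the hook

print(len(calls))  # number of Python function calls observed
```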
a pure-Python module which implements profiling of Python code
needs calibration to produce correct timings (https://docs.python.org/2.7/library/profile.html#calibration)
# https://github.com/python/cpython/blob/2.7/Lib/profile.py
def trace_dispatch(self, frame, event, arg):
    timer = self.timer
    t = timer()
    t = t[0] + t[1] - self.t - self.bias

    if event == "c_call":
        self.c_func_name = arg.__name__

    if self.dispatch[event](self, frame, t):
        t = timer()
        self.t = t[0] + t[1]
    else:
        r = timer()
        self.t = r[0] + r[1] - t  # put back unrecorded delta
same as profile, but implemented in C to reduce the profiling overhead
default choice when you need to profile Python code
python -m cProfile -s cumulative my_script.py
import cProfile
import re
cProfile.run('re.compile("foo|bar")', 'restats')
import cProfile
pr = cProfile.Profile()
pr.enable()
# ... do something ...
pr.disable()
back to the case study...
# profiling in Jupyter/iPython
%prun -s cumulative benchmark()
6454139 function calls in 3.258 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3.258 3.258 benchmark
1 0.000 0.000 3.257 3.257 huffman_prepare
1 1.457 1.457 3.256 3.256 count_freqs
3226758 1.119 0.000 1.799 0.000 <lambda>
3226758 0.680 0.000 0.680 0.000 {method 'read' of '_io.BufferedReader' objects}
1 0.000 0.000 0.001 0.001 <ipython-input-160-4e0db1ec3d41>:4(build_tree)
87 0.001 0.000 0.001 0.000 {method 'sort' of 'list' objects}
# profiling in Jupyter/iPython
%prun -D data/cprofile.result benchmark()
*** Profile stats marshalled to file 'data/cprofile.result'.
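Marshalled stats can be inspected later with the stdlib pstats module; pstats.Stats accepts either a dump file path or a Profile object directly. A self-contained sketch with an assumed toy workload:

```python
import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
sorted(range(100000), reverse=True)  # assumed toy workload
pr.disable()

# a Stats object can be built from a dump file (e.g. the one saved with -D)
# or, as here, straight from a Profile instance
stats = pstats.Stats(pr)
stats.sort_stats('cumulative').print_stats(5)  # show the top 5 entries
```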
# now that we know where to look for improvement, let's reduce the number of read() calls
def count_freqs(f, size=65536):
    """Returns a (symb -> freq) mapping for a given bytes stream"""
    freqs = {}
    for chunk in iter(lambda: f.read(size), b''):
        for symb in chunk:
            if symb in freqs:
                freqs[symb] += 1
            else:
                freqs[symb] = 1
    return freqs
# now we are talking!
def benchmark():
    with open('data/War_and_Peace.txt', 'rb') as f:
        return huffman_prepare(f)
%timeit benchmark()
1 loop, best of 3: 604 ms per loop
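A further variant, not timed in the talk: collections.Counter moves the per-byte counting loop into C as well. A sketch, assuming Python 3, where iterating over bytes yields ints:

```python
import collections
import io

def count_freqs(f, size=65536):
    """Count symbol frequencies; Counter.update() does the per-byte loop in C."""
    freqs = collections.Counter()
    for chunk in iter(lambda: f.read(size), b''):
        freqs.update(chunk)  # iterating bytes yields ints on Python 3
    return freqs

freqs = count_freqs(io.BytesIO(b'abracadabra'))
print(freqs[ord('a')])  # -> 5
```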
let's check how the cProfile report changed...
# profiling in Jupyter/iPython
%prun -s cumulative benchmark()
725 function calls in 0.684 seconds
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.680 0.680 0.682 0.682 count_freqs
51 0.000 0.000 0.002 0.000 <lambda>
51 0.002 0.000 0.002 0.000 {method 'read' of '_io.BufferedReader' objects}
1 0.000 0.000 0.001 0.001 build_tree
87 0.001 0.000 0.001 0.000 {method 'sort' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {built-in method iter}
a thread's call stack is examined at regular intervals
the sampling period can be tuned to achieve an acceptable execution overhead
function calls shorter than the sampling period will be missed
/Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(61)<module>()
-> benchmark()
/Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(57)benchmark()
-> return huffman_prepare(f)
/Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(52)huffman_prepare()
-> return build_tree(count_freqs(f))
> /Users/malor/Dropbox/talks/uapycon-profiling/code/allinone.py(10)count_freqs()
-> if symb in freqs:
import collections
import signal

class Sampler(object):
    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def _sample(self, signum, frame):
        stack = []
        while frame is not None:
            formatted_frame = '{}({})'.format(frame.f_code.co_name,
                                              frame.f_globals.get('__name__'))
            stack.append(formatted_frame)
            frame = frame.f_back

        formatted_stack = ';'.join(reversed(stack))
        self.stack_counts[formatted_stack] += 1
        # re-arm the one-shot timer for the next sample
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)
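The class above never disarms the timer. A compact, self-contained variant with a stop() method and a usage sketch (Unix-only, since it relies on SIGVTALRM; the busy loop is an assumed toy workload):

```python
import collections
import signal
import time

class Sampler(object):
    """Same idea as the sampler above, plus a stop() method."""

    def __init__(self, interval=0.001):
        self.stack_counts = collections.defaultdict(int)
        self.interval = interval

    def _sample(self, signum, frame):
        stack = []
        while frame is not None:
            stack.append('{}({})'.format(frame.f_code.co_name,
                                         frame.f_globals.get('__name__')))
            frame = frame.f_back
        # FlameGraph-style key: root-to-leaf frames joined with ';'
        self.stack_counts[';'.join(reversed(stack))] += 1
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def start(self):
        signal.signal(signal.SIGVTALRM, self._sample)
        signal.setitimer(signal.ITIMER_VIRTUAL, self.interval, 0)

    def stop(self):
        signal.setitimer(signal.ITIMER_VIRTUAL, 0, 0)  # disarm the timer
        signal.signal(signal.SIGVTALRM, signal.SIG_IGN)

sampler = Sampler()
sampler.start()
deadline = time.process_time() + 0.1
while time.process_time() < deadline:  # burn ~100 ms of CPU time
    sum(i * i for i in range(1000))
sampler.stop()
print(sum(sampler.stack_counts.values()))  # number of samples taken
```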
a stack-sampling profiler written in Python
supports FlameGraph visualization (https://github.com/brendangregg/FlameGraph)
usage:
python -m plop.collector <program.py>
import plop.collector as plop
collector = plop.Collector(interval=0.01, mode="prof")
collector.start()
benchmark()
collector.stop()
formatter = plop.PlopFormatter()
formatter.store(collector, 'data/plop.result')
line-by-line profiling of Python code: https://github.com/rkern/line_profiler
statistical profiler for PyPy: https://vmprof.readthedocs.org/en/latest/
profiling of applications with perf_events: http://www.slideshare.net/brendangregg/scale2015-linux-perfprofiling
don't try to guess when unsure - use the existing methodologies
the performance bottleneck is usually not where you would expect it to be
pick the right tool for the job (exact numbers vs execution overhead)
Python can be fast. Let's profile our software before pointing a finger!
questions?
@rpodoliaka