High Performance Python
Cython
Cython is a superset of the Python programming language, designed to give C-like performance with code that is mostly written in Python. In short, it aims to combine the simplicity of Python with the efficiency of C. If you'd like some additional motivation to try it out, consider listening to this 20-minute-ish talk from PyCon. Youtube: Cython as a Game Changer for Efficiency
We can write Cython in notebooks by loading the Cython magic.
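In a notebook, this looks something like the following (the magics are standard IPython/Cython; the function is just a placeholder):

```python
%load_ext cython
```

```python
%%cython
# any cell that starts with the %%cython magic is compiled by Cython
def greet():
    return 'compiled by Cython'
```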
Static Typing
Cython extends the Python language with static type declarations. This increases speed because the compiled code no longer needs to perform type checks at runtime. The way we do this in Cython is by adding the cdef keyword.
We'll write a simple program that increments j by 1 a thousand times, and compare the speed difference after adding the type declaration.
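A minimal sketch of the comparison (hypothetical function names):

```python
%%cython
# untyped version: i and j stay Python objects, so every increment
# goes through Python's dynamic dispatch
def increment_python():
    j = 0
    for i in range(1000):
        j += 1
    return j

# typed version: cdef declares i and j as C ints, so the loop
# compiles down to plain C arithmetic
def increment_typed():
    cdef int i, j = 0
    for i in range(1000):
        j += 1
    return j
```

Timing both with %timeit is where the difference in units shows up.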
Notice the runtime difference (look at the units).
Functions
To declare functions we use the cpdef keyword, which makes the function callable both from Python and from C.
Notice that apart from declaring the function with the cpdef keyword, we also specify the return type as an integer and the two input arguments as integers.
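As a hypothetical example of the syntax:

```python
%%cython
# cpdef exposes the function to both Python and C callers; the
# return type and both arguments are declared as int
cpdef int multiply(int x, int y):
    return x * y
```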
There's still an overhead to calling functions, so if the function is small and sits inside a computationally expensive for loop, we can add the inline keyword to the function declaration. By doing this, the compiler will replace each function call with the function body itself, thus reducing the overhead of calling the function multiple times.
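A sketch of what that can look like (hypothetical names):

```python
%%cython
# inline asks the C compiler to paste the function body directly
# into each call site, removing the per-call overhead
cdef inline int add_one(int x):
    return x + 1

cpdef int count_up(int n):
    cdef int i, total = 0
    for i in range(n):
        total = add_one(total)  # inlined away in the generated C
    return total
```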
Numpy
Typed memoryviews allow even more efficient numpy manipulation since, again, they do not incur the Python overhead.
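For instance, a minimal sketch of the syntax:

```python
%%cython
import numpy as np

# double[:, ::1] declares a typed memoryview over a C-contiguous
# 2D float64 array; element access compiles to direct memory reads
cpdef double array_sum(double[:, ::1] arr):
    cdef int i, j
    cdef double total = 0.0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            total += arr[i, j]
    return total
```

Calling array_sum(np.random.rand(100, 100)) works because np.random.rand returns a C-contiguous float64 array, which matches the declared memoryview layout.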
Pairwise Distance Example
We'll start by implementing a pure Python version of the function, which gives us a baseline for comparison with the Cython alternatives below.
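The pure Python version looks roughly like this (a sketch of a pairwise Euclidean distance written with explicit loops; the function names are placeholders):

```python
import numpy as np

# distance between two 1-D rows, accumulated element by element
def euclidean_distance(x1, x2):
    d = 0.0
    for k in range(x1.shape[0]):
        tmp = x1[k] - x2[k]
        d += tmp * tmp
    return d ** 0.5

def pairwise_python(X):
    M = X.shape[0]
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            D[i, j] = euclidean_distance(X[i], X[j])
    return D
```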
We'll try re-writing this in Cython using typed memoryviews. The key with Cython is to avoid Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.
All the commented .pyx code can be found in the github folder. You can simply run python setup.py install to compile and install pairwise1.pyx and pairwise2.pyx.
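A hypothetical reconstruction of what pairwise1.pyx does (the actual commented code lives in the repo):

```python
# cython: boundscheck=False, wraparound=False
import numpy as np
from libc.math cimport sqrt

# typed helper: receives 1-D memoryview slices of the input array
cdef double euclidean_distance(double[::1] x1, double[::1] x2):
    cdef int k
    cdef double tmp, d = 0.0
    for k in range(x1.shape[0]):
        tmp = x1[k] - x2[k]
        d += tmp * tmp
    return sqrt(d)

def pairwise1(double[:, ::1] X):
    cdef int i, j
    cdef int M = X.shape[0]
    cdef double[:, ::1] D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            # X[i] and X[j] each create a new slice object here
            D[i, j] = euclidean_distance(X[i], X[j])
    return np.asarray(D)
```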
We can see the huge speedup over the pure Python version! It turns out, though, that we can do even better. If we look at the code, the slicing operations X[i] and X[j] generate a new slice object on every iteration. So this time, we will index into the X array directly instead of creating a new slice each time.
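A sketch of the pairwise2.pyx approach:

```python
# cython: boundscheck=False, wraparound=False
import numpy as np
from libc.math cimport sqrt

def pairwise2(double[:, ::1] X):
    cdef int i, j, k
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                # element access only, no intermediate slice objects
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
```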
We now try to utilize Cython's parallel functionality. For some reason, we can't compile the parallel version when following Cython's documentation on compiling code that utilizes OpenMP (a multithreading API); we'll come back to this in the future. Instead, we took a different route and installed it as if it were a package. You can simply run python setup_parallel.py install to install pairwise3.pyx.
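A hypothetical reconstruction of the pairwise3.pyx core (again, the actual code is in the repo; compiling it requires passing OpenMP flags such as -fopenmp in the setup script):

```python
# cython: boundscheck=False, wraparound=False
import numpy as np
from cython.parallel import prange
from libc.math cimport sqrt

def pairwise3(double[:, ::1] X):
    cdef int i, j, k
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M))
    # prange releases the GIL and spreads the outer loop across
    # OpenMP threads; writing d = d + ... (rather than +=) keeps d
    # a plain thread-private variable instead of a reduction
    for i in prange(M, nogil=True):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d = d + tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
```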
We've touched upon an example of utilizing Cython to speed up our CPU-intensive numerical operations. Though, to get the full advantage out of Cython, it's still good to know some C/C++ programming (things like the void type, pointers, and the standard library).
Numba
Numba is an LLVM-based compiler for Python code, which allows code written in Python to be converted into highly efficient compiled code at runtime. To use it, we simply add the @jit (just-in-time compilation) decorator to our function. We can add arguments to the decorator to specify the input types, but it is recommended to leave them out and simply let Numba decide when and how to optimize.
The @jit decorator tells Numba to compile the function. The argument types will be inferred by Numba when the function is called. If Numba can't infer the types, it will fall back to Python objects; when this happens, we probably won't see any significant speedup. The Numba documentation lists which Python and numpy features are supported.
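For instance, decorating the pure Python pairwise loop from earlier is all it takes (a sketch):

```python
import numpy as np
from numba import jit

# same explicit loops as the pure Python version; only the
# decorator is new, and Numba infers the types on the first call
@jit
def pairwise_numba(X):
    M, N = X.shape
    D = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
```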
A number of keyword-only arguments can be passed to the @jit decorator, e.g. nopython. Numba has two compilation modes: nopython mode and object mode. The former produces much faster code, but has limitations that can force Numba to fall back to the latter. To prevent Numba from falling back, and instead raise an error, pass nopython = True to the decorator, so it becomes @jit(nopython = True). Or we can be even lazier and simply use the @njit decorator.
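The two spellings are equivalent:

```python
from numba import jit, njit

# raises an error instead of silently falling back to object mode
@jit(nopython = True)
def f(x):
    return x + 1

# @njit is shorthand for @jit(nopython = True)
@njit
def g(x):
    return x + 1
```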
The latest version, 0.34.0 (released around mid-July 2017), also allows us to write parallel code by passing the parallel = True argument to the decorator and changing range to prange to perform explicit parallel loops. Note that we must ensure the loop does not have cross-iteration dependencies.
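A sketch of the parallel version (each iteration of the outer prange loop must be independent of the others):

```python
import numpy as np
from numba import njit, prange

@njit(parallel = True)
def pairwise_numba_parallel(X):
    M, N = X.shape
    D = np.empty((M, M))
    # prange tells Numba to distribute the outer loop across threads
    for i in prange(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
```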
Note that when we add the @njit decorator, we are marking a function for optimization by Numba's JIT (just-in-time) compiler, meaning the Python code is compiled on the fly into optimized machine code the first time we invoke the function. In other words, we will see an additional speed boost the next time we call the function, since we won't pay the initial compilation overhead.
A tiny change to the code yields a significant gain in speed! This functionality allows us to prototype algorithms with numpy without losing the speed of C/C++ code. If you've read up to this point, consider giving the numba project a star.
For more information, this is a pretty good PyData talk that illustrates the potential of Numba. Youtube: Numba: Flexible analytics written in Python.