%load_ext Cython
preprocess_data_step2.py
Speeding up your code with Cython
Cython
“Cython is a language that makes writing C extensions for the Python language as easy as Python itself”
Cython allows you to write C extensions to Python in a very pythonic manner. In this notebook we'll explore the syntax of Cython and see what kind of speed up we can achieve.
Cython provides an extension for Jupyter notebooks that makes working with Cython very convenient. If you enter %load_ext Cython
the %%cython
magic becomes available.
import numpy
import threadpoolctl
Let us start with something simple. The following function calculates the dot product between two vectors:
def pydot(v, w):
if len(v) == len(w) and len(v) > 0:
res = v[0] * w[0]
for i in range(1, len(v)):
res += v[i] * w[i]
return res
v = numpy.arange(0, 10000, dtype=int)
w = numpy.arange(0, 10000, dtype=int)
%%timeit a = range(10000); b = range(10000)
pydot(a, b)
pydot_timing = %timeit -o pydot(v, w)
Elementwise access to NumPy arrays is often slower as elementwise access to lists.
Now let us invoke Cython
%%cython
def dot(v, w):
if len(v) == len(w) and len(v) > 0:
res = v[0] * w[0]
for i in range(1, len(v)):
res += v[i] * w[i]
return res
%%timeit a = range(10000); b = range(10000)
dot(a, b)
dot1_timing = %timeit -o dot(v, w)
We can check what Cython does by invoking the cell magic with the option -a
.
%%cython -a
def dot(v, w):
if len(v) == len(w) and len(v) > 0:
res = v[0] * w[0]
for i in range(1, len(v)):
res += v[i] * w[i]
return res
This produces an annotated listing. The darker the yellow, the more Python code is in that line. You can click on each line to see the C code that was generated for it. Go ahead and explore.
There are many references to Python objects and functions. Basically, you only save the overhead of interpreting. Let's see if we can do a little better.
Declaring types
The arguments v
and w
are very general. If we know that we are only going to pass ndarrays of integers, we can be more specific:
%%cython -a
import numpy
cimport numpy
def dot(numpy.ndarray[numpy.int_t, ndim = 1] v, numpy.ndarray[numpy.int_t, ndim = 1] w):
cdef long res = 0
if len(v) == len(w) and len(v) > 0:
res = v[0] * w[0]
for i in range(1, len(v)):
res += v[i] * w[i]
return res
I also added cdef long res = 0
. This declares a C variable. C variables are not visible to a Python program. Here this doesn't matter since res is only used within the function.
dot2_timing = %timeit -o dot(v, w)
Note that cdef long res = 0
is white. It's translated directly into C. len(v)
is a Python call. It can be pulled out and assigned to an integer variable.
%%cython -a
import numpy as np
cimport numpy as np
def dot(np.ndarray[np.int_t, ndim = 1] v, np.ndarray[np.int_t, ndim = 1] w):
cdef long res = 0
cdef int n = 0
n = len(v)
if n > 0 and n == len(w):
for i in range(0, n):
res += v[i] * w[i]
return res
dot3_timing = %timeit -o dot(v, w)
If you look at line 10, you'll notice that there are some bounds checks, but they are unnecessary since we check the bounds beforehand already. Let's do away with them.
Skipping checks
%%cython -a
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
def dot(np.ndarray[np.int64_t, ndim = 1] v, np.ndarray[np.int64_t, ndim = 1] w):
cdef long res
cdef int i
cdef int n
n = len(v)
if n > 0 and n == len(w):
res = v[0] * w[0]
for i in range(1, n):
res += v[i] * w[i]
return res
dot4_timing = %timeit -o dot(v,w)
We can dispense of wraparound a[-1]->a[len(a)-1], too.
%%cython -a
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def dot(np.ndarray[np.int64_t, ndim = 1] v, np.ndarray[np.int64_t, ndim = 1] w):
cdef long res
cdef int i
cdef int n
n = len(v)
if n > 0 and n == len(w):
res = v[0] * w[0]
for i in range(1, n):
res += v[i] * w[i]
return res
dot5_timing = %timeit -o dot(v,w)
Adding compiler option
Cython magic
%%cython --compile-args=-O2 --compile-args=-march=native --compile-args=-w --force
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def dot(np.ndarray[np.int64_t, ndim = 1] v, np.ndarray[np.int64_t, ndim = 1] w):
cdef long res
cdef int i
cdef int n
n = len(v)
if n > 0 and n == len(w):
res = 0
for i in range(0, n):
res += v[i] * w[i]
return res
Sometimes we need (or want) to pass some options to the compiler. The cython magic can take compile-args and link-args that are passed to the underlying compiler. Later, we'll show you how to integrate all of this in a setup.py script to compile your extension outside of a notebook.
The documentation shows you, which options can be passed to the magic command.
dot5a_timing = %timeit -o -r 13 dot(v,w)
Adding OpenMP
Since Cython generates compiled Python extensions, we can release the GIL and run things in parallel if we don't make calls to the Python API.
As we've seen our inner loop is free of any Python calls (the annotated code is white). Since OpenMP supports reductions, we can parallelize the loop using Cython's prange
. Within prange
we have to explicitly release the GIL by setting nogil=True
. We also need to pass the compiler and linker flags for OpenMP.
%%cython --compile-args=-fopenmp --compile-args=-O2 --compile-args=-march=native --compile-args=-w --link-args=-fopenmp --force
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import parallel, prange
@cython.boundscheck(False)
@cython.wraparound(False)
def dot(np.ndarray[np.int64_t, ndim = 1] v, np.ndarray[np.int64_t, ndim = 1] w):
cdef long res
cdef int i
cdef int n
n = len(v)
if n > 0 and n == len(w):
res = v[0] * w[0]
for i in prange(1, n, nogil=True):
res += v[i] * w[i]
return res
# Let's limit the number of threads we use since the overhead for 256 threads would
# kill any chance of improved performance. You can play around with the number.
with threadpoolctl.threadpool_limits(16):
dot6_timing = %timeit -o dot(v,w)
Our data set is too small to benefit from parallelization. The overhead due to starting multiple threads is too large for this problem size.
Function declaration with def, cdef, and cpdef
In short, functions defined using def are visible from Python and treated like Python functions even within Cython. Functions defined with cdef are C functions and are not visible from Python at all. Functions declared with cpdef are C functions that can be called efficiently from within Cython but also have a Python wrapper so they can be accessed from Python.
Exercise
- Define a function
add
in Cython usingcdef
that takes two float values as arguments and returns the sum. - Try calling it from outside Cython
# Write your function here
Now use cpdef instead.
# Copy the function from above and use cpdef instead of cdef
Building a Cython extension outside of a notebook
So far we have used IPython and the Cython magic to build and test our extension within a notebook. Once we are satisfied and want to put our extension in production, we want to be able to build the extension without IPython. The recommended way to do that is to use the setuptools
provided with Cython and a setup.py
file. For details see the documentation.
Note that distutils has been marked as deprecated as of Python 3.10.
A simple setup.py file
from setuptools import setup
from Cython.Build import cythonize
setup(name="Sum of integers",
ext_modules=cythonize("sum.pyx"),
zip_safe=False
)
Adding compiler options
To add compiler options, for example, to compile with OpenMP, we need to set up an Extension first. Then, we use the Extension instead of the file name in the setup function. Our new setup.py file now looks like this:
from setuptools import Extension, setup
from Cython.Build import cythonize
ext_modules = [
Extension("sum",
["sum.pyx"],
extra_compile_args = ["-fopenmp"],
extra_link_args = ["-fopenmp"]
)
]
setup(name="Sum of integers",
ext_modules=cythonize(ext_modules),
zip_safe=False
)
Exercise: Take the Cython code that defines dot using prange
in Adding OpenMP and write it to dot.pyx
using the %%writefile
magic. Make sure to comment out the cython magic
. Take the above code for setup.py and copy it into a file called setup.py
. Change the setup.py code to build a module named dot and use dot.pyx
. Then build the extension in a terminal window with the command. Note: Make sure our environment is loaded source hpcpy21
.
python setup.py build_ext --inplace
If the build fails with #include "numpy/arrayobject.h" not found
, you need to add the include path for numpy. Luckily, numpy has a function for that: numpy.get_include()
. Add the include path to the extra_compile_args. Include paths are added using -I/path/to/be/included
. Since setup.py
is a Python script you can call numpy.get_include()
in the script and don't have to hardcode the path.
Write a test program that loads and tests the extension. Add a doc string to the dot function and include an example section like this:
def dot(...):
""" Add description and parameters ...
Examples
--------
>>> import numpy
>>> from dot import dot
>>> v = numpy.arange(0, 10, dtype=int)
>>> w = numpy.arange(0, 10, dtype=int)
>>> print(dot(v, w))
285
"""
Rebuild the extension and run doctest (python -m doctest -v dot.pyx
). Without -v
only failed tests are reported.
Comparison with Numba
Numba, can generate fast functions, too.
from numba import int64, jit
@jit(nopython=True)
def udot(v, w):
res = 0
for i in range(0, len(v)):
res += v[i] * w[i]
return res
from numba import int64, jit
@jit(nopython=False)
def udotg(v, w):
res = 0
for i in range(0, len(v)):
res += v[i] * w[i]
return res
numbadot_timing = %timeit -o udot(v,w)
numbadotg_timing = %timeit -o udotg(v,w)
%%timeit a = range(10000); b = range(10000)
udotg(a,b)
npdot_timing = %timeit -o np.dot(v,w)
tunits = {'ns':10**-9, 'us':10**-6, 'ms':10**-3, 's':1.0}
timers = [pydot_timing, dot1_timing, dot2_timing, dot3_timing, dot4_timing, dot5_timing,
dot6_timing, numbadot_timing, numbadotg_timing, npdot_timing]
times = np.array([t.best for t in timers])
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.bar(np.arange(len(times)) + 1, times * 1e6)
plt.ylabel(u"µs")
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],["Python", "Cython", "types1", "types2", "bounds", "wraparound", "parallel",
"Numba", "NumbaG", "np.dot"], rotation=45)
plt.subplot(1,2,2)
plt.bar(np.arange(2, len(times[2:]) + 2) + 1, times[2:] * 1e6)
plt.ylabel(u"µs")
plt.xticks([3, 4, 5, 6, 7, 8, 9, 10],["types1", "types2", "bounds", "wraparound", "parallel",
"Numba", "NumbaG", "np.dot"], rotation=45)
Finally, let's compare the performance for a larger data set. Remember the last version of our dot function uses OpenMP.
v = np.arange(10000000, dtype=int)
w = np.arange(10000000, dtype=int)
%timeit -n 1 pydot(v,w)
with threadpoolctl.threadpool_limits(16):
%timeit dot(v,w)
%timeit udot(v,w)
%timeit np.dot(v,w)
Cython and classes
Sometimes, we want to do more than just wrap a function. We might want an efficient data type that implements some operators, for example. For this Cython allows us to declare classes just like in Python:
%%cython
class Point:
"""A simple class representing a point in the x-y plane."""
def __init__(self, x, y):
self.x = x
self.y = y
def __repr__(self):
return "(%.1f, %.1f)" % (self.x, self.y)
p = Point(0,1)
print(p)
Extension types
There is a second type of classes called extension types. An extension type stores its members and methods in a C struct instead of a Python dictionary. This makes them more efficient but also more restrictive. Let's look at an example:
%%cython
from math import sqrt,asin,pi
cpdef double rad2deg(double alpha):
return 180 * alpha / pi
cdef class Point:
"""A point in 2d."""
cdef double x, y
cdef double distance2origin(Point self):
return sqrt(self.x * self.x + self.y * self.y)
cdef alpha(Point self):
return asin(self.y / self.distance2origin())
def __cinit__(self, t_x, t_y):
self.x = t_x
self.y = t_y
def __repr__(self):
return "(r=%.2f, alpha=%.2f°)" % (self.distance2origin(), rad2deg(self.alpha()))
The first thing to note is the definition using cdef class
. It's the reason extension types are also referred to as cdef classes. We can define functions that are only visible to C using cdef
and Python functions using def
(or both at once with cpdef
). For functions defined with cdef
, we need to give the type of self as well as a return type.
We can use the new class as follows:
p = Point(0.5, 1)
print(p)
Exercise
Try which methods of Point you can call.
Cython Tutorial at SciPy
The following video is from a half-day tutorial on Cython.
from IPython.display import YouTubeVideo
YouTubeVideo("gMvkiQ-gOW8")