{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Numba's jit compiler\n", "\n", "<div class=\"dateauthor\">\n", "08 June 2021 | Jan H. Meinke\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Numba provides a just-in-time (jit) compiler, a decorator `vectorize` that we can use to define `ufunc`s that are fast and flexible, and an interface to CUDA- and ROCm-capable GPUs that allows us to write CUDA kernels in Python! In this notebook, we'll focus on the jit compiler." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Python is an interpreted language" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "The Python interpreter parses and executes code when it encounters it. This makes it very flexible but it also precludes many optimizations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy\n", "from numba import jit\n", "from matplotlib import pyplot as plt " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## A simple example\n", "Let's start with a simple sum. In Python we may define the sum like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def python_sum(a, start=0):\n", " res = start\n", " for x in a:\n", " res += x\n", " return res" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When we call `python_sum`, the interpreter goes through it line by line. For each item it has to interpret `res += x` and execute it, i.e., call apropriate C routines that have been compiled for the processor. The only requirements for `a` in this function are that it is iterable and its elements support the `+` operator. For the following little benchmark, we'll use an `ndarray` of random numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "rng = numpy.random.Generator(numpy.random.MT19937())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "a = rng.uniform(size=10000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "t_python_sum = %timeit -o python_sum(a)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Please calculate the floating point operations per second for `python_sum`. Btw., remember the peak performance of a single core on JUWELS is about 40 GFLOP/s." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Compiled languages" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "In compiled languages such as C, C++, Fortran, and Rust a compiler translates the code once \n", "and stores the results in machine code for a particular processor. It doesn't have to translate it \n", "literally, but can look for optimization and map the work optimally to the capabilities of the \n", "processor. That's what makes this much faster but also less flexible. In C++ the sum may be written \n", "like this:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```cpp\n", "#include <vector>\n", "template <class T>\n", "auto cpp_sum(std::vector<T>& a, T start){\n", " auto res = start;\n", " for (auto x : a){\n", " res += x;\n", " }\n", " return res;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "With C++20 you could get even closer to the functionality of `python_sum`, but this will do for now.\n", "\n", "A part of the assembler code generated for the loop might look something like this:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```nasm\n", "..B1.14: # Preds ..B1.14 ..B1.13\n", " vaddpd ymm3, ymm3, YMMWORD PTR [rdi+rcx*8] #7.9\n", " vaddpd ymm2, ymm2, YMMWORD PTR [32+rdi+rcx*8] #7.9\n", " vaddpd ymm1, ymm1, YMMWORD PTR [64+rdi+rcx*8] #7.9\n", " vaddpd ymm0, ymm0, YMMWORD PTR [96+rdi+rcx*8] #7.9\n", " add rcx, 16 #6.19\n", " cmp rcx, rax #6.19\n", " jb ..B1.14 # Prob 82% #6.19\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Yes, there are good reasons to love Python (and other higher programming languages).\n", "\n", "Let's run the code:\n", "```\n", "$ ./simple_sum\n", "Sum: 5033.24 in 0.717281 µs. 13941.5 MFLOP. \n", "```\n", "\n", "The function takes about 0.7 µs. This is about 2000 times faster than the interpreted Python loop. \n", "Wouldn't it be great if we could take the Python code in `python_sum` and compile it to machine \n", "code to get some of this speedup?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Compiling a program to machine code involves at least two phases. In the first phase the human readable \n", "program is translated into something that is more palable to a computer such as an \n", "[abstract syntax tree][AST] or [AST][]. Many compilers then perform a first analysis and pattern \n", "matching on the [AST][] before machine specific optimization are applied.\n", "\n", "A popular open source framework for these steps is [LLVM][]. LLVM provides both the language parser, e.g., \n", "clang for C and clang++ for C++ and a toolchain for the language independent optimization steps. The \n", "language parser generates a well defined intermediate representation that can than be optimized. \n", "\n", "[Numba][] uses this infrastructure to optimize a subset of Python to machine code during program \n", "execution. These kind of live compiler are called just-in-time (JIT) compiler. \n", "\n", "[AST]: https://en.wikipedia.org/wiki/Abstract_syntax_tree\n", "[LLVM]: http://llvm.org/\n", "[Numba]: https://numba.pydata.org/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Python_sum compiled" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Let's see how well [Numba][] does with the sum. We imported `numba.jit` at the top of the notebook.\n", "The function `jit` returns a `callable`---which makes it usable as a `decorator` as well---that we can \n", "assign to a variable. The variable can then be used just like a regular function:\n", "\n", "[AST]: https://en.wikipedia.org/wiki/Abstract_syntax_tree\n", "[LLVM]: http://llvm.org/\n", "[Numba]: https://numba.pydata.org/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "numba_sum = jit(python_sum)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "%timeit -n 1 -r 1 numba_sum(a)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The first time a \"jitted\" function is called with a specific argument type, numba compiles the code, which takes fairly long. Future calls are much faster:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "t_jit_sum = %timeit -o numba_sum(a)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "This is quite an impressive speed up although not quite as fast as the compiled C++ code. Please, \n", "calculate the performance again." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Let's compare the performance with numpy's `sum`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "t_numpy_sum = %timeit -o numpy.sum(a)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "It's a little faster." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "print(f\"Numpy's sum achieves {1e-6 * a.shape[0] / t_numpy_sum.best:.3f} MFLOP/s\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A more complex example and nopython\n", "Numba likes simple expressions with simple loops:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def mm(a,b):\n", " res = numpy.zeros((a.shape[0], b.shape[1]))\n", " for row in range(a.shape[0]):\n", " for col in range(b.shape[1]):\n", " for k in range(a.shape[1]):\n", " res[row, col] += a[row, k] * b[k,col]\n", " return res" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "a = rng.uniform(size=(100, 100))\n", "b = rng.uniform(size=(100, 100))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "t_python_mm = %timeit -o mm(a,b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "t_numpy_mm = %timeit -o a.dot(b)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "OK, the Python loop is about 30000 times slower than numpy's `dot` method. Let's see if we can't make this faster using numba. This time, we'll use `jit` as a decorator." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "@jit\n", "def numba_mm(a,b):\n", " res = numpy.zeros((a.shape[0], b.shape[1]))\n", " for row in range(a.shape[0]):\n", " for col in range(b.shape[1]):\n", " for k in range(a.shape[1]):\n", " res[row, col] += a[row, k] * b[k,col]\n", " return res" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%timeit -n 1 -r 1 numba_mm(a,b) # Warmup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "t_numba_mm = %timeit -o numba_mm(a,b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "a = rng.uniform(size=(1000, 1000))\n", "b = rng.uniform(size=(1000, 1000))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "t_numba_mm_1000 = %timeit -o numba_mm(a,b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "t_numpy_mm_1000 = %timeit -o a.dot(b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "print(\"The version of numpy, we used has been compiled against Intel's math kernel library (MKL) and is therefore about %.0f times faster. \" \n", " \"If we used a version that has not been compiled against a fast BLAS library, it would take about the same time as the numba routine.\" \n", " % (t_numba_mm_1000.best / t_numpy_mm_1000.best))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Exercise: prange\n", "Numba can parallelize loops with ``prange``. Import ``prange`` from numba and change the range in row into a prange. You also need to add the arguments ``nopython=True`` and ``parallel=True`` to the jit decorator.\n", "\n", "Rerun and compare.\n", "\n", "**Extra credit:** Look at the changes we did to the matrix-matrix multiplication in the Bottlenecks notebook. Can you include them, too. Does it help? You might want to compare the results of your own calculation with that of numpy. The function ``numpy.allclose`` is useful for that." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Some other variations\n", "\n", "Let's try some other variations and how they do." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "@jit\n", "def numba_mm2(a,b):\n", " res = numpy.zeros((a.shape[0], b.shape[1]))\n", " for row in range(a.shape[0]):\n", " for col in range(b.shape[1]):\n", " res[row, col] = a[row].dot(b[:,col])\n", " return res" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%timeit -r 1 -n 1 numba_mm2(a,b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "t_numba_mm2_1000 = %timeit -o numba_mm2(a,b)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Most of the computation should now be done with numpy, but it's much harder to optimize a dot product than the matrix multiplication. Think back to our discussion about bottlenecks. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "@jit(nopython=True)\n", "def numba_mm3(a, b):\n", " res = numpy.zeros((a.shape[0], b.shape[1]))\n", " for row in range(a.shape[0]):\n", " for col in range(b.shape[1]):\n", " res[row, col] = a[row]@b[:,col]\n", " return res" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%timeit -r 1 -n 1 numba_mm3(a, b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "@jit(nopython = True)\n", "def numba_mm4(a, b):\n", " res = numpy.zeros((a.shape[0], b.shape[1]))\n", " for row in range(a.shape[0]):\n", " for col in range(b.shape[1]):\n", " for k in range(a.shape[1]):\n", " res[row, col] += a[row, k] * b[k,col]\n", " return res" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "numba_mm4(a, b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "t_numba_mm4_1000 = %timeit -o numba_mm4(a, b)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [ "20min" ] }, "source": [ "## Exercise: Mandelbrot with Numba\n", "Now, it's your turn again. Use Numba's jit operator to write a program to calculate the Mandelbrot set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Type inference" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "As mentioned above, [Numba][]'s jit operator doesn't directly return a compiled function, because \n", "it needs to figure out the type of every variable once a function has been called with specific \n", "arguments. It only compiles a function once for each type of arguments used to call it. The function\n", "signature is then stored in its `signatures` property.\n", "\n", "[Numba]: https://numba.pydata.org/" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "numba_sum.signatures" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Exercise: Call numba_sum with an ndarray of type float32" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Now there is a second function signature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "numba_sum.signatures" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Numba allows us to look at its type inferrence using " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "numba_sum.inspect_types(signature=numba_sum.signatures[1])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Now, this is interesting. If you look at Line 3 of the version called with float32, it still defines\n", "`res` as a double precision number! This will prevent it from vectorizing the loop using single \n", "precision arguments, which potentially cuts performance in half!\n", "\n", "Now, let's see if we can do better." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "numba_sum(a_s, numpy.float32(0.0))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "numba_sum.signatures" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "numba_sum.inspect_types(signature=numba_sum.signatures[2]) # Try adding pretty=True " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Now, res is of type `float32`. Let's see if this speeds up things." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%timeit numba_sum(a_s, numpy.float32(0.0))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Doesn't look like it. \n", "\n", "Let's dig a little deeper. A speedup would come from the fact that the Skylake-X processor used for \n", "JUWELS Cluster can operate on 16 single precision numbers at once compared to 8 double precision \n", "numbers, but that assumes it's using the right instructions. For that we have to look at the assembler.\n", "\n", "We define a helper function to find instructions in the assembler code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# This function has been borrowed from https://github.com/numba/numba-examples/blob/master/notebooks/simd.ipynb\n", "def find_instr(func, keyword, sig=0, limit=5):\n", " count = 0\n", " for l in func.inspect_asm(func.signatures[sig]).split('\\n'):\n", " if keyword in l:\n", " count += 1\n", " print(l)\n", " if count >= limit:\n", " break\n", " if count == 0:\n", " print('No instructions found')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "find_instr(numba_sum, \"add\", 2, limit=10)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "The instruction `vaddss` performs an addition of two single precision numbers. It's not vectorized \n", "and uses the smallest register available. For some reason the compilation wasn't done using the best\n", "available instruction set.\n", "\n", "The reason in this case is that we are doing a *reduction*. We are adding up all the values in `a`.\n", "When the compiler vectorizes this, the order of the addition may change. While this is not a problem\n", "if we work with real number, it is a problem for the finite precision math that computers do. The \n", "order of the addition may change the result.\n", "\n", "To allow reordering of operations, we add the option `fastmath=True`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "fast_numba_sum = jit(python_sum, fastmath=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%timeit fast_numba_sum(a_s, numpy.float32(0.0))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Now let's look at the assembler again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "find_instr(fast_numba_sum, \"vaddp\", sig=0, limit=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is much better. The `ps` at the end of `vaddps` stands for *packed single precision* indicating \n", "a SIMD instruction. The `ymm` registers used are 256 bits wide, which corresponds to 8 single precision\n", "numbers at a time.\n", "\n", "Skylake-X also has `zmm` registers with a width of 512 bit or 16 single precision numbers, but when\n", "they are used the maximum frequency of the processor is reduced. It can happen that the performance \n", "using `ymm` registers at higher frequency is actually better." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Exercise: Call fast_numba_sum with double precision\n", "\n", "Call `fast_numba_sum` with arguments of type float64 and measure the run time. Compare the result\n", "with the one obtained from `numba_sum`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "print(f\"The difference is {fast_numba_sum(a, numpy.float64(0.0)) - numba_sum(a, numpy.float64(0.0))}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "HPC Python 2021", "language": "python", "name": "hpcpy21" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 4 }