{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Numba and GPUs\n", "\n", "<div class=\"dateauthor\">\n", "23 June 2022 | Jan H. Meinke\n", "</div>" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Let's ignore some deprecation warnings\n", "from numba.core.errors import NumbaDeprecationWarning, NumbaPendingDeprecationWarning, NumbaPerformanceWarning\n", "import warnings\n", "\n", "warnings.simplefilter('ignore', category=NumbaDeprecationWarning)\n", "warnings.simplefilter('ignore', category=NumbaPendingDeprecationWarning)\n", "warnings.simplefilter('ignore', category=NumbaPerformanceWarning)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Ufunc" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We already learned how to vectorize a function. Remember the Mandelbrot set. We defined a function that returns the number of iterations needed to decide if the algorithm diverges." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from numba import vectorize\n", "\n", "def escape_time(p, maxtime):\n", " \"\"\"Perform the Mandelbrot iteration until it's clear that p diverges\n", " or the maximum number of iterations has been reached.\n", " \"\"\"\n", " z = 0j\n", " for i in range(maxtime):\n", " z = z ** 2 + p\n", " if abs(z) > 2:\n", " return i\n", " return maxtime" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import numpy\n", "x = numpy.linspace(-2, 2, 500)\n", "y = numpy.linspace(-1.5, 1.5, 375)\n", "zr, zc = numpy.meshgrid(x, 1j * y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "escape_time_v = vectorize(\"int64(complex128, int64)\", \n", " target=\"parallel\")(escape_time)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "t_cpu = %timeit -o escape_time_v(zr + zc, 500)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "If you replace `target=\"parallel\"` with `target=\"cuda\"` the function runs on the GPU instead. 
Give it a try and compare the performance for different sizes of the grid:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%writefile mandelbrot_vectorize_cuda.ipy\n", "from numba import vectorize\n", "\n", "def escape_time(p, maxtime):\n", " \"\"\"Perform the Mandelbrot iteration until it's clear that p diverges\n", " or the maximum number of iterations has been reached.\n", " \"\"\"\n", " z = 0j\n", " for i in range(maxtime):\n", " z = z ** 2 + p\n", " if abs(z) > 2:\n", " return i\n", " return maxtime\n", "\n", "if __name__ == \"__main__\":\n", " import numpy\n", " x = numpy.linspace(-2, 2, 500)\n", " y = numpy.linspace(-1.5, 1.5, 375)\n", " zr, zc = numpy.meshgrid(x, 1j * y)\n", " \n", "# TODO\n", "escape_time_g = Your code goes here\n", "\n", "t = %timeit -o escape_time_g(zr + zc, 500)\n", "print(t.all_runs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "res = !srun -p gpus -A training2219 ipython mandelbrot_vectorize_cuda.ipy\n", "t_gpu = numpy.array(eval(res[-1]))\n", "print(f\"Runtime: {t_gpu.mean():.3f}±{t_gpu.std():.3f} s.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## CUDA for Python" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When `vectorize` is not enough, you can write CUDA programs in Python using Numba as well.\n", "\n", "While a complete introduction to CUDA is beyond the scope of this course---there are other courses for this, for example, GPU Programming with CUDA @ JSC, as well as many online resources---here you'll get the nutshell version and some of the differences between CUDA C++ and CUDA Python." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## CPU vs. GPU" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## CPU vs. GPU" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "CPUs are optimized for latency." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "A CPU tries to execute a given instruction as quickly as possible, i.e., it tries to keep the latency (the time between issuing and executing an instruction) as short as possible. CPUs use caches and a lot of control logic to achieve this goal." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "GPUs are optimized for throughput." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## CPU vs. GPU" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "GPUs were (and are) made to display graphics on your screen. It doesn't matter how quickly a GPU can update a single pixel. What matters is how quickly it can update all of the pixels on the screen (more than 2 million on an HD display). In addition, it often has to perform the same operation on a large number of vertices or pixels. \n", "\n", "These two conditions led to a different execution model."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "## GPU execution model" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "GPUs use *many* lightweight threads." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "GPUs hide latency instead of avoiding it" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "GPUs work best if the problem can me mapped on a grid, but other models are possible." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Kernels" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We can calculate the Mandelbrot set using `escape_time` like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "maxiter = 20\n", "rlim = (-2.2, 1.5)\n", "ilim = (-1.5, 1.5)\n", "nx = 100\n", "ny = 75\n", "\n", "dx = (rlim[1] - rlim[0]) / nx\n", "dy = (ilim[1] - ilim[0]) / ny\n", "\n", "M = numpy.zeros((ny, nx), dtype=int)\n", "\n", "for i in range(ny):\n", " for j in range(nx):\n", " p = rlim[0] + j * dx + (ilim[0] + i * dy) * 1j\n", " M[i, j] = escape_time(p, maxiter)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.imshow(M, interpolation=\"nearest\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Notice that for every pair (i, j), we calculate the escape time. This makes\n", "\n", "```python\n", " p = rlim[0] + j * dx + (ilim[0] + i * dy) * 1j\n", " M[i, j] = escape_time(p, maxiter)\n", "```\n", "\n", "our kernel that we execute on the grid spanned by the two for loops. It's quite apparent on the image above, where the coordinates for each value are shown in index space." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Threads and blocks and grids, oh my!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "The basic unit of execution is a thread. All the threads of a kernel call execute the same code." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Thread blocks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Threads are organized in blocks. All threads in a block are executed on the same *streaming multiprocessor* (SM, comparable to a core on a CPU). They can share fast *shared memory* and synchronize easily. A thread block is pinned to the SM, i.e., once it was launched on a particular SM it stays there.\n", "\n", "A thread block can have one, two, or three dimensions to make mapping the problem to the thread model easier. The choice is up to the programmer.The maximum dimension of a thread block is 1024x1024x64, where the volume of the thread block (the number of threads in the block) must be 1024 or less." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Grid of thread blocks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Thread blocks are organized in a grid to cover all the threads needed to solve you problem. A grid can have one, two, or three dimensions. The dimensionality of the grid is in principle independent of the dimensionality of the blocks it contains. The maximum grid dimensions are 2147483647྾65535྾65535 blocks!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Thread ID" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Each thread has an x, y, and z index that can be queried in a kernel." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "``` python\n", "from numba import cuda\n", "\n", "@cuda.jit\n", "def my_kernel:\n", " x,y,z = cuda.grid(3) # use 2 if you only need x, y and 1 for x\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Writing a kernel" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "For writing a CUDA kernel, we use `cuda.jit` instead of `jit` as decorator (see above). The kernel can also call other functions that have been decorated with `cuda.jit(device=True)`. A Mandelbrot kernel could look like this:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```python\n", "from numba import cuda\n", "\n", "escape_time_gpu = cuda.jit(device=True)(escape_time)\n", "\n", "@cuda.jit\n", "def mandelbrot(M, real_min, real_max, imag_min, imag_max):\n", " \"\"\"Calculate the Mandelbrot set on the GPU.\n", " \n", " Parameters\n", " ----------\n", " M : numpy.ndarray\n", " a two-dimensional integer array that will contain the \n", " escape times for each point.\n", " real_min, real_max: float\n", " minimum and maximums value on the real axis\n", " imag_min, imag_max: float\n", " minimum and maximum value on the imaginary axis\n", " \"\"\"\n", " ny, nx = M.shape\n", " i, j = cuda.grid(2)\n", " \n", " if i < nx and j < ny:\n", " dx = (real_max - real_min) / nx\n", " dy = (imag_max - imag_min) / ny\n", " p = real_min + dx * i + (imag_min + dy * j) * 1j\n", " M[j, i] = escape_time_gpu(p, 20)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Note, that there is no return value. CUDA kernels have to return their result through an argument. In the above kernel, the result will be in M" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Calling a kernel" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "CUDA needs to know how many blocks and how many threads per block you want to launch. This is called the *launch configuration*. To calculate the Mandelbrot set with of 1024 by 1024 points with a block size of 32 by 32, we need 1024/32=32 by 1024/32=32 blocks." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```python\n", "M = numpy.zeros((1024, 1024), dtype=numpy.int32)\n", "block = (32, 32)\n", "grid = (M.shape[0] // block[0], M.shape[1] // block[1])\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The launch configuration is passed in square brackets before the function arguments." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```\n", "mandelbrot_gpu[grid, block](M, -2.2, 1.2, -1.6, 1.6)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "If the dimension of your system is not a *multiple of your block size*, you need to add an extra block. The following code snippet takes care of this. (There are other ways to do this, too.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "```python\n", "M = numpy.zeros((1000, 1000), dtype=numpy.int32)\n", "block = (32, 32)\n", "grid = (M.shape[0] // block[0] if M.shape[0] % block[0] == 0 \n", " else M.shape[0] // block[0] + 1,\n", " M.shape[1] // block[1] if M.shape[1] % block[1] == 0 \n", " else M.shape[1] // block[1] + 1)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Exercise: Profile the Mandelbrot calculation on the GPU" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "a) Use %timeit to measure the speed of the GPU implementation and compare it to the vectorized version." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "%%writefile cuda_mandelbrot1.ipy\n", "import numpy\n", "\n", "from numba import cuda\n", "\n", "@cuda.jit(device=True)\n", "def escape_time(p, maxtime):\n", " \"\"\"Perform the Mandelbrot iteration until it's clear that p diverges\n", " or the maximum number of iterations has been reached.\n", " \"\"\"\n", " z = 0j\n", " for i in range(maxtime):\n", " z = z ** 2 + p\n", " if abs(z) > 2:\n", " return i\n", " return maxtime\n", "\n", "\n", "@cuda.jit\n", "def mandelbrot(M, real_min, real_max, imag_min, imag_max):\n", " \"\"\"Calculate the Mandelbrot set on the GPU.\n", " \n", " Parameters\n", " ----------\n", " M : numpy.ndarray\n", " a two-dimensional integer array that will contain the \n", " escape times for each point.\n", " real_min: float\n", " minimum value on the real axis\n", " imag_min: float\n", " minimum value on the imaginary axis\n", " dx, dy: float\n", " step size along the real and imaginary axes.\n", " \"\"\"\n", " ny, nx = M.shape\n", " i, j = cuda.grid(2)\n", " \n", " if i < nx and j < ny:\n", " dx = (real_max - real_min) / M.shape[1]\n", " dy = (imag_max - imag_min) / M.shape[0]\n", " p = real_min + dx * i + (imag_min + dy * j) * 1j\n", " M[j, i] = escape_time(p, 20)\n", " \n", "if __name__ == \"__main__\":\n", " M = numpy.zeros((1024, 1024), dtype=numpy.int32)\n", " block = (32, 32)\n", " grid = (M.shape[0] // block[0], M.shape[1] // block[1])\n", " real_min, real_max = (-2.2, 1.2)\n", " imag_min, imag_max = (-1.6, 1.6)\n", "\n", " #TODO\n", " t = %timeit -o Your code goes here\n", " print(t.all_runs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "res = !srun -p gpus -A 
training2219 ipython cuda_mandelbrot1.ipy\n", "t_gpu = numpy.array(eval(res[-1]))\n", "print(f\"Runtime: {t_gpu.mean() * 1000:.3f}±{t_gpu.std() * 1000:.3f} ms.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "b) The kernel calculates dx and dy for every pixel although they are the same for all of them. Change the kernel so that it takes dx and dy as arguments, and calculate them once before you call the kernel. Does this improve the performance?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%writefile cuda_mandelbrot2.ipy\n", "import numpy\n", "\n", "from numba import cuda\n", "\n", "@cuda.jit(device=True)\n", "def escape_time(p, maxtime):\n", " \"\"\"Perform the Mandelbrot iteration until it's clear that p diverges\n", " or the maximum number of iterations has been reached.\n", " \"\"\"\n", " z = 0j\n", " for i in range(maxtime):\n", " z = z ** 2 + p\n", " if abs(z) > 2:\n", " return i\n", " return maxtime\n", "\n", "\n", "@cuda.jit\n", "def mandelbrot(M, real_min, real_max, imag_min, imag_max, dx, dy):\n", " \"\"\"Calculate the Mandelbrot set on the GPU.\n", " \n", " Parameters\n", " ----------\n", " M : numpy.ndarray\n", " a two-dimensional integer array that will contain the \n", " escape times for each point.\n", " real_min, real_max: float\n", " minimum and maximum values on the real axis\n", " imag_min, imag_max: float\n", " minimum and maximum values on the imaginary axis\n", " dx, dy: float\n", " step size along the real and imaginary axes.\n", " \"\"\"\n", " Your code goes here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "res = !srun -p gpus -A training2219 ipython cuda_mandelbrot2.ipy\n", "t_gpu = numpy.array(eval(res[-1]))\n", "print(f\"Runtime: {t_gpu.mean() * 1000:.3f}±{t_gpu.std() * 1000:.3f} ms.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "c) Add an additional argument `maxtime` to the kernel, so that you can time the kernel for different escape time values. Don't forget to add the new argument to the documentation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%writefile cuda_mandelbrot3.ipy\n", "import numpy\n", "\n", "from numba import cuda\n", "\n", "@cuda.jit(device=True)\n", "def escape_time(p, maxtime):\n", " \"\"\"Perform the Mandelbrot iteration until it's clear that p diverges\n", " or the maximum number of iterations has been reached.\n", " \"\"\"\n", " z = 0j\n", " for i in range(maxtime):\n", " z = z ** 2 + p\n", " if abs(z) > 2:\n", " return i\n", " return maxtime\n", "\n", "\n", "@cuda.jit\n", "def mandelbrot Your code goes here" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "res = !srun -p gpus -A training2219 ipython cuda_mandelbrot3.ipy\n", "t_gpu = numpy.array(eval(res[-1]))\n", "print(f\"Runtime: {t_gpu.mean() * 1000:.3f}±{t_gpu.std() * 1000:.3f} ms.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Explicit memory management" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Discrete GPUs have their own memory. 
Data needs to be transferred from the host to the device, and results have to be transferred back to the host as necessary. Numba takes care of the memory transfers, but it does so in a very conservative way: all data is transferred back to the host after a kernel finishes. This behavior can be avoided by [managing memory explicitly](http://numba.pydata.org/numba-doc/latest/cuda/memory.html#data-transfer)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Often, we want to copy an existing array to the GPU. For this we can use `numba.cuda.to_device`. This will allocate memory on the GPU and copy data from the CPU to the GPU, for example:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "```Python\n", "l = numpy.linspace(0, 10, 100)\n", "d_l = cuda.to_device(l)\n", "# Run kernel\n", "...\n", "# Copy data back\n", "l = d_l.copy_to_host()\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "You can allocate an empty array on the GPU using `numba.cuda.device_array`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```Python\n", "n = 1000\n", "d_random = cuda.device_array(n)\n", "# Run kernel\n", "...\n", "# Copy data back\n", "myRandom = d_random.copy_to_host()\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Exercise: Use explicitly managed memory for Mandelbrot" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The `M` array is transferred to the GPU before the kernel runs. This is not necessary. Use a device array instead. Compare the runtime to the previous version."
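] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "As a hint, here is a minimal, self-contained sketch of the pattern with a toy kernel (`fill_kernel` and its launch configuration are made up for illustration; they are not part of the exercise code):\n", "\n", "```python\n", "import numpy\n", "from numba import cuda\n", "\n", "@cuda.jit\n", "def fill_kernel(M):\n", "    # Toy kernel: each thread writes its own flattened grid index.\n", "    i, j = cuda.grid(2)\n", "    if i < M.shape[0] and j < M.shape[1]:\n", "        M[i, j] = i * M.shape[1] + j\n", "\n", "# Allocate an uninitialized array directly on the GPU; no host-to-device copy.\n", "d_M = cuda.device_array((1024, 1024), dtype=numpy.int32)\n", "fill_kernel[(32, 32), (32, 32)](d_M)\n", "# Copy the result back to the host once, after the kernel is done.\n", "M = d_M.copy_to_host()\n", "```"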
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%writefile cuda_mandelbrot4.ipy\n", "import numpy\n", "import warnings\n", "\n", "from numba import cuda\n", "\n", "@cuda.jit(device=True)\n", "def escape_time(p, maxtime):\n", " \"\"\"Perform the Mandelbrot iteration until it's clear that p diverges\n", " or the maximum number of iterations has been reached.\n", " \"\"\"\n", " z = 0j\n", " for i in range(maxtime):\n", " z = z ** 2 + p\n", " if abs(z) > 2:\n", " return i\n", " return maxtime\n", "\n", "\n", "@cuda.jit\n", "def mandelbrot(M, real_min, real_max, imag_min, imag_max, dx, dy, maxtime=20):\n", " \"\"\"Calculate the Mandelbrot set on the GPU.\n", " \n", " Parameters\n", " ----------\n", " M : numpy.ndarray\n", " a two-dimensional integer array that will contain the \n", " escape times for each point.\n", " real_min: float\n", " minimum value on the real axis\n", " imag_min: float\n", " minimum value on the imaginary axis\n", " dx, dy: float\n", " step size along the real and imaginary axes.\n", " maxtime: int\n", " maximum number of iterations before the point is considered part of\n", " the Mandelbrot set.\n", " \"\"\"\n", " ny, nx = M.shape\n", " i, j = cuda.grid(2)\n", " \n", " if i < nx and j < ny:\n", " p = real_min + dx * i + (imag_min + dy * j) * 1j\n", " M[j, i] = escape_time(p, maxtime)\n", " \n", "if __name__ == \"__main__\":\n", " # TODO\n", " M = Your code goes here\n", " \n", " t = %timeit -o mandelbrot[grid, block](M, -2.2, 1.2, -1.6, 1.6, dx, dy, maxtime)\n", " print(t.all_runs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "res = !srun -p gpus -A training2219 ipython cuda_mandelbrot4.ipy\n", "t_gpu = numpy.array(eval(res[-1]))\n", "print(f\"Runtime: {t_gpu.mean() * 1000:.3f}±{t_gpu.std() * 1000:.3f} ms.\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Exercise: Matrix multiplication" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Write a kernel, where each thread calculates one element of the result matrix $C=AB$, where A and B are matrices. Compare the result from the CPU and the GPU using `numpy.allclose` and write an appropriate message at the end. \n", "\n", "For the algorithm, you can take a look at Bottlenecks." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%writefile cuda_matrixmul.py\n", "import numpy\n", "import warnings\n", "from numba import cuda\n", "from numba.core.errors import NumbaPerformanceWarning\n", "\n", "# TODO\n", "Your code goes here\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" }, "tags": [] }, "outputs": [], "source": [ "!srun -p gpus -A training2219 python cuda_matrixmul.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Using shared memory" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "As you learned in Bottlenecks, the matrix matrix multiplication tends to be memory bandwidth bound. This is true on the GPU, too.\n", "\n", "The way to make it faster is to use faster memory. On a CPU this usually means, dividing the matrix into blocks that fit in cache and hope for the best. 
On a GPU, at least part of the fast memory is usually programmable. In CUDA this memory is called *shared memory*.\n", "\n", "Shared memory is available to all *threads in a thread block*. Usually, each thread loads data from device memory into shared memory. This is followed by a barrier, so that all threads have finished loading. The data in shared memory is then reused as often as possible. Another barrier makes sure that all threads are done before the data is overwritten." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Let's look at an example:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Matrix multiplication with shared memory" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "The following function performs a matrix multiplication on the GPU using shared memory." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```python\n", "from numba import cuda, float32\n", "# Controls threads per block and shared memory usage.\n", "# The computation will be done on blocks of TPBxTPB elements.\n", "TPB = 16\n", "\n", "@cuda.jit\n", "def fast_matmul(A, B, C):\n", " # Define an array in the shared memory\n", " # The size and type of the arrays must be known at compile time\n", " sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)\n", " sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)\n", "\n", " x, y = cuda.grid(2)\n", " tx = cuda.threadIdx.x\n", " ty = cuda.threadIdx.y\n", " bpg = cuda.gridDim.x # blocks per grid\n", "\n", " if x >= C.shape[0] or y >= C.shape[1]:\n", " # Quit if (x, y) is outside of valid C boundary\n", " return\n", "\n", " # Each thread computes one element in the result matrix.\n", " # The dot product is chunked into dot products of TPB-long vectors.\n", " tmp = 0.\n", " for i in range(bpg):\n", " # Preload data into shared memory\n", " sA[tx, ty] = A[x, ty + i * TPB]\n", " sB[tx, ty] = B[tx + i * TPB, y]\n", "\n", " # Wait until all threads finish preloading\n", " cuda.syncthreads()\n", "\n", " # Computes partial product on the shared memory\n", " for j in range(TPB):\n", " tmp += sA[tx, j] * sB[j, ty]\n", "\n", " # Wait until all threads finish computing\n", " cuda.syncthreads()\n", "\n", " C[x, y] = tmp\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "The above example shows most of the special functions used in CUDA kernels:\n", "\n", "All the CUDA-specific API is found in Numba's `cuda` module (line 1).\n", "\n", "CUDA kernels are defined like regular Python functions with the added decorator `@cuda.jit` (lines 6 and 7). The decorator makes sure that the function is compiled for the GPU.\n", "\n", "As mentioned above, CUDA kernels are executed as a grid of blocks of threads. `cuda.grid` (line 13) returns the global indices of the current thread, e.g., `x, y = cuda.grid(2)` for a two-dimensional grid. The argument gives the number of dimensions (1, 2, 3). 
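For example:\n\n```python\ni = cuda.grid(1)        # one-dimensional grid: a single index\nx, y = cuda.grid(2)     # two-dimensional grid\nx, y, z = cuda.grid(3)  # three-dimensional grid\n```\n\n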
This function does not exist in CUDA C++.\n", "\n", "In CUDA C++, the programmer usually calculates the global index from the thread index `threadIdx`, the block index `blockIdx`, and the size (dimension) of the block stored in `blockDim`:\n", "\n", "```C++\n", "int x = blockIdx.x * blockDim.x + threadIdx.x;\n", "```\n", "\n", "In Numba these are available through the `cuda` module, so that you can rewrite the C++ code above as\n", "\n", "```Python\n", "x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n", "```\n", "\n", "In `fast_matmul` these functions are used to get the local index of the thread, which is needed to use the shared memory (lines 14--16).\n", "\n", "Shared memory is a programmable cache accessible to all threads within a block. In C++ it is allocated within a kernel as\n", "\n", "```C++\n", "__shared__ float sA[TPB][TPB]; // Allocate a 2D array of TPB x TPB floats\n", "```\n", "\n", "The `cuda` module provides `cuda.shared.array` (lines 10 and 11) to do the same thing.\n", "\n", "The last function from the `cuda` module used in `fast_matmul` is `cuda.syncthreads()`, which implements a barrier for the threads within a block. It corresponds to `__syncthreads()` in CUDA C++.\n", "\n", "These few functions are all that most kernels need." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Not all CUDA features are implemented in Numba. Some missing features are listed at http://numba.pydata.org/numba-doc/dev/cuda/overview.html#missing-cuda-features. Currently they include dynamic parallelism and texture memory." ] } ], "metadata": { "kernelspec": { "display_name": "HPC Python 2022", "language": "python", "name": "hpcpy22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }