{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Introduction to CuPy\n",
    "<div class=\"dateauthor\">\n",
    "10 June 2021 | Jan H. Meinke\n",
    "</div>\n",
    "<img src=\"images/cupy.png\" style=\"float:right\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## CuPY\n",
    "### A NumPy-like interface to GPU programming"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> The best way to program a GPU is to let other people do the work!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "[CuPy][] provides a [NumPy][]-like interface to use and program GPUs.\n",
    "\n",
    "[CuPy]: https://cupy.dev/\n",
    "[NumPy]: https://numpy.org/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import cupy\n",
    "N = 2048\n",
    "A = cupy.random.random((N, N))\n",
    "B = cupy.random.random((N, N))\n",
    "C = A@B"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "This cell creates two random arrays of size N by N on the GPU and performs a matrix multiplication using Nvidia's optimized linar algebra library cuBLAS."
   ]
  },
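   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "As a quick, hedged sketch (not part of the original example): GPU operations are launched asynchronously, so a plain wall-clock measurement only captures the kernel launch. Synchronizing the device before stopping the clock gives the actual compute time. `A` and `B` are reused from the cell above."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import time\n",
     "\n",
     "# Sketch: time the matrix multiplication from the previous cell.\n",
     "# GPU calls return immediately, so synchronize before reading the clock.\n",
     "cupy.cuda.Device().synchronize()\n",
     "start = time.perf_counter()\n",
     "C = A @ B\n",
     "cupy.cuda.Device().synchronize()  # wait until the product is really done\n",
     "print(f\"{time.perf_counter() - start:.4f} s\")"
    ]
   },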
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Exercise\n",
    "In [Think Vector][TV], you [calculated the Mandelbrot set][TV_Mandelbrot] using [NumPy][] and vectorization. Take either your solution or ours and convert it to [CuPy][]. Visualize the result.\n",
    "\n",
    "Tip: If you get an error message when visualizing the results, take a look [below](#CuPy-Arrays).\n",
    "\n",
    "[CuPy]: https://cupy.dev/\n",
    "[NumPy]: https://numpy.org/\n",
    "[TV]: ThinkVector.ipynb\n",
    "[TV_Mandelbrot]: ThinkVector.ipynb#Programming-exercise-Mandelbrot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## GPU Libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "* cuBLAS\n",
    "* cuDNN\n",
    "* cuRand \n",
    "* cuSolver\n",
    "* cuSPARSE\n",
    "* cuFFT\n",
    "* NCCL "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "CuPy is the fastest and easiest way to use Nvidia's GPU libraries."
   ]
  },
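   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A small sketch of how these libraries are reached through the NumPy-like interface (which backend serves each call is an implementation detail of CuPy): `cupy.random` draws random numbers with cuRAND, `cupy.fft` uses cuFFT, and `cupy.linalg` relies on cuBLAS/cuSOLVER."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "# Each call below is served by one of the GPU libraries listed above.\n",
     "v = cupy.random.random(2**20)         # cuRAND\n",
     "spectrum = cupy.fft.fft(v)            # cuFFT\n",
     "M = cupy.random.random((512, 512))\n",
     "rhs = cupy.random.random(512)\n",
     "solution = cupy.linalg.solve(M, rhs)  # cuBLAS/cuSOLVER\n",
     "print(spectrum.shape, solution.shape)"
    ]
   },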
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Exercise\n",
    "a) Time the execution of C=A@B for different matrix sizes, e.g., 256, 512, 1024, 2048, 4096. Calculate the performance in GFLOP/s using gflops = 2e-9 * N ** 3 / t and store the sizes and the times in an ndarray.\n",
    "\n",
    "b) Do the same using numpy. How do the numbers compare."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## CuPy Arrays"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "CuPy arrays live on the GPU. To retrieve them you can use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A = cupy.random.random((N, N))\n",
    "A1_np = A.get()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "A2_np = cupy.asnumpy(A)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "The first command only works for GPU arrays, but the second can also be used for a NumPy array. If A is a GPU array, the data will be copied from the CPU to the GPU. If A is a NumPy array, no copy will be made and A2_np becomes a reference to A."
   ]
  },
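   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A minimal sketch illustrating this behavior (the variable names are only used for this check):"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import numpy\n",
     "\n",
     "a_np = numpy.ones(4)\n",
     "a_gpu = cupy.ones(4)\n",
     "\n",
     "print(cupy.asnumpy(a_np) is a_np)  # True: already on the host, no copy\n",
     "print(type(cupy.asnumpy(a_gpu)))   # numpy.ndarray: copied from the GPU\n",
     "# a_np.get() would raise an AttributeError: NumPy arrays have no get()"
    ]
   },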
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy\n",
    "x = numpy.linspace(0, 1, 10)\n",
    "x_gpu = cupy.asarray(x)  # Copy x to the GPU"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "To transfer data to the GPU use `cupy.asarray`. This doesn't just work with NumPy arrays, but also, for example, with lists."
   ]
  },
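   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "For example (a minimal sketch):"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "# cupy.asarray also accepts plain Python sequences, not just NumPy arrays.\n",
     "y_gpu = cupy.asarray([1.0, 2.0, 3.0])\n",
     "print(type(y_gpu), y_gpu.dtype)"
    ]
   },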
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "### Exercise\n",
    "Create a 2D array of random number on the GPU. Transfer it to the CPU. Subtract 0.5 from all elements. Copy the result back to the GPU. Clculate the average value of all elements on the GPU and write the result to the screen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Picking a Device"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "CuPy always works on the *current device*. On multi-GPU nodes this can be changed using"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "cupy.cuda.Device(1).use()\n",
    "A_on_gpu1 = cupy.random.random((N, N))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "If you need to switch devices regularly, you can use a `with` statement:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "with cupy.cuda.Device(2):\n",
    "    A_on_gpu2 = cupy.random.random((N, N))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Note, that you can only access `A_on_gpu2` while device 1 is active. Otherwise, you'll get an error message."
   ]
  },
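   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A minimal sketch, assuming the node has at least three GPUs (devices 0, 1, and 2): make device 2 current again before touching `A_on_gpu2`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "# Switch back to device 2 for operations on A_on_gpu2.\n",
     "with cupy.cuda.Device(2):\n",
     "    print(A_on_gpu2.sum())"
    ]
   },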
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "HPC Python 2021",
   "language": "python",
   "name": "hpcpy21"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}