{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# NumPy - an HPC perspective\n",
    "\n",
    "<div class=\"dateauthor\">\n",
    "20 June 2022 | Olav Zimmermann\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Python is an interpreted language and as such it is extremely flexible, allowing to define everything, including code itself, \n",
    "at runtime. This entails that the Python interpreter uses a lot of magic behind the scenes (e.g. type inferencing) to keep things\n",
    "as simple and productive as possible for the programmer. This flexibility comes at the price of a markedly reduced runtime speed \n",
    "compared to compiled languages such as C++.\n",
    "Many problems encoded in Python do not require the full range of flexibility and could easily trade it for improved runtime \n",
    "performance without giving up the development performance which is the key for Python's success."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Background"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "NumPy is a library, written in C, to enable fast numerical computing in Python. Since its inception in 1995, NumPy has become the first choice for numeric data processing in python. As numpy derives its speed from SIMD parallel processing, numpy requires containers with uniform memory layout, i.e., arrays that contain items of uniform static datatype."
   ]
  },
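   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A minimal sketch of why uniform element types pay off: the same elementwise operation is timed once as a pure Python loop over a list and once as a vectorized NumPy expression. Absolute timings depend on your hardware and NumPy build."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import time\n",
     "import numpy as np\n",
     "\n",
     "n = 1_000_000\n",
     "py_list = list(range(n))                  # generic Python container, one object per element\n",
     "np_arr = np.arange(n, dtype=np.float64)   # uniform, fixed-size elements in one buffer\n",
     "\n",
     "t0 = time.perf_counter()\n",
     "squares_list = [x * x for x in py_list]   # interpreted loop\n",
     "t1 = time.perf_counter()\n",
     "squares_arr = np_arr * np_arr             # single vectorized (SIMD-capable) C loop\n",
     "t2 = time.perf_counter()\n",
     "\n",
     "print(f'pure Python loop: {t1 - t0:.4f} s')\n",
     "print(f'NumPy vectorized: {t2 - t1:.4f} s')"
    ]
   },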
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Despite this apparently severe restriction NumPy is quite versatile and a full introduction into NumPy would require a\n",
    "course of its own. Hence we will focus on the concepts that allowed its wide adoption by developers and HPC vendors alike. For an overview of the available functionality the numpy website https://www.numpy.org is a good starting point."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "**Note:** NumPy is not a part of the vanilla Python distribution and needs to be installed separately. Then it can be used  like any other Python library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## The numpy BLAS and LAPACK bindings\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "NumPy does not implement everything itself. Instead it uses the fact that with [BLAS][] (low level routines), and [LAPACK][] (high level routines, based on BLAS), two widely adopted interfaces for linear algebra routines exist and that for almost all distributions and operation systems at least one library supporting the BLAS/LAPACK interface exists.\n",
    "\n",
    "[BLAS]: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms\n",
    "[LAPACK]: https://en.wikipedia.org/wiki/LAPACK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "While not mandatory, NumPy can be linked against high performance linear algebra libraries such as Intel's MKL, AMD's BLIS or IBM's ESSL following the BLAS/LAPACK APIs. These libraries can give you speed-ups of up to two orders of magnitude compared to the standard LAPACK library. In fact a number of distributions provide numpy implementations that are statically linked to Intel's MKL and when building, NumPy gives a preference to the optimized libraries (MKL, BLIS) over the slower, more hardware independent libraries (OpenBLAS, ATLAS).\n",
    "\n",
    "**Notes:** \n",
    "- Not all routines of MKL are necessarily faster than those in OpenBLAS, e.g. random number generation.\n",
    "- By default NumPy only links the linear algebra routines to MKL. On PyPi you can find packages that wrap MKL's FFT and random number generation functions in Python.\n",
    "- By default NumPy prefers Intel MKL over AMD BLIS, even on AMD processors. MKL works with AMD processors, but by default does not use the fast AVX2 and AVX512 BLAS routines that it uses when it has detected Intel hardware but only the much slower SSE2 routines. While there are hacks to change that behaviour it may therefore be more reasonable to change the order of preference for the different BLAS libraries on AMD platforms."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "You can check which BLAS library has been linked during your build of NumPy by calling the `show_config()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "print(np.__version__)\n",
    "print(np.show_config())"
   ]
  },
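   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "Which BLAS implementation is linked largely determines how fast linear algebra calls run. A small sketch: the `@` matrix-multiplication operator below dispatches to the linked BLAS (a dgemm call); the measured time is hardware and library dependent."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import time\n",
     "import numpy as np\n",
     "\n",
     "rng = np.random.default_rng(42)\n",
     "a = rng.random((2000, 2000))\n",
     "b = rng.random((2000, 2000))\n",
     "\n",
     "t0 = time.perf_counter()\n",
     "c = a @ b   # dispatched to the linked BLAS (dgemm)\n",
     "t1 = time.perf_counter()\n",
     "print(f'2000x2000 matmul took {t1 - t0:.3f} s')"
    ]
   },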
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## ndarray"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The main datatype in NumPy is an n-dimensional array object called `ndarray`. By default it has C-array memory layout \n",
    "(last index fastest, 0 based indices) and a uniform fixed element type defined at instantiation of the ndarray. Element sizes range from 1 byte (bool, uint8) to 16 byte (complex256) with the default being `float64`."
   ]
  },
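   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A short sketch of these properties; the itemsizes and strides shown in the comments assume a typical 64-bit platform."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "\n",
     "a = np.zeros((2, 3))                      # default element type is float64\n",
     "print(a.dtype, a.itemsize)                # float64 8\n",
     "print(np.dtype(np.uint8).itemsize)        # 1 byte\n",
     "print(np.dtype(np.complex128).itemsize)   # 16 bytes\n",
     "\n",
     "c_order = np.arange(6).reshape(2, 3)      # C layout: last index varies fastest\n",
     "f_order = np.asfortranarray(c_order)      # Fortran layout: first index varies fastest\n",
     "print(c_order.strides, f_order.strides)   # e.g. (24, 8) vs (8, 16) for int64 elements"
    ]
   },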
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### introspection, shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Several meta data of an `ndarray` can be obtained among them `a.ndim,a.size,a.nbytes,a.flags,type(a)`.\n",
    "The array interface enables in-depth introspection such as obtaining the address of an object, e.g. for debugging purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "myarr=np.array([[1,2,3],[4,5,6]])\n",
    "print(f\" myarr is a {myarr.ndim}-dimensional {type(myarr)} of type {myarr.dtype} and shape {myarr.shape}. It starts at adress {hex(myarr.__array_interface__['data'][0])}.\")\n",
    "print(f\"Each of its {myarr.size} {myarr.dtype}-elements has a size of {myarr.dtype.itemsize} byte therefore it has {myarr.nbytes} bytes in total and its strides (in bytes) are {myarr.strides}.\")\n",
    "print(\"The flags stored for this array are:\")\n",
    "print(myarr.flags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## NumPy's ndarray as a foundation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`ndarrays` support important concepts to enable vectorized processing of data, such as broadcasting, ufuncs, and reductions. Why just use this for numeric data? Almost anything can be cast into an `ndarray` and also labeled access is possible. "
   ]
  },
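   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A minimal sketch of broadcasting, ufuncs, and reductions on a plain numerical `ndarray`:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "\n",
     "m = np.arange(12, dtype=np.float64).reshape(3, 4)\n",
     "row_means = m.mean(axis=1, keepdims=True)   # reduction along one axis -> shape (3, 1)\n",
     "centered = m - row_means                    # broadcasting: (3, 4) minus (3, 1)\n",
     "scaled = np.exp(centered)                   # ufunc applied elementwise\n",
     "print(scaled.sum())                         # full reduction to a scalar"
    ]
   },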
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "dt=np.dtype([('name','S10'),('income',np.int16),\\\n",
    "             ('revenue',([('q1',np.float32),('q2',np.float32),('q3',np.float32),('q4',np.float32)]))])\n",
    "f=np.loadtxt('test.dat',dtype=dt, skiprows=2,delimiter=',')\n",
    "print(f)\n",
    "print(f\"this array has {f.ndim} dimension(s)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Note that most functionality in NumPy is geared towards arrays of uniform numerical type. The above data type indicates that the entire row is treated as a single (albeit very complicated) type and the data in the table thereby become a one-dimensional array. If you need spreadsheet-like functionality, NumPy's 'structured arrays' or 'record type' arrays that are based on `ndarrays` are not adequate. Luckily `ndarrays` can be extended in functionality and even reimplemented in a compatible way."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Design features of the ndarray API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Due to its dual interface in Python and C and its often (not always) regular memory layout `ndarrays` can relatively easily be mapped to data types described in other language contexts. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "A couple of clever design choices adds to the popularity of ndarrays as a standard API for high performance computing in Python:\n",
    "- ndarrays support both C-style and Fortran-style memory layouts, as well as memory alignment and view interfaces.\n",
    "- A Mixin class `NDArrayOperatorsMixin` provides all the necessary standard python interfaces like `__add__`, `__lt__` etc. making custom array containers more convenient to write.\n",
    "- `ndarray.__array_ufunc__` and `__array_function__` are interfaces that split the function API from the excution engine. By overriding them, these functions allow to intercept calls to ufuncs and non-ufuncs and to dispatch these calls to alternative implementations. Several important packages already use this approach."
   ]
  },
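   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A minimal sketch of this dispatch mechanism (the class `LoggedArray` is purely illustrative, not an established API): the container below inherits the standard operators from `NDArrayOperatorsMixin` and overrides `__array_ufunc__` to intercept ufunc calls and forward them to plain ndarrays."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "from numpy.lib.mixins import NDArrayOperatorsMixin\n",
     "\n",
     "class LoggedArray(NDArrayOperatorsMixin):\n",
     "    # illustrative wrapper that logs every ufunc call before delegating to NumPy\n",
     "    def __init__(self, data):\n",
     "        self._data = np.asarray(data)\n",
     "\n",
     "    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):\n",
     "        # unwrap LoggedArray inputs, log the call, delegate to NumPy, rewrap the result\n",
     "        raw = [x._data if isinstance(x, LoggedArray) else x for x in inputs]\n",
     "        print(f'intercepted {ufunc.__name__}.{method}')\n",
     "        result = getattr(ufunc, method)(*raw, **kwargs)\n",
     "        return LoggedArray(result) if isinstance(result, np.ndarray) else result\n",
     "\n",
     "    def __repr__(self):\n",
     "        return f'LoggedArray({self._data!r})'\n",
     "\n",
     "a = LoggedArray([1.0, 2.0, 3.0])\n",
     "print(a + 1)       # operators from the mixin dispatch to ufuncs (np.add)\n",
     "print(np.sqrt(a))  # direct ufunc calls are intercepted as well"
    ]
   },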
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Packages that implement the ndarray interface:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<table>\n",
    "    <tr><th>package/module</th><th>extension(s)</th><th>other HPC relevant characteristics</th></tr>\n",
    "    <tr><td><code><b><a href=\"https://cupy.dev/\">CuPy</a></b></code></td><td>cuPy array: subset of ndarray functionality</td><td>ndarrays on GPUs</td></tr>\n",
    "    <tr><td><code><b><a href=\"https://pandas.pydata.org/\">pandas</a></b></code></td><td>pandas dataframe: labeled data, non-numerical data, multi-indexing, grouping</td><td>fast CSV-parser</td></tr>\n",
    "    <tr><td><code><b><a href=\"https://github.com/rapidsai/cudf\">cuDF</a></b></code></td><td>cuDF dataframe: subset of pandas dataframe functionality</td><td>pandas dataframes on GPUs</td></tr>\n",
    "    <tr><td><code><b><a href=\"https://www.dask.org/\">dask</a></b></code></td><td>dask array: only subset of ndarray functionality</td><td>tiled ndarrays larger than main memory, distributed processing on multiple nodes</td></tr>\n",
    "    <tr><td><code><b><a href=\"https://www.dask.org/\">dask</a></b></code></td><td>dask dataframe: only subset of pandas dataframe functionality</td><td>tiled dataframes larger than main memory, distributed processing on multiple nodes</td></tr>\n",
    "    <tr><td><code><b><a href=\"https://docs.rapids.ai/api/cudf/nightly/user_guide/10min.html\">dask-cuDF</a></b></code></td><td>cuDF dataframe: subset of pandas dataframe functionality</td><td>tiled dataframes on multiple GPUs and multiple nodes</td></tr>\n",
    "    <tr><td><code><b><a href=\"https://sparse.pydata.org/en/0.13.0/\">sparse</a></b></code></td><td>ndarray functionality on sparse arrays (COO layout)</td><td></td></tr>\n",
    "    <tr><td><code><b><a href=\"https://docs.scipy.org/doc/scipy/reference/sparse.html\">SciPy.sparse</a></b></code></td><td>ndarray functionality on sparse arrays (all layouts)</td><td></td></tr>\n",
    "    </table>"
   ]
  },
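   {
    "cell_type": "markdown",
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "source": [
     "A brief illustration of the shared interface, assuming the optional `dask` package is installed: the same NumPy-style expressions work on a tiled dask array and are only evaluated on `.compute()`."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
     "slideshow": {
      "slide_type": "skip"
     }
    },
    "outputs": [],
    "source": [
     "import dask.array as da\n",
     "\n",
     "# lazily build a 10000 x 10000 array split into 1000 x 1000 tiles\n",
     "x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))\n",
     "y = (x + x.T).mean(axis=0)   # broadcasting, ufuncs and reductions as with NumPy\n",
     "print(y[:5].compute())       # nothing is computed until .compute() is called"
    ]
   },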
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "One notable package that does not implement the ndarray interface is `numpy.matrix` while most ML frameworks have recently added experimental (i.e. incomplete) interfaces or are working on them like [JAX/XLA][], [tensorflow][], and [pytorch][]. As data structures in these packages differ regarding their allowed access (many packages implement tensors as immutable types) and their memory alignment requirements (some require 16 byte alignments) their functionality often covers only a small subset of the full ndarray capabilities. There are a lot of discussions ongoing how to arrive at a more universal array/tensor interface (see e.g. [here][]). \n",
    "\n",
    "[here]: https://discuss.scientific-python.org/t/a-proposed-design-for-supporting-multiple-array-types-across-scipy-scikit-learn-scikit-image-and-beyond/131/18\n",
    "[JAX/XLA]: https://github.com/google/jax/issues/4486\n",
    "[pytorch]: https://github.com/pytorch/pytorch/issues/54138\n",
    "[tensorflow]: https://www.tensorflow.org/guide/tf_numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "End"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "HPC Python 2022",
   "language": "python",
   "name": "hpcpy22"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}