"- Peak performance of a socket is 24 * 80 GFlop/s = 1920 GFlop/s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"This is the limit most people think of first, but it's often not the crucial one. Each core on JUWELS can perform ca. 80 GFlop/s if the code is completely *vectorized* and performs a *multiply and an add operation* at *each step*. If your code doesn't fulfill those requirements, its peak performance will be lower.\n",
"\n",
"Actually, the peak performance is often less, because the CPU temperature limit typically leads to throttling when all cores are in use. The guaranteed sustained frequency for the JUWELS FMA units is 1.9 GHz.\n",
"\n",
"I assume that we are dealing with double-precision numbers (8 bytes). Then I have to read 3 * 8 bytes = 24 bytes and write 8 bytes. This is a multiply-add operation, so each core can do 40 billion of those per second, but it only receives 120 GB/s. 120 GB/s / 24 B/op = 5 Gop/s (10 GFLOP/s). This operation is clearly memory bound if we have to get all the data from main memory."
]
},
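{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"As a sanity check, the estimate above can be computed directly. This is just a sketch of the arithmetic; the 120 GB/s bandwidth and the 24 bytes of operands are the figures assumed in the text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Bandwidth-bound estimate for a streaming multiply-add on doubles.\n",
"bytes_read = 3 * 8        # three double-precision operands per multiply-add\n",
"bandwidth = 120e9         # bytes/s, the figure assumed in the text\n",
"ops = bandwidth / bytes_read   # multiply-adds per second\n",
"print(ops / 1e9, \"Gop/s =\", 2 * ops / 1e9, \"GFLOP/s\")"
]
},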
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Matrix multiplication"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"This is quite common. Let's look at a matrix multiplication $C=AB$. To calculate the element i, j of the result matrix C, we multiply row i of A with column j of B and sum the results. This is the scalar or dot product of row i of A and column j of B. In code this looks like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy\n",
"\n",
"def dot(a, b):\n",
" \"\"\"Multiply the matrix a with the matrix b.\n",
" \n",
" Parameters\n",
" ----------\n",
" a: ndarray\n",
" left matrix\n",
" b: ndarray\n",
" right matrix\n",
" \n",
" Return\n",
" ------\n",
" c: ndarray\n",
" result matrix\n",
" \"\"\"\n",
" c = numpy.zeros((a.shape[0], b.shape[1])) \n",
" for i in range(a.shape[0]): \n",
" for j in range(b.shape[1]): \n",
" for k in range(a.shape[1]): \n",
" c[i, j] += a[i, k] * b[k, j]\n",
" return c "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Let's take two small matrices A and B and see how long the above function takes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"n = 256\n",
"a = numpy.random.random((n, n))\n",
"b = numpy.random.random((n, n))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"t_dot = %timeit -o dot(a, b)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A matrix multiplication of two n by n matrices performs $2n^3$ operations. Dividing by the measured time gives the rate the dot function achieves.\n",
"\n",
"From our estimate above, we should be able to get at least twice this, but that assumes we can achieve the maximum memory bandwidth.\n",
"\n",
"A numpy ndarray uses C order (row-major order) for storage. This means that the last index is contiguous in memory and a change in any other index results in a jump from one memory location to another. With this loop order, b is accessed with a stride of n elements: we jump from one cache line to the next and use only a single element of each, so we don't get the maximum bandwidth.\n",
"\n",
"A datum is not loaded by itself. Instead, every time a datum is needed that is not available in cache, the whole cache line containing it is loaded. On JUWELS a cache line is 64 bytes wide.\n",
"\n",
"We can improve the performance by exchanging the loops:"
]
},
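{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The achieved rate of dot follows from the $2n^3$ operation count and the timing above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(2e-9 * n**3 / t_dot.best, \"GFLOP/s\")"
]
},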
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy\n",
"from numba import jit\n",
"\n",
"@jit\n",
"def dot2(a, b):\n",
" \"\"\"Multiply the matrix a with the matrix b.\n",
" \n",
" Parameters\n",
" ----------\n",
" a: ndarray\n",
" left matrix\n",
" b: ndarray\n",
" right matrix\n",
" \n",
" Return\n",
" ------\n",
" c: ndarray\n",
" result matrix\n",
" \"\"\"\n",
" c = numpy.zeros((a.shape[0], b.shape[1])) \n",
" for i in range(a.shape[0]): \n",
" for k in range(a.shape[1]): \n",
" for j in range(b.shape[1]): \n",
" c[i, j] += a[i, k] * b[k, j]\n",
" return c "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Now, elements in c and b are accessed in the proper order and a[i, k] is constant within the inner loop. This changes our estimate, because we now move only 16 bytes per multiply-add. This gives us a maximum of 120 GB/s / 16 B/op = 7.5 Gop/s, i.e., 15 GFLOP/s."
]
},
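{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The revised bound, again a sketch using the 120 GB/s figure assumed in the text:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Revised bound: per multiply-add we now move c[i, j] and b[k, j],\n",
"# i.e. 16 bytes, at the bandwidth assumed in the text.\n",
"ops = 120e9 / 16\n",
"print(ops / 1e9, \"Gop/s =\", 2 * ops / 1e9, \"GFLOP/s\")"
]
},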
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Effect on matrix multiplication"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"t_dot2 = %timeit -o dot2(a, b)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(2e-9 * n**3 / t_dot2.best, \"GFLOP/s\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"This is better than what we would expect from the bandwidth estimate, probably due to caching. One way to test this is to make the matrices larger, so that they no longer fit in the cache.\n",
"\n",
"Once main memory is the bottleneck, the measurement is quite close to my estimate and shows that I have to increase the number of operations per byte loaded from main memory. Improving vectorization or using multiple threads wouldn't help.\n",
"\n",
"To improve cache utilization, we have to change the algorithm. One way to improve the performance of the matrix multiplication is blocking (aka tiling). This is done, e.g., in OpenBLAS and Intel's Math Kernel Library (MKL)."