From 81d4b6a65bf09e59349d362bf338c27d09be4e06 Mon Sep 17 00:00:00 2001 From: "Jan H. Meinke" <j.meinke@fz-juelich.de> Date: Wed, 31 Mar 2021 12:58:41 +0200 Subject: [PATCH] Remove check points. --- .../00_Bottlenecks-checkpoint.ipynb | 788 ------------------ 1 file changed, 788 deletions(-) delete mode 100644 solutions/.ipynb_checkpoints/00_Bottlenecks-checkpoint.ipynb diff --git a/solutions/.ipynb_checkpoints/00_Bottlenecks-checkpoint.ipynb b/solutions/.ipynb_checkpoints/00_Bottlenecks-checkpoint.ipynb deleted file mode 100644 index c13468d..0000000 --- a/solutions/.ipynb_checkpoints/00_Bottlenecks-checkpoint.ipynb +++ /dev/null @@ -1,788 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "# Bottlenecks\n", - "\n", - "<div class=\"dateauthor\">\n", - "16 Nov 2020 | Jan H. Meinke\n", - "</div>" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## High-performance computing is computing at the limit" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "CPU" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "Memory" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "I/O" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## CPU\n", - "### On JUWELS" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "\n", - "Each core has the following features:\n", - "\n", - "- processor frequency of 2.7 GHz (up to 3.7 GHz in turbo mode)\n", - "- 512 bit wide vector unit (8 double precision numbers)\n", - "- 2 fused multiply-add (FMA) units, each performing an FMA operation in a single cycle (each FMA counts as 2 floating point operations)\n", - "- When running all cores, each FMA unit runs at 2.5 GHz (max)\n", - "- Peak performance is 2.5 \\* 8 \\* 2 \\* 2 GFlop/s = 80 GFlop/s" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "There are 24 cores per socket\n", - "- Peak performance of a socket is 24 * 80 GFlop/s = 1920 GFlop/s" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "This is the limit most people think of first, but it's often not the crucial one. Each core on JUWELS can perform ca. 80 GFlop/s if the code is completely *vectorized* and performs a *multiply and an add operation* at *each step*. If your code doesn't fulfill those requirements, its peak performance will be lower.\n", - "\n", - "Actually, the peak performance is often lower, because the CPU temperature limit typically leads to throttling soon after all cores are put under load. The guaranteed sustained frequency for the JUWELS FMA units is 1.9 GHz."
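, - "\n", - "As a quick check (a small addition that only re-uses the numbers above), we can evaluate the same formula for the 2.5 GHz maximum and the 1.9 GHz sustained frequency:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "# Sketch: per-core peak = frequency * vector width * FMA units * flops per FMA\n", - "for f in (2.5, 1.9):  # all-core maximum and guaranteed sustained FMA frequency in GHz\n", - "    print(\"%.1f GHz: %.1f GFlop/s per core\" % (f, f * 8 * 2 * 2))"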
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Memory hierarchy\n", - "### Caches" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "* L1 (per core):\n", - " - 32 kiB data cache (+ 32 kiB instruction cache)\n", - " - Fastest memory (128 (load) + 64 (store)) B/cycle (latency: 4-5 cycles)\n", - "* L2 (per core): \n", - " - 1 MiB\n", - " - 64 B/cycle (latency: 14 cycles)\n", - "* L3 (shared): \n", - " - 33 MiB\n", - " - ?? B/cycle (latency: 50-70 cycles)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "\n", - "### Main memory" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "The memory bandwidth of a JUWELS node is about " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "120 GB/s. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## A simple operation" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "c = c + a * b (multiply-add)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "3 DP read, 1 DP write -> 24 bytes read, 8 bytes write" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "120 GB/s / 24 bytes/op = 5 Gop/s (multiply-add -> 10 GFLOP/s)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "I assume that we are dealing with double precision numbers (8 bytes). Then I have to read 3 * 8 bytes = 24 bytes and write 8 bytes. This is a multiply-add operation, so each core can do 40 billion of those per second, but it only receives 120 GB/s. 120GB/s / 24 B/op = 5Gop/s (10 GFLOP/s). This operation is clearly memory bound, if we have to get all the data from main memory." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## Matrix multiplication" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "This is quite common. Let's look at a matrix multiplication $C=AB$. To calculate the element i, j of the result matrix C, we multiply row i of A with column j of B and sum the results. This is the scalar or dot product of row i of A and column j of B. 
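Each element of C therefore needs $n$ multiplications and $n$ additions, which is where the $2n^3$ operations used further down come from.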
In code this looks like this:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "import numpy\n", - "\n", - "def dot(a, b):\n", - " \"\"\"Multiply the matrix a with the matrix b.\n", - " \n", - " Parameters\n", - " ----------\n", - " a: ndarray\n", - " left matrix\n", - " b: ndarray\n", - " right matrix\n", - " \n", - " Return\n", - " ------\n", - " c: ndarray\n", - " result matrix\n", - " \"\"\"\n", - " c = numpy.zeros((a.shape[0], b.shape[1])) \n", - " for i in range(a.shape[0]): \n", - " for j in range(b.shape[1]): \n", - " for k in range(a.shape[1]): \n", - " c[i, j] += a[i, k] * b[k, j]\n", - " return c " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "Let's take two small matrices A and B and see how long the above function takes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "n = 256\n", - "a = numpy.random.random((n, n))\n", - "b = numpy.random.random((n, n))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "t_dot = %timeit -o dot(a,b)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "A matrix multiplication of two n by n matrices performs $2n^3$ operations. The dot function achieves" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "print(\"%.3f GFLOP/s\" % (2e-9 * n**3 / t_dot.best)) " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "source": [ - "Wow, that's bad. Let's see if we can make this faster." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Numba" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "from numba import jit\n", - "jdot = jit(dot)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "t_jit = %timeit -o jdot(a, b)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "print(\"%.3f GFLOP/s\" % (2e-9 * n**3 / t_jit.best)) " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Access order and cache lines" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "From our estimate above, we should be able to get at least twice this, but that's assuming we can achieve the maximum memory bandwidth. \n", - "\n", - "A numpy ndarray uses C-order (row-order) for storage. This means that the last index is continuous in memory and a change in any other index results in a jump from one memory location to another. 
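For example, for the 256 by 256 float64 matrices created above, a.strides is (2048, 8): moving to the next column advances 8 bytes, while moving to the next row jumps 2048 bytes.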
The order of the loops therefore means that for both c and b, we don't get the maximum bandwidth, because we jump around and only use one element of the cache line. \n", - "\n", - "A datum is not loaded by itself. Instead, every time a datum is needed that is not available in cache, the whole cache line containing the datum is loaded. On JUWELS the cache line is 64 bytes wide. \n", - "\n", - "We can improve the performance by exchanging the loops:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "import numpy; from numba import jit\n", - "\n", - "@jit\n", - "def dot2(a, b):\n", - " \"\"\"Multiply the matrix a with the matrix b.\n", - " \n", - " Parameters\n", - " ----------\n", - " a: ndarray\n", - " left matrix\n", - " b: ndarray\n", - " right matrix\n", - " \n", - " Return\n", - " ------\n", - " c: ndarray\n", - " result matrix\n", - " \"\"\"\n", - " c = numpy.zeros((a.shape[0], b.shape[1])) \n", - " for i in range(a.shape[0]): \n", - " for k in range(a.shape[1]): \n", - " for j in range(b.shape[1]): \n", - " c[i, j] += a[i, k] * b[k, j]\n", - " return c " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "Now, elements in c and b are accessed in the proper order, and a[i, k] is constant for the inner loop. This changes our estimate, because now we only read 16 bytes/op. This gives us a maximum of 120 GB/s / 16 bytes/op = 7.5 Gop/s (15 GFLOP/s)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "### Effect on matrix multiplication" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "t_dot2 = %timeit -o dot2(a, b)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "print(2e-9 * n**3 / t_dot2.best, \"GFLOP/s\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "This is better than what we would expect from the bandwidth estimate, probably due to caching. One way to test this is to make the matrix larger, so that it doesn't fit in cache anymore." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "n = 2048\n", - "a = numpy.random.random((n,n))\n", - "b = numpy.random.random((n,n))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "t_dot2_large = %timeit -o dot2(a,b)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "print(2e-9 * n**3 / t_dot2_large.best, \"GFLOP/s\") " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "This is quite close to my estimate and shows that I have to increase the number of operations per byte loaded from main memory.
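With 16 bytes loaded per multiply-add, the arithmetic intensity is only 2 FLOP / 16 bytes = 0.125 FLOP per byte, so the core spends most of its time waiting for data.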
Improving vectorization or using multiple threads wouldn't help.\n", - "\n", - "To improve cache utilization, we have to change the algorithm. One way to improve the performance of the matrix multiplication is blocking (aka tiling). This is done, e.g., in OpenBLAS or Intel's Math Kernel Library (MKL)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "## Numpy" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "Let's see how long numpy takes for this:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "t_numpy = %timeit -o numpy.dot(a, b)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "outputs": [], - "source": [ - "print(2e-9 * n**3 / t_numpy.best, \"GFLOP/s\") " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "The numpy version we use here is linked against a fast math library. That's what you want." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## The roofline model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "The roofline model shows the memory-bandwidth bound and the compute bound as a function of the computational intensity. The computational intensity is the number of operations performed divided by the number of bytes transferred." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "Depending on your algorithm, different limits may be relevant. For example, we only used a single thread, but the plot uses the peak performance of the entire processor with 24 cores. If the data fits completely into the L2 cache, the available bandwidth is higher once the data has been loaded. The following shows a plot with a few more limits." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "subslide" - } - }, - "source": [ - "" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "source": [ - "## I/O" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "### GPFS File System\n", - "$\\mathcal{O}(200)$ GB/s read/write speed for scratch" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "fragment" - } - }, - "source": [ - "Each node is connected to the file system with $\\mathcal{O}(100)$ GBit/s or about 12.5 GB/s." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "slideshow": { - "slide_type": "notes" - } - }, - "source": [ - "The scratch file system achieves read/write bandwidths that are very similar to the main memory bandwidth, but not for a single node. Each node is connected to the GPFS file system with a $\\mathcal{O}(100)$ GBit/s connection. In other words, we can read/write about 12.5 GB/s. If we had to load the data in the previous calculation from disk, we could only achieve 12.5 GB/s / 24 bytes/op = 520 Mop/s.
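That is roughly a tenth of the 5 Gop/s we estimated when the data comes from main memory.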
The main memory bandwidth or the peak performance of the CPU doesn't matter in this case." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "slideshow": { - "slide_type": "skip" - } - }, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "HPC Python 2020", - "language": "python", - "name": "hpcpy20" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.2" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} -- GitLab