"- Peak performance of a socket is 24 * 80 GFlop/s = 1920 GFlop/s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"This is the limit most people think of first, but it's often not the crucial one. Each core on JUWELS can perform ca. 80 GFlop/s if the code is completely *vectorized* and performs a *multiply and an add operation* at *each step*. If your code doesn't fulfill those requirements, its peak performance will be lower.\n",
"\n",
"Actually, the peak performance is often less, because the CPU temperature limit typically leads to throttling when all cores are in use. The guaranteed sustained frequency for the JUWELS FMA units is 1.9 GHz.\n",
"\n",
"I assume that we are dealing with double-precision numbers (8 bytes). Then I have to read 3 * 8 bytes = 24 bytes and write 8 bytes. This is a multiply-add operation, so each core can do 40 billion of those per second, but it only receives 120 GB/s. 120 GB/s / 24 B/op = 5 Gop/s (10 GFLOP/s). This operation is clearly memory bound if we have to get all the data from main memory."
]
},
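{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"As a sanity check, the estimate above can be computed directly. This is just a sketch of the arithmetic; the 120 GB/s bandwidth and the 24 bytes of operands are the figures assumed in the text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Bandwidth-bound estimate for a streaming multiply-add on doubles.\n",
"bytes_read = 3 * 8        # three double-precision operands per multiply-add\n",
"bandwidth = 120e9         # bytes/s, the figure assumed in the text\n",
"ops = bandwidth / bytes_read   # multiply-adds per second\n",
"print(ops / 1e9, \"Gop/s =\", 2 * ops / 1e9, \"GFLOP/s\")"
]
},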
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Matrix multiplication"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"This is quite common. Let's look at a matrix multiplication $C=AB$. To calculate the element i, j of the result matrix C, we multiply row i of A with column j of B and sum the results. This is the scalar or dot product of row i of A and column j of B. In code this looks like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy\n",
"\n",
"def dot(a, b):\n",
" \"\"\"Multiply the matrix a with the matrix b.\n",
" \n",
" Parameters\n",
" ----------\n",
" a: ndarray\n",
" left matrix\n",
" b: ndarray\n",
" right matrix\n",
" \n",
" Return\n",
" ------\n",
" c: ndarray\n",
" result matrix\n",
" \"\"\"\n",
" c = numpy.zeros((a.shape[0], b.shape[1])) \n",
" for i in range(a.shape[0]): \n",
" for j in range(b.shape[1]): \n",
" for k in range(a.shape[1]): \n",
" c[i, j] += a[i, k] * b[k, j]\n",
" return c "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Let's take two small matrices A and B and see how long the above function takes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"n = 256\n",
"a = numpy.random.random((n, n))\n",
"b = numpy.random.random((n, n))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"t_dot = %timeit -o dot(a, b)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A matrix multiplication of two n by n matrices performs $2n^3$ operations. Dividing by the measured time gives the rate the dot function achieves.\n",
"\n",
"From our estimate above, we should be able to get at least twice this, but that assumes we can achieve the maximum memory bandwidth.\n",
"\n",
"A numpy ndarray uses C order (row-major order) for storage. This means that the last index is contiguous in memory and a change in any other index results in a jump from one memory location to another. With this loop order, b is accessed with a stride of n elements: we jump from one cache line to the next and use only a single element of each, so we don't get the maximum bandwidth.\n",
"\n",
"A datum is not loaded by itself. Instead, every time a datum is needed that is not available in cache, the whole cache line containing it is loaded. On JUWELS a cache line is 64 bytes wide.\n",
"\n",
"We can improve the performance by exchanging the loops:"
]
},
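{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The achieved rate of dot follows from the $2n^3$ operation count and the timing above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(2e-9 * n**3 / t_dot.best, \"GFLOP/s\")"
]
},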
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy\n",
"from numba import jit\n",
"\n",
"@jit\n",
"def dot2(a, b):\n",
" \"\"\"Multiply the matrix a with the matrix b.\n",
" \n",
" Parameters\n",
" ----------\n",
" a: ndarray\n",
" left matrix\n",
" b: ndarray\n",
" right matrix\n",
" \n",
" Return\n",
" ------\n",
" c: ndarray\n",
" result matrix\n",
" \"\"\"\n",
" c = numpy.zeros((a.shape[0], b.shape[1])) \n",
" for i in range(a.shape[0]): \n",
" for k in range(a.shape[1]): \n",
" for j in range(b.shape[1]): \n",
" c[i, j] += a[i, k] * b[k, j]\n",
" return c "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Now, elements in c and b are accessed in the proper order and a[i, k] is constant within the inner loop. This changes our estimate, because we now move only 16 bytes per multiply-add. This gives us a maximum of 120 GB/s / 16 B/op = 7.5 Gop/s, i.e., 15 GFLOP/s."
]
},
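{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The revised bound, again a sketch using the 120 GB/s figure assumed in the text:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Revised bound: per multiply-add we now move c[i, j] and b[k, j],\n",
"# i.e. 16 bytes, at the bandwidth assumed in the text.\n",
"ops = 120e9 / 16\n",
"print(ops / 1e9, \"Gop/s =\", 2 * ops / 1e9, \"GFLOP/s\")"
]
},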
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Effect on matrix multiplication"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"t_dot2 = %timeit -o dot2(a, b)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(2e-9 * n**3 / t_dot2.best, \"GFLOP/s\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"This is better than what we would expect from the bandwidth estimate, probably due to caching. One way to test this is to make the matrices larger, so that they no longer fit in the cache.\n",
"\n",
"Once main memory is the bottleneck, the measurement is quite close to my estimate and shows that I have to increase the number of operations per byte loaded from main memory. Improving vectorization or using multiple threads wouldn't help.\n",
"\n",
"To improve cache utilization, we have to change the algorithm. One way to improve the performance of the matrix multiplication is blocking (aka tiling). This is done, e.g., in OpenBLAS and Intel's Math Kernel Library (MKL)."