diff --git a/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-solution.ipynb b/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-solution.ipynb
index cb7fcfd085171153092145dd537e0927063a8ac6..5208f9bd748fa633251b20a7cfba86bcd94f7acf 100644
--- a/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-solution.ipynb
+++ b/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-solution.ipynb
@@ -1 +1,2416 @@
-{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Hands-On Performance Optimization\n", "_Supercomputing 2018 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 12th 2018_\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["As with the first task of this tutorial, this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done using an SSH connection to Ascent (or any other POWER9 computer) in your terminal.\n", "\n", "## Jupyter notebook execution\n", "\n", "When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizing the code, the output will be replaced. You can, however, duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n", "\n", "You will always find links to a file browser of the corresponding task subdirectory, direct links to the source files you will need to edit, and the profiling output you need to open locally.\n", "\n", "If you want, you can also get a [terminal](/terminals/1) in your browser.\n", "\n", "## Terminal fallback\n", "\n", "The tasks are placed in directories named `Task[1-3]`.\n", "\n", "Makefile targets cover everything from compiling to running and profiling. Please take a look at the cells containing the make calls as a guide for the non-interactive version of this description as well."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Setup\n", "\n", "This hands-on session requires GCC 6.4.0. By loading the `sc18/handson2` module before invoking this Notebook, we took care of also loading GCC 6.4.0 into the environment."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Tasks<a name=\"top\"></a>\n", "\n", "This session comes with multiple tasks, each one to be found in the respective sub-directory `Task[1-3]`. 
In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n", "\n", "Please choose from the tasks below.\n", "\n", "\n", "* [Task 1](#task1): Compile Flags  \n", "Improve performance of the CPU Jacobi solver with compiler flags such as `-Ofast` and profile-directed feedback ([Solution 1](#solution0))\n", "\n", "* [Task 2](#task2): Software Prefetching  \n", "Improve performance of the CPU Jacobi solver with software prefetching ([Solution 2](#solution1))\n", "\n", "* [Task 3](#task3): OpenMP  \n", "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance ([Solution 3](#solution2))\n", "  \n", "* [Survey](#survey) Please remember to take the survey!\n", "    \n", "### Make Targets <a name=\"make\"></a>\n", "\n", "For all tasks we have defined the following make targets. \n", "\n", "* __poisson2d__:  \n", "  build `poisson2d` binary (default)\n", "* __run__:  \n", "   run `poisson2d` with default parameters\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Task 1: Compile Flags <a name=\"task1\"></a>\n", "\n", "\n", "### Overview\n", "\n", "The goal of this task is to understand the different options available to optimize the performance of the CPU Jacobi solver.\n", "\n", "Your task is to:\n", "\n", "* Optimize performance with the `-Ofast` flag\n", "* Optimize performance with profile-directed feedback \n", "\n", "First, change the working directory to `Task1`."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task1\n"]}], "source": ["%cd Task1"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part A: `-Ofast` vs. 
`-O3`\n", "\n", "We will compare the performance of the binary compiled with `-Ofast` optimization against the one compiled with `-O3` optimization. Right now, the Makefile specifies `-O3` as the optimization flag. Compile the code using `make` and run it with `make run` in the next two cells."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"]}], "source": ["!make"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n", "Job <5033> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "1.13user 0.00system 0:01.15elapsed 97%CPU (0avgtext+0avgdata 10944maxresident)k\n", "2560inputs+0outputs (1major+264minor)pagefaults 0swaps\n"]}], "source": ["!make run"]}, {"cell_type": "markdown", "metadata": {}, "source": ["You can profile the application with the Linux _perf_ tool via the `perf` command (see below) to see the most time-consuming functions."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Jacobi relaxation calculation: max 500 
iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "[ perf record: Woken up 1 times to write data ]\n", "[ perf record: Captured and wrote 0.172 MB perf.O3.data (4125 samples) ]\n", "# To display the perf.data header info, please use --header/--header-only options.\n", "#\n", "#\n", "# Total Lost Samples: 0\n", "#\n", "# Samples: 4K of event 'cycles:u'\n", "# Event count (approx.): 3867635297\n", "#\n", "# Overhead  Command    Shared Object      Symbol                                  \n", "# ........  .........  .................  ........................................\n", "#\n", "    72.02%  poisson2d  poisson2d          [.] 00000040.plt_call.fmax@@GLIBC_2.17\n", "    10.16%  poisson2d  poisson2d          [.] poisson2d_reference\n", "     9.99%  poisson2d  poisson2d          [.] main\n", "     4.69%  poisson2d  libc-2.17.so       [.] __memcpy_power7\n", "     2.23%  poisson2d  libm-2.17.so       [.] __fmaxf\n", "     0.75%  poisson2d  libm-2.17.so       [.] __exp_finite\n", "     0.07%  poisson2d  poisson2d          [.] 00000040.plt_call.memcpy@@GLIBC_2.17\n", "     0.02%  poisson2d  poisson2d          [.] check_results\n", "     0.02%  poisson2d  libm-2.17.so       [.] __GI___exp\n", "     0.01%  poisson2d  ld-2.17.so         [.] _dl_relocate_object\n", "     0.01%  poisson2d  [kernel.kallsyms]  [k] arch_local_irq_restore\n", "     0.00%  poisson2d  ld-2.17.so         [.] _dl_new_object\n", "     0.00%  poisson2d  ld-2.17.so         [.] 
_start\n", "\n", "\n", "#\n", "# (Tip: Show user configuration overrides: perf config --user --list)\n", "#\n"]}], "source": ["# perf record creates a perf.data file \n", "!perf record -o perf.O3.data -e cycles ./poisson2d\n", "# perf report opens the perf.data file \n", "!perf report -i perf.O3.data | cat"]}, {"cell_type": "markdown", "metadata": {}, "source": ["**TASK**: Now change the optimization flag in the [Makefile](/edit/Task1/Makefile) to `-Ofast` and repeat the steps in the following cell. In case you follow along non-interactive, call `make` and `make run` in your shell. (If you are in the Jupyter Notebook, you can actually click the link of the [Makefile](/edit/Task1/Makefile). In other cases, use `vim` which is installed on Ascent.)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"]}], "source": ["!make"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n", "Job <5034> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.51user 0.00system 0:00.52elapsed 99%CPU (0avgtext+0avgdata 
10816maxresident)k\n", "256inputs+0outputs (0major+264minor)pagefaults 0swaps\n"]}], "source": ["!make run"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "[ perf record: Woken up 1 times to write data ]\n", "[ perf record: Captured and wrote 0.086 MB perf.Ofast.data (1889 samples) ]\n", "# To display the perf.data header info, please use --header/--header-only options.\n", "#\n", "#\n", "# Total Lost Samples: 0\n", "#\n", "# Samples: 1K of event 'cycles:u'\n", "# Event count (approx.): 1765737747\n", "#\n", "# Overhead  Command    Shared Object  Symbol                 \n", "# ........  .........  .............  .......................\n", "#\n", "    44.65%  poisson2d  poisson2d      [.] main\n", "    43.84%  poisson2d  poisson2d      [.] poisson2d_reference\n", "    10.28%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n", "     1.12%  poisson2d  libm-2.17.so   [.] __exp_finite\n", "     0.05%  poisson2d  poisson2d      [.] check_results\n", "     0.03%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n", "     0.02%  poisson2d  libc-2.17.so   [.] __readdir64\n", "     0.01%  poisson2d  ld-2.17.so     [.] _dl_new_object\n", "     0.00%  poisson2d  ld-2.17.so     [.] 
_start\n", "\n", "\n", "#\n", "# (Tip: System-wide collection from all CPUs: perf record -a)\n", "#\n"]}], "source": ["# perf record creates a perf.data file \n", "!perf record -o perf.Ofast.data -e cycles ./poisson2d\n", "# perf report opens the perf.data file \n", "!perf report -i perf.Ofast.data | cat"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If `perf` is unavailable to you on other machines, you can also study the disassembly with `objdump`: `objdump -lSd ./poisson2d > poisson2d.dis` (feel free to experiment with this in the Notebook as well; just prefix the command with a `!` to execute it)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["####  Interpretation\n", "\n", "If the application does not require high precision of results, you can compile it with `-Ofast`, which enables the `-ffast-math` option. With this option, the compiler may implement math functions in a relaxed manner, much like ordinary mathematical expressions, and so avoid the overhead of calling a function from the math library. Comparing the files, you will see that the `-Ofast` binary implements the `fmax` function natively, using instructions available in the hardware. The `-O3` binary makes a library call to compute `fmax` in order to follow the stricter _IEEE_ requirements."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part B: Profile-directed Feedback\n", "\n", "For the first level of optimization, we saw `-Ofast` cut the execution time of the `-O3` binary roughly in half.\n", "\n", "We can optimize the performance further by using profile-directed feedback optimization.\n", "\n", "To compile using profile-directed feedback with the GCC compiler, we need the following steps:\n", "\n", "1. Build a training binary using `-fprofile-generate`; this instructs the compiler to record hot path information \n", "2. 
Run the training binary with a smaller input size; you should see a `.gcda` file generated which stores hot path information for further optimization by the compiler \n", "3. build the final binary using `-fprofile-use` which uses the profile information in the `.gcda` file \n", "4. Compare the performance of the final binary with the original `Ofast` binary \n", "\n", "**TASK**: First, search for `TODO1` in the [Makefile](/edit/Task1/Makefile). It defines an additional compilation flag for `gcc`. Insert `-fprofile-generate=FOLDER` there with FOLDER pointing to `$$SC18_DIR_SCRATCH`, your personal write-directory (the double dollar signs are intentional as they are used to escape in the GNU Make syntax).\n", "\n", "After editing, run the following two cells to train your program."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-generate=$SC18_DIR_SCRATCH\" -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-generate=$SC18_DIR_SCRATCH\" -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_train  -lm\n"]}], "source": ["!make poisson2d_train"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_train 200 64 64\n", "Job <5035> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 200 iterations on 64 x 64 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.248743\n", "  100, 0.124046\n", "Calculate current execution.\n", "    0, 0.248743\n", "  100, 0.124046\n", "0.00user 0.00system 0:00.10elapsed 
5%CPU (0avgtext+0avgdata 5248maxresident)k\n", "512inputs+0outputs (0major+115minor)pagefaults 0swaps\n", "mv $SC18_DIR_SCRATCH/*.gcda .\n"]}], "source": ["!make run_train"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, a `.gcda` file exists in the directory, which can be used for a subsequent profile-accelerated build.\n", "\n", "**TASK**: Edit the [Makefile](/edit/Task1/Makefile) again, this time modifying `TODO2` to be equivalent to `-fprofile-use`. A directory is not needed, as the `.gcda` file was moved into the current directory.\n", "\n", "Run the following cells to build using the newly added flag and then run the profile-accelerated version."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-use\" -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-use\" -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_profile  -lm\n"]}], "source": ["!make poisson2d_profile"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_profile\n", "Job <5036> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.47user 0.00system 0:00.48elapsed 98%CPU (0avgtext+0avgdata 
10816maxresident)k\n", "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"]}], "source": ["!make run_profile"]}, {"cell_type": "markdown", "metadata": {}, "source": ["What is your speed-up? Feel free to run with larger problem sizes (mesh, iterations)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### References\n", "\n", "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n", "2. https://perf.wiki.kernel.org/index.php/Tutorial"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Task 2:<a name=\"task2\"></a> Software Prefetching\n", "\n", "\n", "### Overview\n", "\n", "Study how program execution time differs across optimization levels, with and without software prefetching.\n", "\n", "First, change directory to that of Task 2."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task2\n"]}], "source": ["%cd ../Task2"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part A: Running"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Look at the [Makefile](/edit/Task2/Makefile) and work on the TODOs. Please implement the compile flags as indicated by the Makefile target names.\n", "\n", "Afterwards, compile each target with the following cells and submit them to the batch system. 
Follow along accordingly in the non-interactive version of this Notebook."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_pref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_pref\n", "Job <5037> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "1.12user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10880maxresident)k\n", "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"]}], "source": ["!make run_o3_pref"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_pref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_pref\n", "Job <5038> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU 
execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.77user 0.00system 0:00.77elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k\n", "256inputs+0outputs (0major+264minor)pagefaults 0swaps\n"]}], "source": ["!make run_ofast_pref"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_nopref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_nopref\n", "Job <5039> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "1.13user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10944maxresident)k\n", "256inputs+0outputs (0major+266minor)pagefaults 0swaps\n"]}], "source": ["!make run_o3_nopref"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_nopref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_nopref\n", "Job <5040> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", 
"<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.82user 0.00system 0:00.82elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k\n", "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"]}], "source": ["!make run_ofast_nopref"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Do you notice the impact difference with optimization levels? It's always important to carefully study the interplay of flags."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part B: Analysis of Instructions\n", "\n", "Compilation with the software prefetching flag causes the compiler to generate the `__dcbt` and `__dcbtst`  instructions that prefetch memory values to L3.\n", "\n", "Verify it using `objdump -lSd` on each file (`poisson2d_o3_pref`, `poisson2d_ofast_pref`, `poisson2d_o3_nopref`, `poisson2d_ofast_nopref`). 
You might want to grep for `dcb`."]}, {"cell_type": "code", "execution_count": 19, "metadata": {"exercise": "solution"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["poisson2d_o3_pref:\n", "poisson2d_ofast_pref:\n", "    10000da0:\tec f1 00 7c \tdcbtst  0,r30\n", "    10000da4:\t2c fa 00 7c \tdcbt    0,r31\n", "    10000da8:\t2c 62 00 7c \tdcbt    0,r12\n", "    10000dac:\t2c b2 00 7c \tdcbt    0,r22\n", "    10000dcc:\t2c e2 00 7c \tdcbt    0,r28\n", "    10000dd0:\t2c ea 00 7c \tdcbt    0,r29\n", "    100010b4:\t2c 62 00 7c \tdcbt    0,r12\n", "    100010b8:\t2c 5a 00 7c \tdcbt    0,r11\n", "    100010c4:\tec 19 00 7c \tdcbtst  0,r3\n", "    100010cc:\t2c 22 00 7c \tdcbt    0,r4\n", "    100010d0:\t2c ea 00 7c \tdcbt    0,r29\n", "    100010d4:\t2c f2 00 7c \tdcbt    0,r30\n", "    100010dc:\t2c fa 00 7c \tdcbt    0,r31\n", "poisson2d_o3_nopref:\n", "poisson2d_ofast_nopref:\n"]}], "source": ["for f in [\"poisson2d_o3_pref\", \"poisson2d_ofast_pref\", \"poisson2d_o3_nopref\", \"poisson2d_ofast_nopref\"]:\n", "    print(\"{}:\".format(f))\n", "    !objdump -lSd $f | grep dcb"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you feel up to the task, you can study the number of L3 cache misses using the corresponding performance counter, `PM_L3_MISS`. 
Either use your knowledge from Hands-On 1, or use the following call to `perf`, in which we already converted the named counter to a raw counter address."]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Job <5048> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "\n", " Performance counter stats for './poisson2d_ofast_nopref':\n", "\n", "        2829292169      cycles:u                                                    \n", "         136018637      r168a4:u                                                    \n", "\n", "       0.826136863 seconds time elapsed\n", "\n", "Job <5049> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "\n", " Performance counter stats for './poisson2d_ofast_pref':\n", "\n", "        2654990243      cycles:u                                                    \n", "         128824827      r168a4:u                                                    \n", "\n", "       0.775593651 seconds time elapsed\n", "\n"]}], "source": ["for f in [\"poisson2d_ofast_nopref\", \"poisson2d_ofast_pref\"]:\n", 
"    !eval $$SC18_SUBMIT_CMD perf stat -e cycles,r168a4 ./$f\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### References\n", "\n", "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n", "2. https://www.gnu.org/software/gcc/projects/prefetch.html"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Task 3: OpenMP\n", "<a name=\"task3\"></a>\n", "\n", "\n", "### Overview\n", "\n", "We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-thread processes to certain cores.\n", "\n", "First, we need to change directory to that of Task3."]}, {"cell_type": "code", "execution_count": 1, "metadata": {"ExecuteTime": {"end_time": "2018-11-07T13:47:57.724441Z", "start_time": "2018-11-07T13:47:57.718745Z"}}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task3\n"]}], "source": ["%cd ../Task3"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part A: Implement OpenMP Pragmas; Compilation\n", "\n", "**Task**: Please add the correct OpenMP pragmas to the source code and compilations flags to enable OpenMP.\n", "\n", "* **pragmas**: Look at the TODOs in [`poisson2d.c`](/edit/Task3/poisson2d.c) to add OpenMP parallelism. The pragmas in question are `#pragma  omp parallel for`\n", "* **Compilation**: Please add compilation flags enabling OpenMP in GCC to the [Makefile](/edit/Task3/Makefile). The flag in question is `-fopenmp`.\n", "\n", "Edit the files with the links above if you are running the interactive version of the Notebook or navigate to `poisson2d.c` and `Makefile` yourself in case you run the non-interactive version.\n", "\n", "Afterwards, compile and run the application with the following cells. 
Non-interactive: Follow along accordingly in the shell."]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"]}], "source": ["!make poisson2d"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n", "Job <5052> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "500x500: Ref:   0.2571 s, This:   0.2946 s, speedup:     0.87\n", "1.48user 0.00system 0:00.56elapsed 263%CPU (0avgtext+0avgdata 9664maxresident)k\n", "0inputs+0outputs (0major+273minor)pagefaults 0swaps\n"]}], "source": ["!make run"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The command to submit a job to the batch system is prepared in an environment variable `$SC18_SUBMIT_CMD`; use it together with `eval`. 
In the following cell, it is shown how to increase the work of the application."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Job <5344> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 1000 iterations on 1000 x 100 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249743\n", "  100, 0.210080\n", "  200, 0.184635\n", "  300, 0.166526\n", "  400, 0.152783\n", "  500, 0.141890\n", "  600, 0.132978\n", "  700, 0.125511\n", "  800, 0.119142\n", "  900, 0.113632\n", "Calculate current execution.\n", "    0, 0.249743\n", "  100, 0.210080\n", "  200, 0.184635\n", "  300, 0.166526\n", "  400, 0.152783\n", "  500, 0.141890\n", "  600, 0.132978\n", "  700, 0.125511\n", "  800, 0.119142\n", "  900, 0.113632\n", "1000x100: Ref:   1.9872 s, This:   0.2385 s, speedup:     8.33\n"]}], "source": ["!eval $SC18_SUBMIT_CMD ./poisson2d 1000 1000"]}, {"cell_type": "markdown", "metadata": {}, "source": ["What is the best performance you can reach by setting the number of threads via `OMP_NUM_THREADS=N` with `N` being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.  \n", "We added `--bind none` to prevent `jsrun`, the scheduler of Ascent, from overlaying binding options. 
Also, we use `-c ALL_CPUS` to make all CPUs on the compute nodes available to you."]}, {"cell_type": "code", "execution_count": 23, "metadata": {"exercise": "solution"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Threads: 1\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3037 s, This:   2.8420 s, speedup:     0.81\n", "Threads: 2\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.2998 s, This:   1.4320 s, speedup:     1.61\n", "Threads: 4\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3135 s, This:   0.7168 s, speedup:     3.23\n", "Threads: 8\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3145 s, This:   0.5278 s, speedup:     4.39\n", "Threads: 10\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3153 s, This:   0.4848 s, speedup:     4.78\n", "Threads: 20\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3190 s, This:   0.2016 s, speedup:    11.50\n", "Threads: 40\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3243 s, This:   0.3057 s, speedup:     7.60\n"]}], "source": ["for omp_num in [1, 2, 4, 8, 10, 20, 40]:\n", "    print(\"Threads: {}\".format(omp_num))\n", "    !eval OMP_NUM_THREADS=$omp_num $$SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 | grep speedup"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part B: Bindings\n", "\n", "Different CPU architectures and models come with different configuration of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!\n", "\n", "There are applications which can be used to determine the configuration of the processor. Among those are:\n", "\n", "* `lscpu`: Can be used to determine the number of sockets, number of cores, and numb of threads. 
It gives a very good overview and is available on most Linux systems.\n", "* `ppc64_cpu --smt`: Specifically for POWER, this tool can give information about the number of simulations threads running per core (*SMT*, Simulataion Multi-Threading).\n", "\n", "Run `ppc64_cpu --smt` to find out about the threading configuration of Ascent!"]}, {"cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Job <5076> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "SMT=4\n"]}], "source": ["!eval $SC18_SUBMIT_CMD ppc64_cpu --smt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are more sources information available\n", "\n", "* `/proc/cpuinfo`: Holds information about virtual cores, including model and clock speed. Available on most Linux system. Usually used together with `cat`\n", "* `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`: Holds information about thread siblings for given CPU core (`cpu0` in this case). Use it to find out which thread is mapped to which core."]}, {"cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0-3\n", "4-7\n"]}], "source": ["!cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n", "!cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are various environment variables available within OpenMP (and GCC) to specify binding of threads to cores. See, for instance, the [online documentation of GCC libgomp](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html). Examples are `OMP_PLACES` or `GOMP_CPU_AFFINITY`.\n", "\n", "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. 
Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n", "\n", "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n", "\n", "What's your maximum speedup?"]}, {"cell_type": "code", "execution_count": 24, "metadata": {"exercise": "solution"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Affinity: 0,1,2,3\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "  OMP_PLACES = '{0},{1},{2},{3}'\n", "1000x100: Ref:   1.9854 s, This:   0.2326 s, speedup:     8.53\n", "Affinity: 0,5,9,13\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "  OMP_PLACES = '{0},{5},{9},{13}'\n", "1000x100: Ref:   1.9828 s, This:   0.0833 s, speedup:    23.80\n"]}], "source": ["for affinity in [\"0,1,2,3\", \"0,5,9,13\"]:\n", "    print(\"Affinity: {}\".format(affinity))\n", "    !eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=$affinity OMP_NUM_THREADS=4 $$SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### References\n", "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["# Survey<a name=\"survey\"></a>\n", "\n", "Please rememeber to take some time and fill out the [survey](http://bit.ly/sc18-eval)."]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7"}}, "nbformat": 4, "nbformat_minor": 2}
\ No newline at end of file
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Hands-On Performance Optimization\n",
+    "_Supercomputing 2019 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 18th 2019_\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Like the first task of this tutorial, this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done using an SSH connection to Ascent (or any other POWER9 computer) in your terminal.\n",
+    "\n",
+    "## Jupyter notebook execution\n",
+    "\n",
+    "When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizing the code, the output will be replaced. You can, however, duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n",
+    "\n",
+    "You will always find links to a file browser of the corresponding task subdirectory, direct links to the source files you will need to edit, and the profiling output you need to open locally.\n",
+    "\n",
+    "If you want, you can also get a terminal in your browser; just open it via the \u00bbNew Launcher\u00ab button (`+`).\n",
+    "\n",
+    "## Terminal fallback\n",
+    "\n",
+    "The tasks are placed in directories named `Task[1-3]`.\n",
+    "\n",
+    "Makefile targets cover everything from compiling to running and profiling. The cells containing the `make` calls also serve as a guide for the non-interactive version of this description."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "We use some very recent compiler features and therefore need GCC 9.2.0. It should already be in your environment. Let's check!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc (GCC) 9.2.0\n",
+      "Copyright (C) 2019 Free Software Foundation, Inc.\n",
+      "This is free software; see the source for copying conditions.  There is NO\n",
+      "warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!gcc --version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Tasks<a name=\"top\"></a>\n",
+    "\n",
+    "This session comes with multiple tasks, each one to be found in the respective sub-directory `Task[1-3]`. In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n",
+    "\n",
+    "Please choose from the tasks below.\n",
+    "\n",
+    "\n",
+    "* [Task 1](#task1): __Basic compiler optimization flags and compiler annotations__\n",
+    "\n",
+    "Improve performance of the CPU Jacobi solver with compiler flags such as `-Ofast` and profile-directed feedback. Learn about compiler annotations.\n",
+    "\n",
+    "* [Task 2](#task2): __Optimization via Prefetching controlled by compiler__\n",
+    "\n",
+    "Improve performance of the CPU Jacobi solver with software prefetching. Some compilers such as IBM XL define flags that can be used to modify the aggressiveness of the hardware prefetcher. Learn to modify the DSCR value through XL and study the impact on application performance. \n",
+    "* [Task 3](#task3): __Optimization via OpenMP controlled by compiler and the system__\n",
+    "\n",
+    "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance. \n",
+    "  \n",
+    "* [Survey](#survey): Please remember to take the survey!\n",
+    "    \n",
+    "### Make Targets <a name=\"make\"></a>\n",
+    "\n",
+    "For all tasks we have defined the following make targets. \n",
+    "\n",
+    "* __poisson2d__:  \n",
+    "  build `poisson2d` binary (default)\n",
+    "* __run__:  \n",
+    "   run `poisson2d` with default parameters\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 1: Basic compiler optimization flags and compiler annotations <a name=\"task1\"></a>\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "The goal of this task is to understand the different options available to optimize the performance of the CPU Jacobi solver.\n",
+    "\n",
+    "Your task is to:\n",
+    "\n",
+    "* Optimize performance with the `-Ofast` flag\n",
+    "* Verify the cause of the performance improvement by viewing perf profiles of the `-O3` and `-Ofast` binaries\n",
+    "* Optimize performance with profile-directed feedback\n",
+    "* Generate compiler annotations/remarks to understand the optimizations done by the compiler with and without profile-directed feedback\n",
+    "\n",
+    "\n",
+    "First, change the working directory to `Task1`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task1\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd Task1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: `-Ofast` vs. `-O3`\n",
+    "\n",
+    "We compare the performance of the binary compiled with `-Ofast` optimization against the one compiled with `-O3` optimization. As in the previous task, we use a `Makefile` for compilation. The `Makefile` targets `poisson2d_O3` and `poisson2d_Ofast` are already prepared.\n",
+    "\n",
+    "**TASK**: Add `-O3` as the optimization flag for the `poisson2d_O3` target by using the corresponding `CFLAGS` definition. Notes relating to Task 1 are in the header of the `Makefile`. Compile the code using `make` as indicated below and run it with the `make` targets `run`, `run_perf` and `run_perf_recrep`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 84,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_O3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 73,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24897> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's have a look at the output of the `Makefile` target `run_perf`. It invokes the Linux _perf_ tool to print the number of instructions executed and the number of cycles taken by the POWER9 processor to execute the program. Feel free to add further counters to this call to _perf_."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 74,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24898> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16264721613      cycles:u                                                    \n",
+      "       28463907825      instructions:u            #    1.75  insn per cycle                                            \n",
+      "\n",
+      "       4.738444892 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we run the `Makefile` target `run_perf_recrep`, which prints the hottest routines of the application by using a combination of `perf record ./app` and `perf report`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 75,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24899> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "[ perf record: Woken up 3 times to write data ]\n",
+      "[ perf record: Captured and wrote 0.739 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (19102 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24900> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "# To display the perf.data header info, please use --header/--header-only options.\n",
+      "#\n",
+      "#\n",
+      "# Total Lost Samples: 0\n",
+      "#\n",
+      "# Samples: 19K of event 'cycles:u'\n",
+      "# Event count (approx.): 16254596654\n",
+      "#\n",
+      "# Overhead  Command    Shared Object  Symbol                                  \n",
+      "# ........  .........  .............  ........................................\n",
+      "#\n",
+      "    65.50%  poisson2d  poisson2d      [.] 00000038.plt_call.fmax@@GLIBC_2.17\n",
+      "    21.21%  poisson2d  poisson2d      [.] main\n",
+      "     9.18%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "     3.28%  poisson2d  libm-2.17.so   [.] __fmaxf\n",
+      "     0.74%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.01%  poisson2d  libm-2.17.so   [.] __GI___exp\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _wordcopy_fwd_aligned\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
+      "\n",
+      "\n",
+      "#\n",
+      "# (Tip: Limit to show entries above 5% only: perf report --percent-limit 5)\n",
+      "#\n"
+     ]
+    }
+   ],
+   "source": [
+    "# run_perf_recrep displays the top hot routines \n",
+    "!make run_perf_recrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Now add the optimization flag `-Ofast` to the `CFLAGS` for target `poisson2d_Ofast`. Compile the program with the target `poisson2d_Ofast`, then run and analyse it as before with `run`, `run_perf` and `run_perf_recrep`.\n",
+    "\n",
+    "What difference do you see?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24901> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.41user 0.00system 0:02.41elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_Ofast \n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Again, run a `perf`-instrumented version:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24902> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        8258991976      cycles:u                                                    \n",
+      "       12013091172      instructions:u            #    1.45  insn per cycle                                            \n",
+      "\n",
+      "       2.408703909 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Generate the list of top routines in terms of hotness:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 78,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24903> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "[ perf record: Woken up 2 times to write data ]\n",
+      "[ perf record: Captured and wrote 0.382 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (9728 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24904> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "# To display the perf.data header info, please use --header/--header-only options.\n",
+      "#\n",
+      "#\n",
+      "# Total Lost Samples: 0\n",
+      "#\n",
+      "# Samples: 9K of event 'cycles:u'\n",
+      "# Event count (approx.): 8268811890\n",
+      "#\n",
+      "# Overhead  Command    Shared Object  Symbol                                  \n",
+      "# ........  .........  .............  ........................................\n",
+      "#\n",
+      "    81.12%  poisson2d  poisson2d      [.] main\n",
+      "    17.97%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "     0.79%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.02%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
+      "     0.01%  poisson2d  libc-2.17.so   [.] vfprintf@@GLIBC_2.17\n",
+      "     0.01%  poisson2d  libc-2.17.so   [.] _dl_addr\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] open_path\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] init_tls\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
+      "\n",
+      "\n",
+      "#\n",
+      "# (Tip: For tracepoint events, try: perf report -s trace_fields)\n",
+      "#\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf_recrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If `perf` is unavailable to you on other machines, you can also study the disassembly with `objdump`: `objdump -lSd ./poisson2d > poisson2d.dis`. Feel free to experiment with this in the Notebook as well; just prefix the command with a `!` to execute it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "####  Interpretation\n",
+    "\n",
+    "If high precision of the results is not mandatory for your application, you can compile with `-Ofast`, which enables the `-ffast-math` option. This option relaxes the strict floating-point rules, so math functions can be implemented inline, much like ordinary arithmetic expressions, which avoids the overhead of calling a function from the math library. Comparing the two profiles, you will see that the `-Ofast` binary implements the `fmax` function natively using instructions available in the hardware; the `plt_call.fmax` entry no longer appears in the report. The `-O3` binary makes a library call to compute `fmax` in order to follow the stricter _IEEE_ requirements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Profile-directed Feedback\n",
+    "\n",
+    "In the first level of optimization, we saw that `-Ofast` cut the execution time of the `-O3` binary almost in half (from 4.73 s to 2.41 s in our runs).\n",
+    "\n",
+    "We can optimize the performance further by using profile-directed feedback optimization.\n",
+    "\n",
+    "To compile using profile-directed feedback with the GCC compiler, we need to build the application in three stages:\n",
+    "\n",
+    "1. Instrument binary;\n",
+    "2. Run binary with training, gather profile information;\n",
+    "3. Use profile information to generate optimized binary.\n",
+    "\n",
+    "\n",
+    "Step 1 is achieved by compiling the binary with the correct flag \u2013\u00a0`-fprofile-generate`. In our case, we need to specify an output location, which should be `$(SC19_DIR_SCRATCH)`.\n",
+    "\n",
+    "Step 2 consists of a usual, albeit shorter, run of the instrumented binary. The run can be very short, though the parameters need to be representative of the actual run. After the binary has run, an output file (with file extension `.gcda`) is written to the directory specified during compilation.\n",
+    "\n",
+    "For Step 3, the binary is compiled once again, but this time using the `gcda` profile just generated. The corresponding flag is `-fprofile-use`, which we set to `$(SC19_DIR_SCRATCH)` as well.\n",
+    "\n",
+    "In our `Makefile`, we have already prepared the steps for you in the form of two targets.\n",
+    "\n",
+    "* `poisson2d_train`: Compiles the instrumented binary (with `-fprofile-generate`) that is used to gather the profile\n",
+    "* `poisson2d_ref`: Takes the generated profile and compiles a new, optimized binary (with `-fprofile-use`)\n",
+    "\n",
+    "Through `Makefile` dependencies, a profile run is launched between these two targets.\n",
+    "\n",
+    "**TASK**: Edit the [Makefile](`Makefile`) and add the `-fprofile-*` flags to the `CFLAGS` of `poisson2d_train` and\n",
+    "`poisson2d_ref` as outlined in the file.\n",
+    "\n",
+    "After that, you may launch them with the following cells (`gen_profile` is a meta-target and uses `poisson2d_train` and `poisson2d_ref`). If you need to clean the generated profile, you may use `make clean_profile`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
+      "Job <24905> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249490\n",
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_ref -lm \n",
+      "cp poisson2d_ref poisson2d\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make gen_profile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 80,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24906> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.28user 0.01system 0:02.30elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Great! It is! In our tests, this shaved off another 5%."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's also measure instructions and cycles:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 81,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24907> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        7925983538      cycles:u                                                    \n",
+      "       12253080719      instructions:u            #    1.55  insn per cycle                                            \n",
+      "\n",
+      "       2.313471365 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is your speed-up? Feel free to run with larger problem sizes (mesh size, number of iterations)."
+   ]
+  },
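+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal Python sketch, not part of the tutorial sources, for turning the counters above into instructions per cycle and a speed-up factor. The helper names `insn_per_cycle` and `speedup` are illustrative assumptions; the counter values are copied from the `perf stat` output above.\n",
+    "\n",
+    "```python\n",
+    "def insn_per_cycle(instructions, cycles):\n",
+    "    '''Instructions retired per cycle (IPC), as perf stat reports it.'''\n",
+    "    return instructions / cycles\n",
+    "\n",
+    "\n",
+    "def speedup(baseline_seconds, optimized_seconds):\n",
+    "    '''Speed-up factor; values above 1 mean the optimized run is faster.'''\n",
+    "    return baseline_seconds / optimized_seconds\n",
+    "\n",
+    "\n",
+    "# Counter values copied from the perf stat output above.\n",
+    "cycles = 7925983538\n",
+    "instructions = 12253080719\n",
+    "print(f'IPC: {insn_per_cycle(instructions, cycles):.2f}')\n",
+    "```\n",
+    "\n",
+    "With these counters the sketch prints `IPC: 1.55`, matching the value `perf stat` reports. For the speed-up, pass the elapsed times of your own baseline and optimized runs to `speedup`."
+   ]
+  },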
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Compiler annotations/Remarks\n",
+    "\n",
+    "Most compilers provide an option to emit annotations or remarks. These remarks summarize which optimizations were performed and where in the source they were applied. Some options also report optimizations that were missed and the reason why they could not be performed.\n",
+    "\n",
+    "To generate compiler annotations with GCC, use `-fopt-info-all`. If you only want to see the missed optimizations, use `-fopt-info-missed` instead of `-fopt-info-all`. See also the [GCC documentation of the flag](https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info).\n",
+    "\n",
+    "**TASK**: Have a look at the `CFLAGS` of the `Makefile` target `poisson2d_Ofast_info`. Add the flag `-fopt-info-all` to the list of flags. This prints optimisation information to the terminal (GCC writes it to stderr). If you would rather write this information to a file, use, for example, `-fopt-info-all=(SC19_DIR_SCRATCH)/filename`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 82,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all poisson2d.c poisson2d_reference.o -o poisson2d_Ofast_info  -lm\n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 207->207 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_Ofast_info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's compare this with the output during compilation when using profile-directed feedback from Task 1 B.\n",
+    "\n",
+    "**TASK**: \n",
+    "Adapt the `CFLAGS` of `poisson2d_ref_info` to include `-fopt-info-all` **and** the profile input of `-fprofile-use=\u2026` here. *(Be advised: Long output!)*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 83,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "Increasing alignment of decl: __gcov0.main\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_D_00100_1_main/48 -> __gcov_exit/55, function body not available\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_I_00100_0_main/47 -> __gcov_init/54, function body not available\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 295->295 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:122:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:88:5: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:72:5: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_337, ny_124, nx_286);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_316, error_118);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_127);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_311);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_122);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_129 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_132 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_140 = malloc (8000000);\n",
+      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 53\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 50\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 47\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:122:9: optimized: loop unrolled 3 times (header execution count 9800)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 33\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 30\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 42\n",
+      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 40\n",
+      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 60\n",
+      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 23\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:88:5: optimized: loop unrolled 3 times (header execution count 100)\n",
+      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 12\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
+      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 16\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_init (&*.LPBX0);\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_exit ();\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
+      "Job <24908> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "libgcov profiling error:/gpfs/wolf/trn003/scratch/aherten//#autofs#nccsopen-svm1_home#aherten#SC19-Tutorial#3-Optimizing_POWER#Handson#Task1#poisson2d.gcda:overwriting an existing profile data with a different timestamp\n",
+      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249490\n",
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c poisson2d_reference.o -o poisson2d_ref_info  -lm\n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 207->207 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n",
+      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 47\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 44\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 40\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:122:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 27\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 24\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 37\n",
+      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 35\n",
+      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 51\n",
+      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 18\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:88:5: optimized: loop unrolled 7 times (header execution count 99)\n",
+      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 9\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
+      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 14\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_ref_info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Comparing the annotations generated with plain `-Ofast` to those generated with `-Ofast` and profile-directed feedback, we observe that the profile information enables additional optimizations, such as loop unrolling.\n",
+    "\n",
+    "For instance, you will see annotations such as\n",
+    "```\n",
+    "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+    "```\n",
+    "\n",
+    "The header execution count is the number of times the loop header was executed during the profiling run. This information identifies the hotter paths and thereby facilitates additional optimizations."
+   ]
+  },
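+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To compare two long annotation logs, it can help to tally the remarks by kind. The following is a minimal Python sketch, not part of the tutorial sources, under the assumption that each remark has the form `file:line:col: kind: message`, as in the output above. The names `tally_remarks` and `sample` are hypothetical; the sample lines are copied from that output.\n",
+    "\n",
+    "```python\n",
+    "from collections import Counter\n",
+    "\n",
+    "# Remark kinds as they appear in the -fopt-info output above.\n",
+    "KINDS = ('optimized', 'missed', 'note')\n",
+    "\n",
+    "\n",
+    "def tally_remarks(lines):\n",
+    "    '''Count remarks per kind; lines without a known kind count as other.'''\n",
+    "    counts = Counter()\n",
+    "    for line in lines:\n",
+    "        # Split into location, kind, and message at the first two colon-space separators.\n",
+    "        fields = line.strip().split(': ', 2)\n",
+    "        kind = fields[1] if len(fields) == 3 and fields[1] in KINDS else 'other'\n",
+    "        counts[kind] += 1\n",
+    "    return counts\n",
+    "\n",
+    "\n",
+    "sample = [\n",
+    "    'poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors',\n",
+    "    'poisson2d.c:112:9: missed: not vectorized: control flow in loop.',\n",
+    "    'poisson2d.c:43:5: note: vectorized 1 loops in function.',\n",
+    "    'poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)',\n",
+    "    'consider run-time aliasing test between *_84 and *_87',\n",
+    "]\n",
+    "print(tally_remarks(sample))\n",
+    "```\n",
+    "\n",
+    "Running the same tally over the saved output of both builds (for example, files written via `-fopt-info-all=filename`) gives a quick view of how the number of `optimized` remarks changes with profile information."
+   ]
+  },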
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "\n",
+    "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
+    "2. https://perf.wiki.kernel.org/index.php/Tutorial"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 2:<a name=\"task2\"></a> Impact of Prefetching on Performance\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "* Study how program execution time differs across optimization levels with and without software prefetching.\n",
+    "* Verify the impact by measuring cache counters with and without prefetching.\n",
+    "* Learn how to modify the contents of the DSCR (*Data Stream Control Register*) using the IBM XL compiler and study the impact of different DSCR values.\n",
+    "\n",
+    "But first, let's change directory to that of Task 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task2\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd ../Task2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: Software Prefetching"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Look at the Makefile and work on the TODOs. \n",
+    "\n",
+    "- First generate a `-Ofast`-optimised binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!\n",
+    "- Modify the `Makefile` to add the option for software prefetching (`-fprefetch-loop-arrays`). Compare the performance of `-Ofast` with and without software prefetching."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 97,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rm -f poisson2d poisson2d*.o\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make clean"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 88,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "make: `poisson2d' is up to date.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24911> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.39user 0.01system 0:02.40elapsed 100%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24912> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        8271503902      cycles:u                                                    \n",
+      "         481152478      r168a4:u                                                    \n",
+      "\n",
+      "       2.412224884 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 98,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm\n",
+      "cp poisson2d_pref poisson2d\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24919> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1.92user 0.00system 0:01.93elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24920> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        6586609284      cycles:u                                                    \n",
+      "         459879452      r168a4:u                                                    \n",
+      "\n",
+      "       1.925399505 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_pref CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Repeat the experiment with the `-O3` flag. Have a look at the `Makefile` and the TODO outlined there. It marks a place where you can easily change `-Ofast`\u2192`-O3`!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 100,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   poisson2d.c  -o poisson2d  -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24923> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24924> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16445764669      cycles:u                                                    \n",
+      "         645094089      r168a4:u                                                    \n",
+      "\n",
+      "       4.792567763 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 101,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24925> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.74user 0.00system 0:04.74elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24926> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16239159454      cycles:u                                                    \n",
+      "         631061431      r168a4:u                                                    \n",
+      "\n",
+      "       4.730144897 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_pref CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Do you notice how the impact of software prefetching differs between optimization levels? At which optimization level does it help the most?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Observing the results, we see that software prefetching helps at `-Ofast` (about 1.9 s versus 2.4 s elapsed) but makes practically no difference at `-O3` (about 4.7 s in both cases). We can use the steps described in the next section to verify that the compiler has not inserted any software prefetch instructions at `-O3` at all. The reason is that the run time of the `-O3` binary is dominated by the call to `__fmax`, which leads the compiler to conclude that whatever benefit software prefetching would bring is overshadowed by the penalty of `fmax()`.\n",
+    "GCC may add further loop optimizations such as unrolling when `-fprefetch-loop-arrays` is specified.\n"
+   ]
+  },
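+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "To put rough numbers on this, the elapsed times and cycle counts that `perf stat` reported in the cells above can be turned into ratios (without prefetching divided by with prefetching). This is plain arithmetic on one recorded run per configuration, so it should be read as indicative only, not as a careful benchmark:\n",
+    "\n",
+    "$$\\text{-Ofast}:\\quad \\frac{2.412\\ \\text{s}}{1.925\\ \\text{s}} \\approx 1.25, \\qquad \\frac{8.272 \\times 10^{9}\\ \\text{cycles}}{6.587 \\times 10^{9}\\ \\text{cycles}} \\approx 1.26$$\n",
+    "\n",
+    "$$\\text{-O3}:\\quad \\frac{4.793\\ \\text{s}}{4.730\\ \\text{s}} \\approx 1.01, \\qquad \\frac{16.446 \\times 10^{9}\\ \\text{cycles}}{16.239 \\times 10^{9}\\ \\text{cycles}} \\approx 1.01$$\n",
+    "\n",
+    "In the same runs, the count of the raw event `r168a4` collected by `make l3missstats` drops by about 4% at `-Ofast` (481152478 to 459879452) and by about 2% at `-O3` (645094089 to 631061431)."
+   ]
+  },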
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Analysis of Instructions\n",
+    "\n",
+    "Compilation of the `-Ofast` binary with the software prefetching flag causes the compiler to generate `dcb*` instructions that prefetch memory values to L3."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: \n",
+    "Run `$(SC19_SUBMIT_CMD) objdump -lSd` on each binary file (`-O3` and `-Ofast`, each with and without prefetching).\n",
+    "Look for instructions beginning with `dcb`.\n",
+    "At what optimization levels does the compiler generate software prefetching instructions?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 114,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=gcc -B poisson2d_pref\n",
+    "!objdump -lSd ./poisson2d_pref > poisson2d.dis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 116,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "    10000b28:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000b30:\t2c ba 00 7c \tdcbt    0,r23\n",
+      "    10000b38:\t2c b2 00 7c \tdcbt    0,r22\n",
+      "    10000b50:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000b58:\tec b9 00 7c \tdcbtst  0,r23\n",
+      "    10000b80:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000e64:\t2c 92 00 7c \tdcbt    0,r18\n",
+      "    10000e68:\t2c 9a 00 7c \tdcbt    0,r19\n",
+      "    10000e6c:\t2c a2 00 7c \tdcbt    0,r20\n",
+      "    10000e70:\t2c aa 00 7c \tdcbt    0,r21\n",
+      "    10000e7c:\t2c b2 00 7c \tdcbt    0,r22\n",
+      "    10000e80:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000e94:\tec b9 00 7c \tdcbtst  0,r23\n"
+     ]
+    }
+   ],
+   "source": [
+    "!grep dcb poisson2d.dis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Changing Values of DSCR via compiler flags\n",
+    "\n",
+    "This task requires the IBM XL compiler. It should already be in your environment.\n",
+    "\n",
+    "\n",
+    "We saw the impact of software prefetching in the previous subsection. \n",
+    "In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance. \n",
+    "In this exercise we look at compiler options that modify the value of the Data Stream Control Register (DSCR), which controls the aggressiveness of hardware prefetching. It can also be used to turn off hardware prefetching. \n",
+    "\n",
+    "The IBM XL compiler has an option `-qprefetch=dscr=<val>` that can be used for this purpose.\n",
+    "Compiling with `-qprefetch=dscr=1` turns off the prefetcher. Other values, such as `-qprefetch=dscr=4` or `-qprefetch=dscr=7`, control the aggressiveness of prefetching.\n",
+    "\n",
+    "For this exercise we use `make CC=xlc_r` to illustrate the performance impact.\n",
+    "    \n",
+    "\n",
+    "**Task**: Generate an XL-compiled binary using the following cells. After you've generated a baseline, start editing the `Makefile`: Add `-qprefetch=dscr=1` to the `CFLAGS`, rebuild the application, and note the performance. Which one is faster? \n",
+    "\n",
+    "In general, applications benefit from the default setting of the hardware DSCR register (`-qprefetch=dscr=0`). However, certain applications also benefit from turning prefetching off. \n",
+    "\n",
+    "Note that the best DSCR value is highly application-specific. A value that works well for application A may not help application B. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Measure the performance of the application compiled with XL at the default DSCR value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 117,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  poisson2d.c -o poisson2d  -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24927> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "2.26user 0.00system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+477minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=xlc_r -B poisson2d\n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Measure the performance of the application compiled with XL with the hardware prefetcher turned off (`-qprefetch=dscr=1`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  -qprefetch=dscr=1 poisson2d.c -o poisson2d_dscr  -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24929> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.58user 0.00system 0:04.59elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "0inputs+0outputs (0major+476minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_dscr CC=xlc_r -B\n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Does the hardware prefetcher help this application? How much impact do you see when you turn off the hardware prefetcher? "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "The DSCR register controls the operation of the hardware prefetcher on POWER9. It can be modified on the command line with `ppc64_cpu --dscr=<value>`; however, this requires admin privileges. IBM XL offers a compiler flag to set the value at compile time instead: `-qprefetch=dscr=1` turns off the prefetcher. Observing the results, we see that the run without the hardware prefetcher takes roughly twice as long as the run with default prefetching (4.59 s versus 2.27 s elapsed). We can conclude that hardware prefetching helps the Jacobi application. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "\n",
+    "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
+    "2. https://www.gnu.org/software/gcc/projects/prefetch.html\n",
+    "3. https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 3: OpenMP\n",
+    "<a name=\"task3\"></a>\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "We add OpenMP shared-memory parallelism to the application. We also study how binding the threads of the application to certain cores affects the resulting performance. We do this study for both the GCC and XL compilers in order to learn about the appropriate options for each.\n",
+    "First, we need to change into the directory of Task 3. For Task 3 we modify `poisson2d.c` to invoke `poisson2d_reference`, an exact copy of the main Jacobi loop. We parallelize only the main loop, not `poisson2d_reference`. The speedup is the performance gain of the main loop compared to the reference loop."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task3\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd ../Task3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: Implement OpenMP Pragmas; Compilation\n",
+    "\n",
+    "**Task**: Please add the correct OpenMP directives to `poisson2d.c` and compilation flags to the `Makefile` to enable OpenMP with the GCC and XL compilers.\n",
+    "\n",
+    "* **Directives**: Look at the TODOs in [`poisson2d.c`](poisson2d.c) to add OpenMP parallelism. The pragmas in question are `#pragma omp parallel for` (and once it's `#pragma omp parallel for reduction(max:error)` \u2013\u00a0can you guess where?)\n",
+    "* **Compilation**: Please add compilation flags enabling OpenMP in GCC and XL to the `Makefile`. For GCC, we need to add `-fopenmp` and the application needs to be linked with `-lgomp`. For XL, we need to add `-qsmp=omp` to the list of compilation flags. \n",
+    "\n",
+    "Afterwards, compile and run the application with the following commands."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -c -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp   poisson2d_reference.c -o poisson2d_reference.o -lm\n",
+      "gcc -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp  poisson2d.c poisson2d_reference.o -o poisson2d  -lm \n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The command to submit a job to the batch system is prepared in the environment variable `$SC19_SUBMIT_CMD`; use it together with `eval`. The following cell shows how to invoke the application through the batch system. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24951> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1000x1000: Ref:   4.7430 s, This:   3.9363 s, speedup:     1.20\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval $SC19_SUBMIT_CMD ./poisson2d 1000 1000 1000"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In order to run the parallel application, we need to set the number of threads using `OMP_NUM_THREADS`.\n",
+    "What is the best performance you can reach by setting the number of threads via `OMP_NUM_THREADS=N`, with `N` being the number of threads? Feel free to play around with the command in the following cell, which uses 1 thread as an example.  \n",
+    "We added `--bind none` to prevent `jsrun`, the job launcher on Ascent, from overlaying binding options. Also, we use `-c ALL_CPUS` to make all CPUs on the compute nodes available to you."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   4.7288 s, This:   4.9791 s, speedup:     0.95\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=1 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   4.7125 s, This:   2.4914 s, speedup:     1.89\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=2 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.1065 s, This:   1.3836 s, speedup:     1.52\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3868 s, This:   0.5272 s, speedup:     4.53\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=8 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3912 s, This:   0.4612 s, speedup:     5.18\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=10 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3864 s, This:   0.4037 s, speedup:     5.91\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=20 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3773 s, This:   0.3045 s, speedup:     7.81\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=40 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3819 s, This:   0.3081 s, speedup:     7.73\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=80 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
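+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "One way to read the scaling results above is through the parallel efficiency $E(N) = S(N)/N$, where $S(N)$ is the speedup reported for $N$ threads. This metric is not part of the original exercise; it is added here only as a reading aid, and the values below come from the single runs recorded above:\n",
+    "\n",
+    "$$E(2) = \\frac{1.89}{2} = 0.945, \\qquad E(8) = \\frac{4.53}{8} \\approx 0.57, \\qquad E(40) = \\frac{7.81}{40} \\approx 0.20, \\qquad E(80) = \\frac{7.73}{80} \\approx 0.10$$\n",
+    "\n",
+    "The highest speedup in these runs is reached with 40 threads (7.81), at an efficiency of only about 0.20; going to 80 threads brings no further gain."
+   ]
+  },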
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Bindings\n",
+    "\n",
+    "Different CPU architectures and models come with different configurations of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!\n",
+    "\n",
+    "There are tools which can be used to determine the configuration of the processor. Among those are:\n",
+    "\n",
+    "* `lscpu`: Can be used to determine the number of sockets, number of cores, and number of threads. It gives a very good overview and is available on most Linux systems.\n",
+    "* `ppc64_cpu --smt`: Specifically for POWER, this tool can give information about the number of simultaneous threads running per core (*SMT*, Simultaneous Multi-Threading).\n",
+    "\n",
+    "Run `ppc64_cpu --smt` to find out about the threading configuration of Ascent!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24465> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "SMT=4\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval $SC19_SUBMIT_CMD ppc64_cpu --smt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are more sources of information available:\n",
+    "\n",
+    "* `/proc/cpuinfo`: Holds information about virtual cores, including model and clock speed. Available on most Linux systems. Usually used together with `cat`.\n",
+    "* `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`: Holds information about the thread siblings of a given CPU core (`cpu0` in this case). Use it to find out which thread is mapped to which core."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24949> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "0-3\n",
+      "Job <24950> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "4-7\n"
+     ]
+    }
+   ],
+   "source": [
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n",
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "OpenMP defines environment variables that work across compilers to specify the binding of threads to cores. See, for instance, the [`OMP_PLACES` environment variable](https://www.openmp.org/spec-html/5.0/openmpse53.html). In addition, there is a GNU-specific variable, `GOMP_CPU_AFFINITY`, which can also be used to control affinity. It only applies to binaries built with GCC, but internally it serves the same function as setting `OMP_PLACES`. \n",
+    "\n",
+    "**Task**: Run the OpenMP-enabled application from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
+    "\n",
+    "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n",
+    "\n",
+    "What's your maximum speedup?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "We run with two different configurations: 1) binding all threads to the same core and 2) binding each thread to a different core. The speedup is clearly higher when the threads are bound to different cores."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Using `OMP_PLACES` for binding, and using some magical Python-Bash interplay:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Affinity: {0},{1},{2},{3}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{1},{2},{3}'\n",
+      "1000x1000: Ref:   4.7315 s, This:   3.9090 s, speedup:     1.21\n",
+      "Affinity: {0},{5},{9},{13}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{5},{9},{13}'\n",
+      "1000x1000: Ref:   4.6485 s, This:   1.2829 s, speedup:     3.62\n"
+     ]
+    }
+   ],
+   "source": [
+    "for affinity in [\"{0},{1},{2},{3}\", \"{0},{5},{9},{13}\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true OMP_PLACES=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000  | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
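+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Why do the two place lists behave so differently? With `SMT=4`, and assuming that the contiguous numbering seen in `thread_siblings_list` above (`0-3` for `cpu0`, `4-7` for `cpu5`) continues for higher CPU IDs, a hardware thread $c$ belongs to the physical core\n",
+    "\n",
+    "$$\\mathrm{core}(c) = \\left\\lfloor \\frac{c}{4} \\right\\rfloor$$\n",
+    "\n",
+    "Applied to the two lists used above, this gives\n",
+    "\n",
+    "$$(0, 1, 2, 3) \\mapsto (0, 0, 0, 0), \\qquad (0, 5, 9, 13) \\mapsto (0, 1, 2, 3)$$\n",
+    "\n",
+    "so the first configuration places all four OpenMP threads on one core, where they share its resources, while the second gives each thread a core of its own."
+   ]
+  },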
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Here, we carry out the same experiment using `GOMP_CPU_AFFINITY`, which results in the same `OMP_PLACES` setting (as the `OMP_DISPLAY_ENV` output shows). Again we run with two configurations, 1) binding all threads to the same core and 2) binding each thread to a different core, and again the speedup is higher when the threads are bound to different cores."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Affinity: 0,1,2,3\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{1},{2},{3}'\n",
+      "1000x1000: Ref:   2.3964 s, This:   2.1361 s, speedup:     1.12\n",
+      "Affinity: 0,5,9,13\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{5},{9},{13}'\n",
+      "1000x1000: Ref:   2.3925 s, This:   0.7030 s, speedup:     3.40\n"
+     ]
+    }
+   ],
+   "source": [
+    "for affinity in [\"0,1,2,3\", \"0,5,9,13\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Great!\n",
+    "\n",
+    "If you still have time: The same experiments can be repeated with the IBM XL compiler.\n",
+    "The compiler flag that enables OpenMP parallelism with XL is `-qsmp=omp`.\n",
+    "\n",
+    "**Task**: Add the OpenMP flag to the Makefile, build the XL binaries with OpenMP enabled, run the application with various numbers of threads, and note the speedup."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r -c -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp    poisson2d_reference.c -o poisson2d_reference.o -lm \n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "xlc_r -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS   time ./poisson2d\n",
+      "Job <24956> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "1000x1000: Ref:   5.6783 s, This:   2.6528 s, speedup:     2.14\n",
+      "21.56user 6.18system 0:08.37elapsed 331%CPU (0avgtext+0avgdata 23040maxresident)k\n",
+      "3200inputs+0outputs (2major+1098minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=xlc_r -B run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the parallel application with a varying number of threads (`OMP_NUM_THREADS`) and note the performance improvement."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Just as with the GCC binary, the speedup grows with the number of threads up to a certain point (40 threads in the runs below), beyond which the benefit tapers off."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2561 s, This:   2.6432 s, speedup:     0.85\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=1 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3071 s, This:   1.5343 s, speedup:     1.50\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=2 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2617 s, This:   0.6936 s, speedup:     3.26\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2728 s, This:   0.3402 s, speedup:     6.68\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=8 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.1678 s, This:   0.2869 s, speedup:     7.56\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=10 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2813 s, This:   0.1452 s, speedup:    15.71\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=20 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3284 s, This:   0.0981 s, speedup:    23.75\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=40 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2918 s, This:   0.1439 s, speedup:    15.92\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=80 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we repeat the thread-binding exercise for the XL binary. `OMP_PLACES` is a standard OpenMP environment variable and therefore applies to the XL binary as well. `GOMP_CPU_AFFINITY` is specific to GCC binaries, so it cannot be used to set the binding here.\n",
+    "\n",
+    "**Task**: Run the OpenMP-enabled application from Part A with different binding configurations. Make sure to run at least a) binding all threads to a single core and b) binding threads to different cores.\n",
+    "\n",
+    "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n",
+    "\n",
+    "We are mixing Python with Bash (`!`) here, so don't get confused: because of this, Bash environment variables need to be written with two dollar signs (`$$`).\n",
+    "\n",
+    "What's your maximum speedup?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Affinity: {0},{1},{2},{3}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES='{0},{1},{2},{3}' custom\n",
+      "1000x1000: Ref:   5.9792 s, This:   2.4122 s, speedup:     2.48\n",
+      "Affinity: {0},{5},{9},{13}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES='{0},{5},{9},{13}' custom\n",
+      "1000x1000: Ref:   2.3101 s, This:   0.6884 s, speedup:     3.36\n"
+     ]
+    }
+   ],
+   "source": [
+    "for affinity in [\"{0},{1},{2},{3}\", \"{0},{5},{9},{13}\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true OMP_PLACES=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000  | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Likewise, we see a higher speedup when we bind the threads to different cores rather than to a single core. This hands-on illustrates that, apart from compiler-level tuning, system-level tuning is equally important for obtaining performance improvements.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html\n",
+    "2. https://www.openmp.org/spec-html/5.0/openmpse53.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Survey<a name=\"survey\"></a>\n",
+    "\n",
+    "Please remember to take some time to fill out the [survey](http://bit.ly/sc19-eval)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
\ No newline at end of file
diff --git a/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-task.ipynb b/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-task.ipynb
index b9f231fa59e72fe0685688d76c4f8c076cc5f6c2..3b02952a3238816961fa9eb538040f08dba1f63a 100644
--- a/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-task.ipynb
+++ b/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization-task.ipynb
@@ -1,1206 +1,2085 @@
 {
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "# Hands-On Performance Optimization\n",
-        "_Supercomputing 2018 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 12th 2018_\n",
-        "\n",
-        "---"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "As for the first task of this tutorial, also this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done using an SSH connection to Ascent (or any other POWER9 computer) in your terminal.\n",
-        "\n",
-        "## Jupyter notebook execution\n",
-        "\n",
-        "When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizng the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n",
-        "\n",
-        "You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.\n",
-        "\n",
-        "If you want you also can get a [terminal](/terminals/1) in your browser.\n",
-        "\n",
-        "## Terminal fallback\n",
-        "\n",
-        "The tasks are place in directories named `Task[1-3]`.\n",
-        "\n",
-        "Makefile targets are created to cover everything, from compile, to run and profile. Please take a look at the cells containing the make calls as a guide also for the non-interactive version of this description."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "## Setup\n",
-        "\n",
-        "This hands-on session requires of GCC 6.4.0. By loading the `sc18/handson2` module before invoking this Notebook, we took care of also loading GCC 6.4.0 into the environment."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "## Tasks<a name=\"top\"></a>\n",
-        "\n",
-        "This session comes with multiple tasks, each one to be found in the respective sub-directory `Task[1-3]`. In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n",
-        "\n",
-        "Please choose from the task below.\n",
-        "\n",
-        "\n",
-        "* [Task 1](#task1): Compile Flags  \n",
-        "Improve performance of the CPU Jacobi solver with compiler flags such as `Ofast` and profile-directed feedback ([Solution 1](#solution0))\n",
-        "\n",
-        "* [Task 2](#task2): Software Prefetching  \n",
-        "Improve performance of the CPU Jacobi solver with software prefetching ([Solution 2](#solution1))\n",
-        "\n",
-        "* [Task 3](#task3): OpenMP  \n",
-        "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance ([Solution 3](#solution2))\n",
-        "  \n",
-        "* [Suvery](#survey) Please remember to take the survey !\n",
-        "    \n",
-        "### Make Targets <a name=\"make\"></a>\n",
-        "\n",
-        "For all tasks we have defined the following make targets. \n",
-        "\n",
-        "* __poisson2d__:  \n",
-        "  build `poisson2d` binary (default)\n",
-        "* __run__:  \n",
-        "   run `poisson2d` with default parameters\n"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "[Back to Top](#top)\n",
-        "\n",
-        "---"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "## Task 1: Compile Flags <a name=\"task1\"></a>\n",
-        "\n",
-        "\n",
-        "### Overview\n",
-        "\n",
-        "The goal of this task is to understand different options available to optimize the performance of the CPU Jacobi solver  \n",
-        "\n",
-        "Your task is to:\n",
-        "\n",
-        "* Optimize performance with `-Ofast` flag\n",
-        "* Optimize performance with profile directed feedback \n",
-        "\n",
-        "First, change the working directory to `Task1`."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 1,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task1\n"
-          ]
-        }
-      ],
-      "source": [
-        "%cd Task1"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "### Part A: `-Ofast` vs. `-O3`\n",
-        "\n",
-        "We are to compare the performance of the binary being compiled with `-Ofast` optimization and with `-O3` optimization. Right now, the Makefile specifies `-O3` as the optimization flag. Compile the code using `make` and run it with `make run` in the next two cells."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 2,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 3,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n",
-            "Job <5033> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "1.13user 0.00system 0:01.15elapsed 97%CPU (0avgtext+0avgdata 10944maxresident)k\n",
-            "2560inputs+0outputs (1major+264minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "You can use the GNU _perf_ tool to profile the application using the `perf` command (see below) and see the top time-consuming functions."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 4,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "[ perf record: Woken up 1 times to write data ]\n",
-            "[ perf record: Captured and wrote 0.172 MB perf.O3.data (4125 samples) ]\n",
-            "# To display the perf.data header info, please use --header/--header-only options.\n",
-            "#\n",
-            "#\n",
-            "# Total Lost Samples: 0\n",
-            "#\n",
-            "# Samples: 4K of event 'cycles:u'\n",
-            "# Event count (approx.): 3867635297\n",
-            "#\n",
-            "# Overhead  Command    Shared Object      Symbol                                  \n",
-            "# ........  .........  .................  ........................................\n",
-            "#\n",
-            "    72.02%  poisson2d  poisson2d          [.] 00000040.plt_call.fmax@@GLIBC_2.17\n",
-            "    10.16%  poisson2d  poisson2d          [.] poisson2d_reference\n",
-            "     9.99%  poisson2d  poisson2d          [.] main\n",
-            "     4.69%  poisson2d  libc-2.17.so       [.] __memcpy_power7\n",
-            "     2.23%  poisson2d  libm-2.17.so       [.] __fmaxf\n",
-            "     0.75%  poisson2d  libm-2.17.so       [.] __exp_finite\n",
-            "     0.07%  poisson2d  poisson2d          [.] 00000040.plt_call.memcpy@@GLIBC_2.17\n",
-            "     0.02%  poisson2d  poisson2d          [.] check_results\n",
-            "     0.02%  poisson2d  libm-2.17.so       [.] __GI___exp\n",
-            "     0.01%  poisson2d  ld-2.17.so         [.] _dl_relocate_object\n",
-            "     0.01%  poisson2d  [kernel.kallsyms]  [k] arch_local_irq_restore\n",
-            "     0.00%  poisson2d  ld-2.17.so         [.] _dl_new_object\n",
-            "     0.00%  poisson2d  ld-2.17.so         [.] _start\n",
-            "\n",
-            "\n",
-            "#\n",
-            "# (Tip: Show user configuration overrides: perf config --user --list)\n",
-            "#\n"
-          ]
-        }
-      ],
-      "source": [
-        "# perf record creates a perf.data file \n",
-        "!perf record -o perf.O3.data -e cycles ./poisson2d\n",
-        "# perf report opens the perf.data file \n",
-        "!perf report -i perf.O3.data | cat"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "**TASK**: Now change the optimization flag in the [Makefile](/edit/Task1/Makefile) to `-Ofast` and repeat the steps in the following cell. In case you follow along non-interactive, call `make` and `make run` in your shell. (If you are in the Jupyter Notebook, you can actually click the link of the [Makefile](/edit/Task1/Makefile). In other cases, use `vim` which is installed on Ascent.)"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 5,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 6,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n",
-            "Job <5034> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "0.51user 0.00system 0:00.52elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k\n",
-            "256inputs+0outputs (0major+264minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 7,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "[ perf record: Woken up 1 times to write data ]\n",
-            "[ perf record: Captured and wrote 0.086 MB perf.Ofast.data (1889 samples) ]\n",
-            "# To display the perf.data header info, please use --header/--header-only options.\n",
-            "#\n",
-            "#\n",
-            "# Total Lost Samples: 0\n",
-            "#\n",
-            "# Samples: 1K of event 'cycles:u'\n",
-            "# Event count (approx.): 1765737747\n",
-            "#\n",
-            "# Overhead  Command    Shared Object  Symbol                 \n",
-            "# ........  .........  .............  .......................\n",
-            "#\n",
-            "    44.65%  poisson2d  poisson2d      [.] main\n",
-            "    43.84%  poisson2d  poisson2d      [.] poisson2d_reference\n",
-            "    10.28%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
-            "     1.12%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
-            "     0.05%  poisson2d  poisson2d      [.] check_results\n",
-            "     0.03%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
-            "     0.02%  poisson2d  libc-2.17.so   [.] __readdir64\n",
-            "     0.01%  poisson2d  ld-2.17.so     [.] _dl_new_object\n",
-            "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
-            "\n",
-            "\n",
-            "#\n",
-            "# (Tip: System-wide collection from all CPUs: perf record -a)\n",
-            "#\n"
-          ]
-        }
-      ],
-      "source": [
-        "# perf record creates a perf.data file \n",
-        "!perf record -o perf.Ofast.data -e cycles ./poisson2d\n",
-        "# perf report opens the perf.data file \n",
-        "!perf report -i perf.Ofast.data | cat"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "If `perf` is unavailable to you on other machines, you can also study the disassembly with `objdump`: `objdump -lSd ./poisson2d > poisson2d.dis` (feel free to experiment with this in the Notebook as well, just prefix the command with a `!` to execute it.)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "####  Interpretation\n",
-        "\n",
-        "Depending on the application requirement, if a high precision of results is not mandatory, the users can compile an application with `-Ofast` which enables `\u2013ffast-math` option that implements the same math function in a relaxed manner very similar to how general mathematical expressions are implemented and avoids the overhead of calling a function from the math library. Comparing the files, you will see that the `-Ofast` binary natively implements the `fmax` function using instructions available in the hardware. The `-O3` binary makes a library call to compute `fmax` to follow a stricter _IEEE_ requirement for accuracy."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "### Part B: Profile-directed Feedback\n",
-        "\n",
-        "For the first level of optimization we saw `Ofast` cut the execution time of the `O3` binary by almost half.\n",
-        "\n",
-        "We can optimize the performance further by using profile directed feedback optimization.\n",
-        "\n",
-        "To compile using profile directed feedback with the GCC compiler we need to do the following steps\n",
-        "\n",
-        "1. We need to first build a training binary using `-fprofile-generate`; this instructs the compiler to record hot path information \n",
-        "2. Run the training binary with a smaller input size; you should see a `.gcda` file generated which stores hot path information for further optimization by the compiler \n",
-        "3. build the final binary using `-fprofile-use` which uses the profile information in the `.gcda` file \n",
-        "4. Compare the performance of the final binary with the original `Ofast` binary \n",
-        "\n",
-        "**TASK**: First, search for `TODO1` in the [Makefile](/edit/Task1/Makefile). It defines an additional compilation flag for `gcc`. Insert `-fprofile-generate=FOLDER` there with FOLDER pointing to `$$SC18_DIR_SCRATCH`, your personal write-directory (the double dollar signs are intentional as they are used to escape in the GNU Make syntax).\n",
-        "\n",
-        "After editing, run the following two cells to train your program."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 8,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-generate=$SC18_DIR_SCRATCH\" -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-generate=$SC18_DIR_SCRATCH\" -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_train  -lm\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make poisson2d_train"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 9,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_train 200 64 64\n",
-            "Job <5035> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 200 iterations on 64 x 64 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.248743\n",
-            "  100, 0.124046\n",
-            "Calculate current execution.\n",
-            "    0, 0.248743\n",
-            "  100, 0.124046\n",
-            "0.00user 0.00system 0:00.10elapsed 5%CPU (0avgtext+0avgdata 5248maxresident)k\n",
-            "512inputs+0outputs (0major+115minor)pagefaults 0swaps\n",
-            "mv $SC18_DIR_SCRATCH/*.gcda .\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run_train"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "Now, a `.gcda` file exists in the directory which can be used for an profile-accelerated subsequent run.\n",
-        "\n",
-        "**TASK**: Edit the [Makefile](/edit/Task1/Makefile) again, this time modifying `TODO2` to be equivalent to `-fprofile-use`. A directory is not needed as we copied the gcda file into the current directory.\n",
-        "\n",
-        "Run the following cells in order to build using the newly added flag and then run with the profile-accelerated version."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 10,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-use\" -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-use\" -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_profile  -lm\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make poisson2d_profile"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 11,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_profile\n",
-            "Job <5036> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "0.47user 0.00system 0:00.48elapsed 98%CPU (0avgtext+0avgdata 10816maxresident)k\n",
-            "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run_profile"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations)"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "#### References\n",
-        "\n",
-        "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
-        "2. https://perf.wiki.kernel.org/index.php/Tutorial"
-      ]
-    },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Hands-On Performance Optimization\n",
+    "_Supercomputing 2019 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 18th 2019_\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As with the first task of this tutorial, this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done over an SSH connection to Ascent (or any other POWER9 computer) in your terminal.\n",
+    "\n",
+    "## Jupyter notebook execution\n",
+    "\n",
+    "When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizing the code, the output will be replaced. You can, however, duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n",
+    "\n",
+    "You will always find links to a file browser of the corresponding task subdirectory, as well as direct links to the source files you will need to edit and to the profiling output you need to open locally.\n",
+    "\n",
+    "If you want, you can also get a terminal in your browser; just open it via the \u00bbNew Launcher\u00ab button (`+`).\n",
+    "\n",
+    "## Terminal fallback\n",
+    "\n",
+    "The tasks are placed in directories named `Task[1-3]`.\n",
+    "\n",
+    "Makefile targets cover everything from compiling to running and profiling. Please take a look at the cells containing the `make` calls; they also serve as a guide for the non-interactive version of this description."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "We are using some very fresh compiler features and use GCC 9.2.0 because of that. It should already be in your environment. Let's check!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "[Back to Top](#top)\n",
-        "\n",
-        "---"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc (GCC) 9.2.0\n",
+      "Copyright (C) 2019 Free Software Foundation, Inc.\n",
+      "This is free software; see the source for copying conditions.  There is NO\n",
+      "warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!gcc --version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Tasks<a name=\"top\"></a>\n",
+    "\n",
+    "This session comes with multiple tasks, each one to be found in the respective sub-directory `Task[1-3]`. In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n",
+    "\n",
+    "Please choose from the tasks below.\n",
+    "\n",
+    "\n",
+    "* [Task 1](#task1): __Basic compiler optimization flags and compiler annotations__\n",
+    "\n",
+    "Improve performance of the CPU Jacobi solver with compiler flags such as `-Ofast` and profile-directed feedback. Learn about compiler annotations.\n",
+    "\n",
+    "* [Task 2](#task2): __Optimization via Prefetching controlled by compiler__\n",
+    "\n",
+    "Improve performance of the CPU Jacobi solver with software prefetching. Some compilers such as IBM XL define flags that can be used to modify the aggressiveness of the hardware prefetcher. Learn to modify the DSCR value through XL and study the impact on application performance. \n",
+    "* [Task 3](#task3): __Optimization via OpenMP controlled by compiler and the system__\n",
+    "\n",
+    "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance. \n",
+    "  \n",
+    "* [Survey](#survey): Please remember to take the survey!\n",
+    "    \n",
+    "### Make Targets <a name=\"make\"></a>\n",
+    "\n",
+    "For all tasks we have defined the following make targets. \n",
+    "\n",
+    "* __poisson2d__:  \n",
+    "  build `poisson2d` binary (default)\n",
+    "* __run__:  \n",
+    "   run `poisson2d` with default parameters\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 1: Basic compiler optimization flags and compiler annotations <a name=\"task1\"></a>\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "The goal of this task is to understand the different options available for optimizing the performance of the CPU Jacobi solver.\n",
+    "\n",
+    "Your task is to:\n",
+    "\n",
+    "* Optimize performance with the `-Ofast` flag\n",
+    "* Verify the cause of the performance improvement by viewing perf profiles of the `-O3` and `-Ofast` binaries\n",
+    "* Optimize performance with profile-directed feedback\n",
+    "* Generate compiler annotations/remarks to understand the optimizations done by the compiler with and without profile-directed feedback\n",
+    "\n",
+    "\n",
+    "First, change the working directory to `Task1`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "## Task 2:<a name=\"task2\"></a> Software Pretechting\n",
-        "\n",
-        "\n",
-        "### Overview\n",
-        "\n",
-        "Study the difference of program execution time of different optimization levels with and without software prefetching.\n",
-        "\n",
-        "First, change directory to that of Task 2"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task1\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd Task1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: `-Ofast` vs. `-O3`\n",
+    "\n",
+    "We compare the performance of the binary compiled with `-Ofast` optimization against the one compiled with `-O3` optimization. As in the previous task, we use a `Makefile` for compilation. The `Makefile` targets `poisson2d_O3` and `poisson2d_Ofast` are already prepared.\n",
+    "\n",
+    "**TASK**: Add `-O3` as the optimization flag for the `poisson2d_O3` target by using the corresponding `CFLAGS` definition. There are notes relating to Task 1 in the header of the `Makefile`. Compile the code using `make` as indicated below and run with the `make` targets `run`, `run_perf` and `run_perf_recrep`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 84,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 12,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task2\n"
-          ]
-        }
-      ],
-      "source": [
-        "%cd ../Task2"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_O3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 73,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "### Part A: Running"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24897> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's have a look at the output of the `Makefile` target `run_perf`. It invokes the Linux _perf_ tool to print the number of instructions executed and the number of cycles the POWER9 took to execute the program. Feel free to add further counters to this call to _perf_."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 74,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "Look at the [Makefile](/edit/Task2/Makefile) and work on the TODOs. Please implement compile flags as mentioned in the Makefile target name.\n",
-        "\n",
-        "Afterwards, compile each target with the following cells and submit them to the batch system. Follow along accordingly in the non-interactive version of this Notebook."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24898> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16264721613      cycles:u                                                    \n",
+      "       28463907825      instructions:u            #    1.75  insn per cycle                                            \n",
+      "\n",
+      "       4.738444892 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we run the `Makefile` target `run_perf_recrep`, which prints the hottest routines of the application by combining `perf record ./app` and `perf report`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 75,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 13,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_pref  -lm\n",
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_pref\n",
-            "Job <5037> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "1.12user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10880maxresident)k\n",
-            "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run_o3_pref"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24899> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "[ perf record: Woken up 3 times to write data ]\n",
+      "[ perf record: Captured and wrote 0.739 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (19102 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24900> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "# To display the perf.data header info, please use --header/--header-only options.\n",
+      "#\n",
+      "#\n",
+      "# Total Lost Samples: 0\n",
+      "#\n",
+      "# Samples: 19K of event 'cycles:u'\n",
+      "# Event count (approx.): 16254596654\n",
+      "#\n",
+      "# Overhead  Command    Shared Object  Symbol                                  \n",
+      "# ........  .........  .............  ........................................\n",
+      "#\n",
+      "    65.50%  poisson2d  poisson2d      [.] 00000038.plt_call.fmax@@GLIBC_2.17\n",
+      "    21.21%  poisson2d  poisson2d      [.] main\n",
+      "     9.18%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "     3.28%  poisson2d  libm-2.17.so   [.] __fmaxf\n",
+      "     0.74%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.01%  poisson2d  libm-2.17.so   [.] __GI___exp\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _wordcopy_fwd_aligned\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
+      "\n",
+      "\n",
+      "#\n",
+      "# (Tip: Limit to show entries above 5% only: perf report --percent-limit 5)\n",
+      "#\n"
+     ]
+    }
+   ],
+   "source": [
+    "# run_perf_recrep displays the top hot routines \n",
+    "!make run_perf_recrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Now add the optimization flag `-Ofast` to the `CFLAGS` for target `poisson2d_Ofast`. Compile the program with the target `poisson2d_Ofast`, then run and analyse it as before with `run`, `run_perf` and `run_perf_recrep`.\n",
+    "\n",
+    "What difference do you see?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 14,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_pref  -lm\n",
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_pref\n",
-            "Job <5038> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "0.77user 0.00system 0:00.77elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k\n",
-            "256inputs+0outputs (0major+264minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run_ofast_pref"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24901> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.41user 0.00system 0:02.41elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_Ofast \n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Again, run a `perf`-instrumented version:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 15,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_nopref  -lm\n",
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_nopref\n",
-            "Job <5039> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "1.13user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10944maxresident)k\n",
-            "256inputs+0outputs (0major+266minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run_o3_nopref"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24902> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        8258991976      cycles:u                                                    \n",
+      "       12013091172      instructions:u            #    1.45  insn per cycle                                            \n",
+      "\n",
+      "       2.408703909 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Generate the list of top routines in terms of hotness:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 78,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 16,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_nopref  -lm\n",
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_nopref\n",
-            "Job <5040> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "0.82user 0.00system 0:00.82elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k\n",
-            "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run_ofast_nopref"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24903> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "[ perf record: Woken up 2 times to write data ]\n",
+      "[ perf record: Captured and wrote 0.382 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (9728 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24904> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "# To display the perf.data header info, please use --header/--header-only options.\n",
+      "#\n",
+      "#\n",
+      "# Total Lost Samples: 0\n",
+      "#\n",
+      "# Samples: 9K of event 'cycles:u'\n",
+      "# Event count (approx.): 8268811890\n",
+      "#\n",
+      "# Overhead  Command    Shared Object  Symbol                                  \n",
+      "# ........  .........  .............  ........................................\n",
+      "#\n",
+      "    81.12%  poisson2d  poisson2d      [.] main\n",
+      "    17.97%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "     0.79%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.02%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
+      "     0.01%  poisson2d  libc-2.17.so   [.] vfprintf@@GLIBC_2.17\n",
+      "     0.01%  poisson2d  libc-2.17.so   [.] _dl_addr\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] open_path\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] init_tls\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
+      "\n",
+      "\n",
+      "#\n",
+      "# (Tip: For tracepoint events, try: perf report -s trace_fields)\n",
+      "#\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf_recrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If `perf` is unavailable to you on other machines, you can also study the disassembly with `objdump`: `objdump -lSd ./poisson2d > poisson2d.dis`. Feel free to experiment with this in the Notebook as well; just prefix the command with a `!` to execute it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "####  Interpretation\n",
+    "\n",
+    "If an application does not require strict floating-point precision, it can be compiled with `-Ofast`, which enables the `-ffast-math` option. With it, the compiler may implement math functions in a relaxed manner, much like ordinary arithmetic expressions, and so avoid the overhead of calling into the math library. Comparing the two profiles (or disassemblies), you will see that the `-Ofast` binary implements `fmax` natively using instructions available in the hardware: the `fmax` entries that dominate the `-O3` profile no longer appear. The `-O3` binary makes a library call to compute `fmax` in order to follow the stricter _IEEE_ requirements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Profile-directed Feedback\n",
+    "\n",
+    "For the first level of optimization, we see that `-Ofast` cuts the execution time of the `-O3` binary almost in half.\n",
+    "\n",
+    "We can optimize the performance further by using profile-directed feedback optimization.\n",
+    "\n",
+    "To compile using profile-directed feedback with the GCC compiler, we need to build the application in three stages:\n",
+    "\n",
+    "1. Instrument binary;\n",
+    "2. Run binary with training, gather profile information;\n",
+    "3. Use profile information to generate optimized binary.\n",
+    "\n",
+    "\n",
+    "Step 1 is achieved by compiling the binary with the correct flag \u2013\u00a0`-fprofile-generate`. In our case, we need to specify an output location, which should be `$(SC19_DIR_SCRATCH)`.\n",
+    "\n",
+    "Step 2 consists of a usual, albeit shorter run of the instrumented binary. The can be very short, though the parameters need to be representative of the actual run. After the binary ran, an output file (with file extension `.gcda`) is written to the directory specified during compilation.\n",
+    "\n",
+    "For Step 3, the binary is once again compiled, but this time using the `gcda` profile just generated. The according flag is `-fprofile-use`, which we set to `$(SC19_DIR_SCRATCH)` as well.\n",
+    "\n",
+    "In our `Makefile` at hand, we prepared the steps already for you in the form of two targets.\n",
+    "\n",
+    "* `poisson2d_train`: Will compile the binary with profile-directed feedback\n",
+    "* `poisson2d_ref`: Will take a generated profile and compile a new, optimized binary\n",
+    "\n",
+    "Through the dependencies of the `Makefile`, a profile run is launched between these two targets.\n",
+    "\n",
+    "**TASK**: Edit the [Makefile](`Makefile`) and add the `-fprofile-*` flags to the `CFLAGS` of `poisson2d_train` and\n",
+    "`poisson2d_ref` as outlined in the file.\n",
+    "\n",
+    "After that, you may launch them with the following cells (`gen_profile` is a meta-target and uses `poisson2d_train` and `poisson2d_ref`). If you need to clean the generated profile, you may use `make clean_profile`."
+   ]
+  },
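+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For the terminal route, the three stages can be sketched as plain commands. This is a minimal sketch, not the content of the `Makefile`: it assumes that `SC19_DIR_SCRATCH` is exported in your shell and that you run directly on a compute node (the `Makefile` wraps the run in a batch submission instead).\n",
+    "\n",
+    "```shell\n",
+    "# Assumption: SC19_DIR_SCRATCH is set in the environment and points to a writable directory\n",
+    "\n",
+    "# 1. Instrument: build a binary that records a profile while it runs\n",
+    "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-generate=${SC19_DIR_SCRATCH} poisson2d.c -o poisson2d_train -lm\n",
+    "\n",
+    "# 2. Train: short but representative run; writes a .gcda file to the profile directory\n",
+    "./poisson2d_train 100 100 100\n",
+    "\n",
+    "# 3. Optimize: rebuild the binary using the recorded profile\n",
+    "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-use=${SC19_DIR_SCRATCH} poisson2d.c -o poisson2d_ref -lm\n",
+    "```\n",
+    "\n",
+    "Note that both compilations use the same source file and the same remaining flags; only the `-fprofile-*` flag changes between stage 1 and stage 3."
+   ]
+  },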
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "Do you notice the impact difference with optimization levels? It's always important to carefully study the interplay of flags."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
+      "Job <24905> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249490\n",
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_ref -lm \n",
+      "cp poisson2d_ref poisson2d\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make gen_profile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 80,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "### Part B: Analysis of Instructions\n",
-        "\n",
-        "Compilation with the software prefetching flag causes the compiler to generate the `__dcbt` and `__dcbtst`  instructions that prefetch memory values to L3.\n",
-        "\n",
-        "Verify it using `objdump -lSd` on each file (`poisson2d_o3_pref`, `poisson2d_ofast_pref`, `poisson2d_o3_nopref`, `poisson2d_ofast_nopref`). You might want to grep for `dcb`."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24906> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.28user 0.01system 0:02.30elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's also measure instructions and cycles."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 81,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "exercise": "task"
-      },
-      "outputs": [],
-      "source": [
-        "#!objdump -l\u2026"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24907> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        7925983538      cycles:u                                                    \n",
+      "       12253080719      instructions:u            #    1.55  insn per cycle                                            \n",
+      "\n",
+      "       2.313471365 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is your speed-up? Feel free to run with larger problem sizes (mesh size, number of iterations)."
+   ]
+  },
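+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One common way to quantify the gain is the ratio of elapsed times. The symbols $t_{\\text{before}}$ and $t_{\\text{after}}$ are our own notation for the run times of the binary built without and with profile-directed feedback, measured for the same problem size:\n",
+    "\n",
+    "$$S = \\frac{t_{\\text{before}}}{t_{\\text{after}}}$$\n",
+    "\n",
+    "The counter output above also allows a quick consistency check of the reported instructions per cycle:\n",
+    "\n",
+    "$$\\text{IPC} = \\frac{\\text{instructions}}{\\text{cycles}} = \\frac{12253080719}{7925983538} \\approx 1.55$$"
+   ]
+  },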
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Compiler annotations/Remarks\n",
+    "\n",
+    "Most compilers provide an option to emit annotations or remarks. These remarks summarize in detail which optimizations were performed and where in the source they were applied. There are also options that report optimizations that were missed, together with the reason why they could not be applied.\n",
+    "\n",
+    "To generate compiler annotations using GCC, use `-fopt-info-all`. If you only want to see the missed optimizations, use `-fopt-info-missed` instead of `-fopt-info-all`. See also the [documentation of GCC regarding the flag](https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info).\n",
+    "\n",
+    "**TASK**: Have a look at the `CFLAGS` of the `Makefile` target `poisson2d_Ofast_info`. Add the flag `-fopt-info-all` to the list of flags. This will print optimization information to the terminal (GCC writes it to `stderr` by default). If you would rather write this information to a file, use, for example, `-fopt-info-all=$(SC19_DIR_SCRATCH)/filename`."
+   ]
+  },
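+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Outside of the `Makefile`, the variants of the flag can be tried directly on the command line. The following is a minimal sketch: it compiles only (`-c`), so no linking against other object files is needed, and the file name `opt_report.txt` is a hypothetical example. `SC19_DIR_SCRATCH` is assumed to be exported in your shell.\n",
+    "\n",
+    "```shell\n",
+    "# Report everything the optimizer did, missed, or noted (long output)\n",
+    "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -Ofast -fopt-info-all -c poisson2d.c\n",
+    "\n",
+    "# Report only the optimizations that were missed\n",
+    "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -Ofast -fopt-info-missed -c poisson2d.c\n",
+    "\n",
+    "# Restrict the report to missed vectorization opportunities\n",
+    "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -Ofast -fopt-info-vec-missed -c poisson2d.c\n",
+    "\n",
+    "# Write the full report to a file instead of the terminal (hypothetical file name)\n",
+    "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -Ofast -fopt-info-all=${SC19_DIR_SCRATCH}/opt_report.txt -c poisson2d.c\n",
+    "```\n",
+    "\n",
+    "Restricting the report to one group, such as `vec`, is often the quickest way to find out why a particular loop was not vectorized."
+   ]
+  },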
+  {
+   "cell_type": "code",
+   "execution_count": 82,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "If you feel up to the task, you can study the number of L3 cache misses using the corresponding performance counter, `PM_L3_MISS`. Either use your knowledge from Hands-On 1, or use the following call to `perf`, in which we already converted the named counter to a raw counter address."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all poisson2d.c poisson2d_reference.o -o poisson2d_Ofast_info  -lm\n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 207->207 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_Ofast_info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's compare this with the compiler output when using the profile-directed feedback from Task 1, Part B.\n",
+    "\n",
+    "**TASK**: \n",
+    "Adapt the `CFLAGS` of `poisson2d_ref_info` to include `-fopt-info-all` **and** the profile input of `-fprofile-use=\u2026` here. *(Be advised: Long output!)*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 83,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 35,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Job <5048> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "\n",
-            " Performance counter stats for './poisson2d_ofast_nopref':\n",
-            "\n",
-            "        2829292169      cycles:u                                                    \n",
-            "         136018637      r168a4:u                                                    \n",
-            "\n",
-            "       0.826136863 seconds time elapsed\n",
-            "\n",
-            "Job <5049> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "\n",
-            " Performance counter stats for './poisson2d_ofast_pref':\n",
-            "\n",
-            "        2654990243      cycles:u                                                    \n",
-            "         128824827      r168a4:u                                                    \n",
-            "\n",
-            "       0.775593651 seconds time elapsed\n",
-            "\n"
-          ]
-        }
-      ],
-      "source": [
-        "for f in [\"poisson2d_ofast_nopref\", \"poisson2d_ofast_pref\"]:\n",
-        "    !eval $$SC18_SUBMIT_CMD perf stat -e cycles,r168a4 ./$f\n"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "Increasing alignment of decl: __gcov0.main\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_D_00100_1_main/48 -> __gcov_exit/55, function body not available\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_I_00100_0_main/47 -> __gcov_init/54, function body not available\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 295->295 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:122:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:88:5: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:72:5: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_337, ny_124, nx_286);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_316, error_118);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_127);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_311);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_122);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_129 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_132 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_140 = malloc (8000000);\n",
+      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 53\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 50\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 47\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:122:9: optimized: loop unrolled 3 times (header execution count 9800)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 33\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 30\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 42\n",
+      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 40\n",
+      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 60\n",
+      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 23\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:88:5: optimized: loop unrolled 3 times (header execution count 100)\n",
+      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 12\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
+      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 16\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_init (&*.LPBX0);\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_exit ();\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
+      "Job <24908> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "libgcov profiling error:/gpfs/wolf/trn003/scratch/aherten//#autofs#nccsopen-svm1_home#aherten#SC19-Tutorial#3-Optimizing_POWER#Handson#Task1#poisson2d.gcda:overwriting an existing profile data with a different timestamp\n",
+      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249490\n",
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c poisson2d_reference.o -o poisson2d_ref_info  -lm\n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 207->207 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n",
+      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 47\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 44\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 40\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:122:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 27\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 24\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 37\n",
+      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 35\n",
+      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 51\n",
+      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 18\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:88:5: optimized: loop unrolled 7 times (header execution count 99)\n",
+      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 9\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
+      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 14\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_ref_info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Comparing the annotations generated at a plain `-Ofast` optimization level with those generated at `-Ofast` plus profile-directed feedback, we observe that many more optimizations become possible thanks to the profile information.\n",
+    "\n",
+    "For instance you will see annotations such as\n",
+    "```\n",
+    "poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+    "```\n",
+    "\n",
+    "The execution count is the dynamic execution count of the node at runtime, as recorded in the profile. This information identifies the hotter paths and thereby facilitates additional optimizations."
+   ]
+  },
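+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The execution counts can also be extracted from the annotation output to rank loops by how hot they are. The following is a minimal Python sketch and not part of the tutorial `Makefile`: the helper name `hot_loops` is made up for illustration, and the sample lines are copied from the `poisson2d_ref_info` output above.\n",
+    "\n",
+    "```python\n",
+    "import re\n",
+    "\n",
+    "# Annotation lines copied from the poisson2d_ref_info output above.\n",
+    "SAMPLE = '''\n",
+    "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+    "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+    "poisson2d.c:88:5: optimized: loop unrolled 7 times (header execution count 99)\n",
+    "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
+    "'''\n",
+    "\n",
+    "# Matches only the unrolling annotations that carry an execution count.\n",
+    "PATTERN = re.compile(\n",
+    "    '^(?P<loc>[^ ]+): optimized: loop unrolled (?P<times>[0-9]+) times '\n",
+    "    '[(]header execution count (?P<count>[0-9]+)[)]$'\n",
+    ")\n",
+    "\n",
+    "def hot_loops(text):\n",
+    "    '''Return (location, unroll factor, execution count) tuples, hottest first.'''\n",
+    "    rows = []\n",
+    "    for line in text.splitlines():\n",
+    "        match = PATTERN.match(line.strip())\n",
+    "        if match:\n",
+    "            rows.append((match.group('loc'),\n",
+    "                         int(match.group('times')),\n",
+    "                         int(match.group('count'))))\n",
+    "    # Sort by execution count, largest first.\n",
+    "    return sorted(rows, key=lambda row: row[2], reverse=True)\n",
+    "\n",
+    "for loc, times, count in hot_loops(SAMPLE):\n",
+    "    print(loc, times, count)\n",
+    "```\n",
+    "\n",
+    "Calling `hot_loops(SAMPLE)` lists the loop at `poisson2d.c:118:25` first, since its header execution count (436550) is the largest in the sample."
+   ]
+  },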
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "\n",
+    "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
+    "2. https://perf.wiki.kernel.org/index.php/Tutorial"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 2:<a name=\"task2\"></a> Impact of Prefetching on Performance\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "* Study how program execution time differs across optimization levels, with and without software prefetching.\n",
+    "* Verify the impact by measuring cache counters with and without prefetching.\n",
+    "* Learn how to modify the contents of the DSCR (*Data Stream Control Register*) using the IBM XL compiler, and study the impact of different DSCR values.\n",
+    "\n",
+    "But first, let's change directory to that of Task 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "#### References\n",
-        "\n",
-        "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
-        "2. https://www.gnu.org/software/gcc/projects/prefetch.html"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task2\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd ../Task2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: Software Prefetching"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Look at the Makefile and work on the TODOs. \n",
+    "\n",
+    "- First, generate an `-Ofast`-optimized binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!\n",
+    "- Modify the `Makefile` to add the option for software prefetching (`-fprefetch-loop-arrays`). Compare the performance of `-Ofast` with and without software prefetching."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 97,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "[Back to Top](#top)\n",
-        "\n",
-        "---"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rm -f poisson2d poisson2d*.o\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make clean"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 88,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "## Task 3: OpenMP\n",
-        "<a name=\"task3\"></a>\n",
-        "\n",
-        "\n",
-        "### Overview\n",
-        "\n",
-        "We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-thread processes to certain cores.\n",
-        "\n",
-        "First, we need to change directory to that of Task3."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "make: `poisson2d' is up to date.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24911> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.39user 0.01system 0:02.40elapsed 100%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24912> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        8271503902      cycles:u                                                    \n",
+      "         481152478      r168a4:u                                                    \n",
+      "\n",
+      "       2.412224884 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 98,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 1,
-      "metadata": {
-        "ExecuteTime": {
-          "end_time": "2018-11-07T13:47:57.724441Z",
-          "start_time": "2018-11-07T13:47:57.718745Z"
-        }
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task3\n"
-          ]
-        }
-      ],
-      "source": [
-        "%cd ../Task3"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm\n",
+      "cp poisson2d_pref poisson2d\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24919> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1.92user 0.00system 0:01.93elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24920> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        6586609284      cycles:u                                                    \n",
+      "         459879452      r168a4:u                                                    \n",
+      "\n",
+      "       1.925399505 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_pref CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Repeat the experiment with the `-O3` flag. Have a look at the `Makefile` and the outlined TODO. There's a position to easily adapt `-Ofast`\u2192`-O3`!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 100,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "### Part A: Implement OpenMP Pragmas; Compilation\n",
-        "\n",
-        "**Task**: Please add the correct OpenMP pragmas to the source code and compilations flags to enable OpenMP.\n",
-        "\n",
-        "* **pragmas**: Look at the TODOs in [`poisson2d.c`](/edit/Task3/poisson2d.c) to add OpenMP parallelism. The pragmas in question are `#pragma  omp parallel for`\n",
-        "* **Compilation**: Please add compilation flags enabling OpenMP in GCC to the [Makefile](/edit/Task3/Makefile). The flag in question is `-fopenmp`.\n",
-        "\n",
-        "Edit the files with the links above if you are running the interactive version of the Notebook or navigate to `poisson2d.c` and `Makefile` yourself in case you run the non-interactive version.\n",
-        "\n",
-        "Afterwards, compile and run the application with the following cells. Non-interactive: Follow along accordingly in the shell."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   poisson2d.c  -o poisson2d  -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24923> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24924> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16445764669      cycles:u                                                    \n",
+      "         645094089      r168a4:u                                                    \n",
+      "\n",
+      "       4.792567763 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 101,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 37,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
-            "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make poisson2d"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24925> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.74user 0.00system 0:04.74elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24926> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16239159454      cycles:u                                                    \n",
+      "         631061431      r168a4:u                                                    \n",
+      "\n",
+      "       4.730144897 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_pref CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Do you notice how the impact of software prefetching differs between optimization levels? At which optimization level does software prefetching help the most?"
+   ]
+  },
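+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To put numbers on the comparison, the measurements can be turned into ratios. The following is a minimal Python sketch that uses the values recorded in the `perf stat` outputs of the four cells above; the helper name `prefetch_gain` is made up for illustration, and `r168a4` is simply the raw event counted by `make l3missstats`. Your own runs will give different numbers.\n",
+    "\n",
+    "```python\n",
+    "# (cycles, r168a4 events, elapsed seconds) as recorded in the cells above,\n",
+    "# keyed by (optimization level, software prefetching enabled).\n",
+    "RUNS = {\n",
+    "    ('-Ofast', False): (8271503902, 481152478, 2.412224884),\n",
+    "    ('-Ofast', True): (6586609284, 459879452, 1.925399505),\n",
+    "    ('-O3', False): (16445764669, 645094089, 4.792567763),\n",
+    "    ('-O3', True): (16239159454, 631061431, 4.730144897),\n",
+    "}\n",
+    "\n",
+    "def prefetch_gain(level):\n",
+    "    '''Return (cycle speedup, relative change in r168a4 events) due to prefetching.'''\n",
+    "    base_cycles, base_events, _ = RUNS[(level, False)]\n",
+    "    pref_cycles, pref_events, _ = RUNS[(level, True)]\n",
+    "    speedup = base_cycles / pref_cycles\n",
+    "    event_change = (pref_events - base_events) / base_events\n",
+    "    return speedup, event_change\n",
+    "\n",
+    "for level in ('-Ofast', '-O3'):\n",
+    "    speedup, event_change = prefetch_gain(level)\n",
+    "    print(f'{level}: cycle speedup {speedup:.2f}x, r168a4 events {event_change:+.1%}')\n",
+    "```\n",
+    "\n",
+    "With the recorded values, the cycle speedup from prefetching is roughly 1.26x at `-Ofast` but only about 1.01x at `-O3`."
+   ]
+  },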
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Analysis of Instructions\n",
+    "\n",
+    "Compiling the `-Ofast` binary with the software prefetching flag causes the compiler to generate `dcb*` instructions, which prefetch memory values to L3."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: \n",
+    "Run `$(SC19_SUBMIT_CMD) objdump -lSd` on each binary file (`-O3`, `-Ofast` with prefetch/no prefetch).\n",
+    "Look for instructions beginning with `dcb`.\n",
+    "At what optimization levels does the compiler generate software prefetching instructions?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 114,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 40,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n",
-            "Job <5052> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "Calculate current execution.\n",
-            "    0, 0.249980\n",
-            "  100, 0.246028\n",
-            "  200, 0.242198\n",
-            "  300, 0.238487\n",
-            "  400, 0.234887\n",
-            "500x500: Ref:   0.2571 s, This:   0.2946 s, speedup:     0.87\n",
-            "1.48user 0.00system 0:00.56elapsed 263%CPU (0avgtext+0avgdata 9664maxresident)k\n",
-            "0inputs+0outputs (0major+273minor)pagefaults 0swaps\n"
-          ]
-        }
-      ],
-      "source": [
-        "!make run"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=gcc -B poisson2d_pref\n",
+    "!objdump -lSd ./poisson2d_pref > poisson2d.dis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 116,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "The command to submit a job to the batch system is prepared in an environment variable `$SC18_SUBMIT_CMD`; use it together with `eval`. In the following cell, it is shown how to increase the work of the application."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "    10000b28:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000b30:\t2c ba 00 7c \tdcbt    0,r23\n",
+      "    10000b38:\t2c b2 00 7c \tdcbt    0,r22\n",
+      "    10000b50:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000b58:\tec b9 00 7c \tdcbtst  0,r23\n",
+      "    10000b80:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000e64:\t2c 92 00 7c \tdcbt    0,r18\n",
+      "    10000e68:\t2c 9a 00 7c \tdcbt    0,r19\n",
+      "    10000e6c:\t2c a2 00 7c \tdcbt    0,r20\n",
+      "    10000e70:\t2c aa 00 7c \tdcbt    0,r21\n",
+      "    10000e7c:\t2c b2 00 7c \tdcbt    0,r22\n",
+      "    10000e80:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000e94:\tec b9 00 7c \tdcbtst  0,r23\n"
+     ]
+    }
+   ],
+   "source": [
+    "!grep dcb poisson2d.dis"
+   ]
+  },
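+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the Power ISA, `dcbt` is *data cache block touch* and `dcbtst` is *data cache block touch for store*. To compare binaries quickly, the mnemonics can be counted rather than read line by line. The following is a minimal Python sketch; the helper name `count_prefetch_ops` is made up for illustration, and the sample lines are copied from the `grep` output above with the whitespace simplified.\n",
+    "\n",
+    "```python\n",
+    "from collections import Counter\n",
+    "\n",
+    "# Four lines from the grep output above (tabs replaced by spaces).\n",
+    "SAMPLE = '''\n",
+    "    10000b50: 2c d2 00 7c  dcbt    0,r26\n",
+    "    10000b58: ec b9 00 7c  dcbtst  0,r23\n",
+    "    10000b80: 2c d2 00 7c  dcbt    0,r26\n",
+    "    10000e64: 2c 92 00 7c  dcbt    0,r18\n",
+    "'''\n",
+    "\n",
+    "def count_prefetch_ops(disassembly):\n",
+    "    '''Count mnemonics starting with dcb, at most one per disassembly line.'''\n",
+    "    counts = Counter()\n",
+    "    for line in disassembly.splitlines():\n",
+    "        for token in line.split():\n",
+    "            if token.startswith('dcb'):\n",
+    "                counts[token] += 1\n",
+    "                break  # only the first match on a line is the mnemonic\n",
+    "    return counts\n",
+    "\n",
+    "print(dict(count_prefetch_ops(SAMPLE)))\n",
+    "```\n",
+    "\n",
+    "For a whole binary, pass the contents of the disassembly file instead, for example `count_prefetch_ops(open('poisson2d.dis').read())`. A binary without software prefetching should then yield an empty result."
+   ]
+  },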
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Changing Values of DSCR via compiler flags\n",
+    "\n",
+    "This task requires the IBM XL compiler. It should already be in your environment.\n",
+    "\n",
+    "\n",
+    "We saw the impact of software prefetching in the previous subsection. \n",
+    "In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance. \n",
+    "In this exercise we look at compiler options that modify the DSCR value, which controls the aggressiveness of hardware prefetching. The DSCR can also be used to turn off hardware prefetching. \n",
+    "\n",
+    "The IBM XL compiler has an option `-qprefetch=dscr=<val>` that can be used for this purpose.\n",
+    "Compiling with `-qprefetch=dscr=1` turns off the prefetcher. Other values, such as `-qprefetch=dscr=4` or `-qprefetch=dscr=7`, control the aggressiveness of prefetching.\n",
+    "\n",
+    "For this exercise we use `make CC=xlc_r` to illustrate the performance impact.\n",
+    "    \n",
+    "\n",
+    "**Task**: Generate an XL-compiled binary using the following cells. After you've generated a baseline, start editing the `Makefile`: add `-qprefetch=dscr=1` to the `CFLAGS`, rebuild the application, and note the performance. Which one is faster? \n",
+    "\n",
+    "In general, applications benefit from the default setting of the hardware DSCR register (`-qprefetch=dscr=0`). However, certain applications also benefit from turning prefetching off. \n",
+    "\n",
+    "Note that suitable DSCR values depend strongly on the application: a value that works well for Application A may not help Application B."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Measure the performance of the application compiled with XL at the default DSCR value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 117,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 3,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Job <5344> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 1000 iterations on 1000 x 100 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249743\n",
-            "  100, 0.210080\n",
-            "  200, 0.184635\n",
-            "  300, 0.166526\n",
-            "  400, 0.152783\n",
-            "  500, 0.141890\n",
-            "  600, 0.132978\n",
-            "  700, 0.125511\n",
-            "  800, 0.119142\n",
-            "  900, 0.113632\n",
-            "Calculate current execution.\n",
-            "    0, 0.249743\n",
-            "  100, 0.210080\n",
-            "  200, 0.184635\n",
-            "  300, 0.166526\n",
-            "  400, 0.152783\n",
-            "  500, 0.141890\n",
-            "  600, 0.132978\n",
-            "  700, 0.125511\n",
-            "  800, 0.119142\n",
-            "  900, 0.113632\n",
-            "1000x100: Ref:   1.9872 s, This:   0.2385 s, speedup:     8.33\n"
-          ]
-        }
-      ],
-      "source": [
-        "!eval $SC18_SUBMIT_CMD ./poisson2d 1000 1000"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  poisson2d.c -o poisson2d  -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24927> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "2.26user 0.00system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+477minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=xlc_r -B poisson2d\n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Measure the performance of the application compiled with XL with hardware prefetching turned off (`-qprefetch=dscr=1`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "What is the best performance you can reach by setting the number of threads via `OMP_NUM_THREADS=N` with `N` being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.  \n",
-        "We added `--bind none` to prevent `jsrun`, the scheduler of Ascent, from overlaying binding options. Also, we use `-c ALL_CPUS` to make all CPUs on the compute nodes available to you."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  -qprefetch=dscr=1 poisson2d.c -o poisson2d_dscr  -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24929> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.58user 0.00system 0:04.59elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "0inputs+0outputs (0major+476minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_dscr CC=xlc_r -B\n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Does the hardware prefetcher help this application? How much impact do you see when you turn off the hardware prefetcher?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "\n",
+    "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
+    "2. https://www.gnu.org/software/gcc/projects/prefetch.html\n",
+    "3. https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 3: OpenMP\n",
+    "<a name=\"task3\"></a>\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-threaded processes to certain cores on the resulting application performance. We do this study for both the GCC and XL compilers in order to learn about the appropriate options that need to be used.\n",
+    "\n",
+    "First, we need to change directory to that of Task3. For Task 3, `poisson2d.c` is modified to additionally invoke `poisson2d_reference`, an exact copy of the main Jacobi loop. We parallelize only the main loop, not `poisson2d_reference`. The speedup is the performance gain of the main loop compared to the reference loop."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 22,
-      "metadata": {
-        "exercise": "task"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Job <5379> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
-            "Calculate reference solution and time with serial CPU execution.\n",
-            "    0, 0.249995\n",
-            "  100, 0.248997\n",
-            "  200, 0.248007\n",
-            "  300, 0.247025\n",
-            "  400, 0.246050\n",
-            "  500, 0.245084\n",
-            "  600, 0.244124\n",
-            "  700, 0.243173\n",
-            "  800, 0.242228\n",
-            "  900, 0.241291\n",
-            "Calculate current execution.\n",
-            "    0, 0.249995\n",
-            "  100, 0.248997\n",
-            "  200, 0.248007\n",
-            "  300, 0.247025\n",
-            "  400, 0.246050\n",
-            "  500, 0.245084\n",
-            "  600, 0.244124\n",
-            "  700, 0.243173\n",
-            "  800, 0.242228\n",
-            "  900, 0.241291\n",
-            "1000x1000: Ref:   2.3303 s, This:   2.8446 s, speedup:     0.82\n"
-          ]
-        }
-      ],
-      "source": [
-        "!eval OMP_NUM_THREADS=1 $SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task3\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd ../Task3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: Implement OpenMP Pragmas; Compilation\n",
+    "\n",
+    "**Task**: Please add the correct OpenMP directives to `poisson2d.c` and the compilation flags to the `Makefile` to enable OpenMP with the GCC and XL compilers.\n",
+    "\n",
+    "* **Directives**: Look at the TODOs in [`poisson2d.c`](poisson2d.c) to add OpenMP parallelism. The pragma in question is `#pragma omp parallel for`; in one place it needs to be `#pragma omp parallel for reduction(max:error)` \u2013\u00a0can you guess where?\n",
+    "* **Compilation**: Please add compilation flags enabling OpenMP in GCC and XL to the `Makefile`. For GCC, we need to add `-fopenmp` and the application needs to be linked with `-lgomp`. For XL, we need to add `-qsmp=omp` to the list of compilation flags.\n",
+    "\n",
+    "Afterwards, compile and run the application with the following commands."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "### Part B: Bindings\n",
-        "\n",
-        "Different CPU architectures and models come with different configuration of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!\n",
-        "\n",
-        "There are applications which can be used to determine the configuration of the processor. Among those are:\n",
-        "\n",
-        "* `lscpu`: Can be used to determine the number of sockets, number of cores, and numb of threads. It gives a very good overview and is available on most Linux systems.\n",
-        "* `ppc64_cpu --smt`: Specifically for POWER, this tool can give information about the number of simulations threads running per core (*SMT*, Simulataion Multi-Threading).\n",
-        "\n",
-        "Run `ppc64_cpu --smt` to find out about the threading configuration of Ascent!"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -c -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp   poisson2d_reference.c -o poisson2d_reference.o -lm\n",
+      "gcc -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp  poisson2d.c poisson2d_reference.o -o poisson2d  -lm \n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The command to submit a job to the batch system is stored in the environment variable `$SC19_SUBMIT_CMD`; use it together with `eval`. The following cell shows how to invoke the application through the batch system."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 48,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Job <5076> is submitted to default queue <batch>.\n",
-            "<<Waiting for dispatch ...>>\n",
-            "<<Starting on login1>>\n",
-            "SMT=4\n"
-          ]
-        }
-      ],
-      "source": [
-        "!eval $SC18_SUBMIT_CMD ppc64_cpu --smt"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24951> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1000x1000: Ref:   4.7430 s, This:   3.9363 s, speedup:     1.20\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval $SC19_SUBMIT_CMD ./poisson2d 1000 1000 1000"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In order to run the parallel application, we need to set the number of threads using `OMP_NUM_THREADS`.\n",
+    "What is the best performance you can reach by setting `OMP_NUM_THREADS=N`, with `N` being the number of threads? Feel free to play around with the command in the following cell: replace the placeholder `N` with a number, for example 1 for a single thread.  \n",
+    "We added `--bind none` to prevent `jsrun`, the job launcher of Ascent, from overlaying its own binding options. Also, we use `-c ALL_CPUS` to make all CPUs of the compute node available to you."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {
+    "exercise": "task"
+   },
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "There are more sources information available\n",
-        "\n",
-        "* `/proc/cpuinfo`: Holds information about virtual cores, including model and clock speed. Available on most Linux system. Usually used together with `cat`\n",
-        "* `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`: Holds information about thread siblings for given CPU core (`cpu0` in this case). Use it to find out which thread is mapped to which core."
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24945> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "\n",
+      "libgomp: Invalid value for environment variable OMP_NUM_THREADS\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1000x1000: Ref:   2.1046 s, This:   2.4171 s, speedup:     0.87\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval OMP_NUM_THREADS=N $SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Bindings\n",
+    "\n",
+    "Different CPU architectures and models come with different configurations of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!\n",
+    "\n",
+    "There are applications which can be used to determine the configuration of the processor. Among those are:\n",
+    "\n",
+    "* `lscpu`: Can be used to determine the number of sockets, number of cores, and number of threads. It gives a very good overview and is available on most Linux systems.\n",
+    "* `ppc64_cpu --smt`: Specifically for POWER, this tool can give information about the number of simultaneous threads running per core (*SMT*, Simultaneous Multi-Threading).\n",
+    "\n",
+    "Run `ppc64_cpu --smt` to find out about the threading configuration of Ascent!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": 49,
-      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "0-3\n",
-            "4-7\n"
-          ]
-        }
-      ],
-      "source": [
-        "!cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n",
-        "!cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24465> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "SMT=4\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval $SC19_SUBMIT_CMD ppc64_cpu --smt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are more sources of information available:\n",
+    "\n",
+    "* `/proc/cpuinfo`: Holds information about virtual cores, including model and clock speed. Available on most Linux systems. Usually read with `cat`.\n",
+    "* `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`: Holds information about the thread siblings of a given CPU core (`cpu0` in this case). Use it to find out which thread is mapped to which core."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "There are various environment variables available within OpenMP (and GCC) to specify binding of threads to cores. See, for instance, the [online documentation of GCC libgomp](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html). Examples are `OMP_PLACES` or `GOMP_CPU_AFFINITY`.\n",
-        "\n",
-        "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
-        "\n",
-        "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n",
-        "\n",
-        "What's your maximum speedup?"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24949> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "0-3\n",
+      "Job <24950> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "4-7\n"
+     ]
+    }
+   ],
+   "source": [
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n",
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "OpenMP defines environment variables that specify the binding of threads to cores and that work across compilers; see, for instance, the [`OMP_PLACES` environment variable](https://www.openmp.org/spec-html/5.0/openmpse53.html). In addition, there is a GNU-specific variable to control affinity, `GOMP_CPU_AFFINITY`. It only affects binaries built with GCC, but serves the same purpose as `OMP_PLACES`.\n",
+    "\n",
+    "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
+    "\n",
+    "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n",
+    "\n",
+    "What's your maximum speedup?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {
+    "exercise": "task"
+   },
+   "outputs": [
     {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "exercise": "task"
-      },
-      "outputs": [],
-      "source": [
-        "!eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=\"X,Y,Z,A\" OMP_NUM_THREADS=4 $$SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/usr/bin/sh: OMP_PLACES}={X},{Y},{Z},{A}: command not found\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval OMP_DISPLAY_ENV=true OMP_PLACES=\"{X},{Y},{Z},{A}\" OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "exercise": "task"
+   },
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "#### References\n",
-        "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=\"X,Y,Z,A\" OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Great!\n",
+    "\n",
+    "If you still have time: The same experiments can be repeated with the IBM XL compiler.\n",
+    "The compiler flag that enables OpenMP parallelism for XL is `-qsmp=omp`.\n",
+    "\n",
+    "**Task**: Add the OpenMP flag to the `Makefile`, build the XL binary with OpenMP, run the application with various numbers of threads, and note the speedup."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "[Back to Top](#top)\n",
-        "\n",
-        "---"
-      ]
-    },
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r -c -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp    poisson2d_reference.c -o poisson2d_reference.o -lm \n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "xlc_r -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS   time ./poisson2d\n",
+      "Job <24956> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "1000x1000: Ref:   5.6783 s, This:   2.6528 s, speedup:     2.14\n",
+      "21.56user 6.18system 0:08.37elapsed 331%CPU (0avgtext+0avgdata 23040maxresident)k\n",
+      "3200inputs+0outputs (2major+1098minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=xlc_r -B run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the parallel application with a varying number of threads (`OMP_NUM_THREADS`) and note the performance improvement."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {
+    "exercise": "task"
+   },
+   "outputs": [
     {
-      "cell_type": "markdown",
-      "metadata": {},
-      "source": [
-        "# Survey<a name=\"survey\"></a>\n",
-        "\n",
-        "Please rememeber to take some time and fill out the [survey](http://bit.ly/sc18-eval)."
-      ]
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <23926> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1000x1000: Ref:   2.3932 s, This:   2.7175 s, speedup:     0.88\n"
+     ]
     }
-  ],
-  "metadata": {
-    "kernelspec": {
-      "display_name": "Python 3",
-      "language": "python",
-      "name": "python3"
-    },
-    "language_info": {
-      "codemirror_mode": {
-        "name": "ipython",
-        "version": 3
-      },
-      "file_extension": ".py",
-      "mimetype": "text/x-python",
-      "name": "python",
-      "nbconvert_exporter": "python",
-      "pygments_lexer": "ipython3",
-      "version": "3.6.7"
+   ],
+   "source": [
+    "!eval OMP_NUM_THREADS=N $SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we repeat the thread-binding exercise for the XL binary. `OMP_PLACES` is a standard OpenMP variable, so it applies to the XL binary as well. `GOMP_CPU_AFFINITY` is specific to GCC binaries and cannot be used to set the binding here.\n",
+    "\n",
+    "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
+    "\n",
+    "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n",
+    "\n",
+    "We are mixing Python with Bash (`!`) here, so don't get confused: because of this, Bash environment variables need to be written with two dollar signs (`$$`).\n",
+    "\n",
+    "What's your maximum speedup?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "exercise": "task"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Affinity: {X},{Y},{Z},{A}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1587-117 The string for the OpenMP environment variable 'OMP_PLACES' contains unexpected or invalid text.  OpenMP environment variable ignored. \n",
+      "  OMP_PLACES='cores(44)'\n",
+      "1000x1000: Ref:   2.0988 s, This:   0.6556 s, speedup:     3.20\n",
+      "Affinity: {P},{Q},{R},{S}\n",
+      "<<Waiting for dispatch ...>>\n"
+     ]
     }
+   ],
+   "source": [
+    "for affinity in [\"{X},{Y},{Z},{A}\", \"{P},{Q},{R},{S}\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true OMP_PLACES=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000  | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Likewise, we see a higher speedup when we bind the threads to different cores rather than to a single core. This hands-on illustrates that, apart from compiler-level tuning, system-level tuning is equally important to obtain performance improvements.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html\n",
+    "2. https://www.openmp.org/spec-html/5.0/openmpse53.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Survey<a name=\"survey\"></a>\n",
+    "\n",
+    "Please remember to take some time and fill out the [survey](http://bit.ly/sc19-eval)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
   },
-  "nbformat": 4,
-  "nbformat_minor": 2
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
 }
\ No newline at end of file
diff --git a/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization.ipynb b/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization.ipynb
index 673f2cf2e35b60f1bb4eb271146b5e98c4e752ba..8453a7bb17b245549ca29fae91a4ff67f5f7b275 100644
--- a/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization.ipynb
+++ b/3-Optimizing_POWER/Handson/.master/HandsOnPerformanceOptimization.ipynb
@@ -22,7 +22,7 @@
     "\n",
     "You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.\n",
     "\n",
-    "If you want you also can get a [terminal](/terminals/1) in your browser.\n",
+    "If you want, you can also get a terminal in your browser; just open it via the »New Launcher« button (`+`).\n",
     "\n",
     "## Terminal fallback\n",
     "\n",
@@ -37,7 +37,28 @@
    "source": [
     "## Setup\n",
     "\n",
-    "This hands-on session requires use of GCC 9.2.0. By loading the `sc19/handson2` module before invoking this Notebook, we took care of also loading GCC 9.2.0."
+    "We are using some very recent compiler features and therefore use GCC 9.2.0. It should already be in your environment. Let's check!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc (GCC) 9.2.0\n",
+      "Copyright (C) 2019 Free Software Foundation, Inc.\n",
+      "This is free software; see the source for copying conditions.  There is NO\n",
+      "warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!gcc --version"
    ]
   },
   {
@@ -107,14 +128,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 70,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "/autofs/nccsopen-svm1_home/archanaravindar/SC19-Tutorial/Task1/Solutions\n"
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task1\n"
      ]
     }
    ],
@@ -128,12 +149,14 @@
    "source": [
     "### Part A: `-Ofast` vs. `-O3`\n",
     "\n",
-    "We are to compare the performance of the binary being compiled with `-Ofast` optimization and with `-O3` optimization. Use `Makefile` for this task. At present, Makefile specifies targets poisson2d_O3 and poisson2d_Ofast. Add `-O3` as the optimization flag for poisson2d_O3 target. Carry out the steps as outlined in `Task1`. Compile the code using `make -f Makefile.gcc` and run with targets 'run', 'runstats' and 'perfgenerate'. "
+    "We compare the performance of the binary compiled with `-Ofast` optimization to that of the binary compiled with `-O3` optimization. As in the previous task, we use a `Makefile` for compilation. The `Makefile` targets `poisson2d_O3` and `poisson2d_Ofast` are already prepared.\n",
+    "\n",
+    "**TASK**: Add `-O3` as the optimization flag for the `poisson2d_O3` target by using the corresponding `CFLAGS` definition. There are notes relating to this Task 1 in the header of the `Makefile`. Compile the code using `make` as indicated below and run it with the `make` targets `run`, `run_perf` and `run_perf_recrep`."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 52,
+   "execution_count": 84,
    "metadata": {
     "collapsed": true
    },
@@ -142,8 +165,8 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "gcc -std=c99 -mcpu=power9  -O3   -DUSE_DOUBLE  -mvsx -maltivec   poisson2d.c poisson2d_reference.o -o poisson2d_O3  -lm\n",
-      "cp poisson2d_O3 poisson2d\n"
+      "gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n"
      ]
     }
    ],
@@ -153,7 +176,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 53,
+   "execution_count": 73,
    "metadata": {},
    "outputs": [
     {
@@ -161,7 +184,7 @@
      "output_type": "stream",
      "text": [
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24738> is submitted to default queue <batch>.\n",
+      "Job <24897> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -176,8 +199,8 @@
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "4.70user 0.00system 0:04.71elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
-      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n"
+      "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
      ]
     }
    ],
@@ -189,12 +212,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The makefile has been modified to include options 'runstats' that invokes the GNU _perf_ tool to print out details of the number of PPC instructions executed and the number of cycles taken by POWER9 to execute the program. You may wish to add a similar clause to measure other raw events using _perf_."
+    "Let's have a look at the output of the `Makefile` target `run_perf`. It invokes the Linux _perf_ tool to print the number of instructions executed and the number of cycles taken by POWER9 to execute the program. Feel free to add further counters to this call to _perf_."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 55,
+   "execution_count": 74,
    "metadata": {},
    "outputs": [
     {
@@ -202,7 +225,7 @@
      "output_type": "stream",
      "text": [
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
-      "Job <24739> is submitted to default queue <batch>.\n",
+      "Job <24898> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -220,37 +243,36 @@
       "\n",
       " Performance counter stats for './poisson2d':\n",
       "\n",
-      "       16199088860      cycles:u                                                    \n",
-      "       28463939531      instructions:u            #    1.76  insn per cycle                                            \n",
+      "       16264721613      cycles:u                                                    \n",
+      "       28463907825      instructions:u            #    1.75  insn per cycle                                            \n",
       "\n",
-      "       4.717812694 seconds time elapsed\n",
+      "       4.738444892 seconds time elapsed\n",
       "\n"
      ]
     }
    ],
    "source": [
-    "!make runstats"
+    "!make run_perf"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The makefile has been modified to include options 'runstats' that invokes the GNU _perf_ tool to print out details of the number of PPC instructions executed and the number of cycles taken by POWER9 to execute the program. You may wish to add a similar clause to measure other raw events using _perf_.\n",
-    "Next we run the makefile with target `perfgenerate` that prints the top routines of the application in terms of hotness. "
+    "Next, we run the `Makefile` target `run_perf_recrep`, which prints the hottest routines of the application by combining `perf record ./app` and `perf report`."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 56,
+   "execution_count": 75,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/archanaravindar//cycles.data ./poisson2d\n",
-      "Job <24740> is submitted to default queue <batch>.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24899> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -266,9 +288,9 @@
       "  800, 0.242228\n",
       "  900, 0.241291\n",
       "[ perf record: Woken up 3 times to write data ]\n",
-      "[ perf record: Captured and wrote 0.735 MB /gpfs/wolf/trn003/scratch/archanaravindar//cycles.data (18993 samples) ]\n",
-      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/archanaravindar//cycles.data  --stdio\n",
-      "Job <24741> is submitted to default queue <batch>.\n",
+      "[ perf record: Captured and wrote 0.739 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (19102 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24900> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "# To display the perf.data header info, please use --header/--header-only options.\n",
@@ -276,66 +298,61 @@
       "#\n",
       "# Total Lost Samples: 0\n",
       "#\n",
-      "# Samples: 18K of event 'cycles:u'\n",
-      "# Event count (approx.): 16162306943\n",
+      "# Samples: 19K of event 'cycles:u'\n",
+      "# Event count (approx.): 16254596654\n",
       "#\n",
-      "# Overhead  Command    Shared Object     Symbol                                  \n",
-      "# ........  .........  ................  ........................................\n",
+      "# Overhead  Command    Shared Object  Symbol                                  \n",
+      "# ........  .........  .............  ........................................\n",
       "#\n",
-      "    48.17%  poisson2d  poisson2d         [.] main\n",
-      "    26.08%  poisson2d  poisson2d         [.] 00000038.plt_call.fmax@@GLIBC_2.17\n",
-      "    15.84%  poisson2d  libm-2.17.so      [.] __fmaxf\n",
-      "     9.13%  poisson2d  libc-2.17.so      [.] __memcpy_power7\n",
-      "     0.72%  poisson2d  libm-2.17.so      [.] __exp_finite\n",
-      "     0.01%  poisson2d  poisson2d         [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
-      "     0.01%  poisson2d  libm-2.17.so      [.] __GI___exp\n",
-      "     0.01%  poisson2d  ld-2.17.so        [.] do_lookup_x\n",
-      "     0.01%  poisson2d  ld-2.17.so        [.] check_match.10253\n",
-      "     0.01%  poisson2d  ld-2.17.so        [.] _dl_lookup_symbol_x\n",
-      "     0.01%  poisson2d  ld-2.17.so        [.] strcmp\n",
-      "     0.00%  poisson2d  [unknown]         [k] 0x000020000002415c\n",
-      "     0.00%  poisson2d  ld-2.17.so        [.] dl_main\n",
-      "     0.00%  poisson2d  ld-2.17.so        [.] _start\n",
+      "    65.50%  poisson2d  poisson2d      [.] 00000038.plt_call.fmax@@GLIBC_2.17\n",
+      "    21.21%  poisson2d  poisson2d      [.] main\n",
+      "     9.18%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "     3.28%  poisson2d  libm-2.17.so   [.] __fmaxf\n",
+      "     0.74%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.01%  poisson2d  libm-2.17.so   [.] __GI___exp\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _wordcopy_fwd_aligned\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
       "\n",
       "\n",
       "#\n",
-      "# (Tip: Show individual samples with: perf script)\n",
+      "# (Tip: Limit to show entries above 5% only: perf report --percent-limit 5)\n",
       "#\n"
      ]
     }
    ],
    "source": [
-    "# perfgenerate displays the top hot routines \n",
-    "!make perfgenerate"
+    "# run_perf_recrep displays the top hot routines \n",
+    "!make run_perf_recrep"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**TASK**: Now Add the optimization flag `Ofast` in Makefile for target poisson2d_Ofast. Run the makefile with the targets clean, run, runstats and perfgenerate.\n",
-    "\n",
-    "Compare performance of O3 and Ofast binaries using the following makefile options- run, runstats. \n",
-    "\n",
-    "Verify by generating the profiles of both O3 and Ofast binaries to understand the cause for performance improvement. \n",
+    "**TASK**: Now add the optimization flag `-Ofast` to the `CFLAGS` for target `poisson2d_Ofast`. Compile the program with the target `poisson2d_Ofast`, then run and analyse it as before with `run`, `run_perf` and `run_perf_recrep`.\n",
     "\n",
-    "What difference do you see? \n",
-    "\n"
+    "What difference do you see?"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 58,
+   "execution_count": 76,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec  poisson2d.c poisson2d_reference.o -o poisson2d_Ofast  -lm\n",
-      "cp poisson2d_Ofast poisson2d\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24743> is submitted to default queue <batch>.\n",
+      "Job <24901> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -350,7 +367,7 @@
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "2.40user 0.00system 0:02.41elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "2.41user 0.00system 0:02.41elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
       "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
      ]
     }
@@ -362,16 +379,14 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "exercise": "Solutions"
-   },
+   "metadata": {},
    "source": [
-    "Measure cycles, instructions."
+    "Again, run a `perf`-instrumented version:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 59,
+   "execution_count": 77,
    "metadata": {},
    "outputs": [
     {
@@ -379,7 +394,7 @@
      "output_type": "stream",
      "text": [
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
-      "Job <24744> is submitted to default queue <batch>.\n",
+      "Job <24902> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -397,38 +412,36 @@
       "\n",
       " Performance counter stats for './poisson2d':\n",
       "\n",
-      "        8261506450      cycles:u                                                    \n",
-      "       12013095395      instructions:u            #    1.45  insn per cycle                                            \n",
+      "        8258991976      cycles:u                                                    \n",
+      "       12013091172      instructions:u            #    1.45  insn per cycle                                            \n",
       "\n",
-      "       2.413121525 seconds time elapsed\n",
+      "       2.408703909 seconds time elapsed\n",
       "\n"
      ]
     }
    ],
    "source": [
-    "!make runstats"
+    "!make run_perf"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "exercise": "Solutions"
-   },
+   "metadata": {},
    "source": [
-    "Generate the list of Top routines in terms of hotness."
+    "Generate the list of top routines in terms of hotness:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 60,
+   "execution_count": 78,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/archanaravindar//cycles.data ./poisson2d\n",
-      "Job <24745> is submitted to default queue <batch>.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24903> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -444,9 +457,9 @@
       "  800, 0.242228\n",
       "  900, 0.241291\n",
       "[ perf record: Woken up 2 times to write data ]\n",
-      "[ perf record: Captured and wrote 0.382 MB /gpfs/wolf/trn003/scratch/archanaravindar//cycles.data (9722 samples) ]\n",
-      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/archanaravindar//cycles.data  --stdio\n",
-      "Job <24746> is submitted to default queue <batch>.\n",
+      "[ perf record: Captured and wrote 0.382 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (9728 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24904> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "# To display the perf.data header info, please use --header/--header-only options.\n",
@@ -455,36 +468,36 @@
       "# Total Lost Samples: 0\n",
       "#\n",
       "# Samples: 9K of event 'cycles:u'\n",
-      "# Event count (approx.): 8264083365\n",
+      "# Event count (approx.): 8268811890\n",
       "#\n",
       "# Overhead  Command    Shared Object  Symbol                                  \n",
       "# ........  .........  .............  ........................................\n",
       "#\n",
-      "    81.24%  poisson2d  poisson2d      [.] main\n",
-      "    17.83%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "    81.12%  poisson2d  poisson2d      [.] main\n",
+      "    17.97%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
       "     0.79%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
-      "     0.06%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
-      "     0.02%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
-      "     0.02%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
-      "     0.01%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.02%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
       "     0.01%  poisson2d  libc-2.17.so   [.] vfprintf@@GLIBC_2.17\n",
-      "     0.01%  poisson2d  libc-2.17.so   [.] __memset_power8\n",
       "     0.01%  poisson2d  libc-2.17.so   [.] _dl_addr\n",
-      "     0.01%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
-      "     0.00%  poisson2d  ld-2.17.so     [.] open_verify\n",
-      "     0.00%  poisson2d  ld-2.17.so     [.] strlen\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] open_path\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] init_tls\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
       "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
       "\n",
       "\n",
       "#\n",
-      "# (Tip: Profiling branch (mis)predictions with: perf record -b / perf report)\n",
+      "# (Tip: For tracepoint events, try: perf report -s trace_fields)\n",
       "#\n"
      ]
     }
    ],
    "source": [
-    "# perfgenerate creates a perf.data file \n",
-    "!make perfgenerate"
+    "!make run_perf_recrep"
    ]
   },
   {
@@ -500,7 +513,7 @@
    "source": [
     "####  Interpretation\n",
     "\n",
-    "Depending on the application requirement, if a high precision of results is not mandatory, the users can compile an application with `-Ofast` which enables `–ffast-math` option that implements the same math function in a relaxed manner very similar to how general mathematical expressions are implemented and avoids the overhead of calling a function from the math library. Comparing the files, you will see that the `-Ofast` binary natively implements the `fmax` function using instructions available in the hardware. The `-O3` binary makes a library call to compute `fmax` to follow a stricter _IEEE_ requirement for accuracy."
+    "If the application does not require high precision of results, it can be compiled with `-Ofast`, which enables the `-ffast-math` option. With this option the compiler may implement math functions in a relaxed manner, much like ordinary arithmetic expressions, and so avoid the overhead of calling a function from the math library. Comparing the two profiles, you will see that the `-Ofast` binary implements the `fmax` function natively using instructions available in the hardware, whereas the `-O3` binary makes a library call to compute `fmax` in order to follow a stricter _IEEE_ requirement for accuracy."
    ]
   },
   {
@@ -511,66 +524,71 @@
     "\n",
     "For the first level of optimization we see that `Ofast` cut the execution time of the `O3` binary by almost half.\n",
     "\n",
-    "We can optimize the performance further by using profile directed feedback optimization.\n",
+    "We can optimize the performance further by using profile-directed feedback optimization.\n",
+    "\n",
+    "To compile using profile-directed feedback with the GCC compiler, we need to build the application in three stages:\n",
+    "\n",
+    "1. Instrument binary;\n",
+    "2. Run binary with training, gather profile information;\n",
+    "3. Use profile information to generate optimized binary.\n",
     "\n",
-    "To compile using profile directed feedback with the GCC compiler we need to build the appplication in three stages- Instrument binary; Run binary with training (smaller input) and gather profile information; Use profile information to generate optimized binary. For this purpose we have defined makefile target runpdf that depends on two binaries- poisson2d_train and poisson2d_ref.\n",
     "\n",
-    "make runpdf will inturn make poisson2d_train and poisson2d_ref\n",
+    "Step 1 is achieved by compiling the binary with the correct flag – `-fprofile-generate`. In our case, we need to specify an output location, which should be `$(SC19_DIR_SCRATCH)`.\n",
     "\n",
-    "poisson2d_train: will inturn be built by \n",
-    " - cleans up old profile data if any (rm \\$(SC19\\_DIR\\_SCRATCH)/*gcda)\n",
-    " - builds a training binary using -fprofile-generate=$(SC19_DIR_SCRATCH) along with the usual optimization flags \n",
-    " - This instructs the compiler to record hot path information.\n",
-    " - Run the training binary with a smaller input size; \n",
-    " - you should see a `.gcda` file generated which stores hot path information for further optimization by the compiler in the path specified in profile-generate  \n",
-    "   option\n",
-    " \n",
-    " - poisson2d_ref: will be built using the profile collected in \\$(SC19\\_DIR\\_SCRATCH)/*gcda files \n",
-    " - This is facilitated by the option -fprofile-use=\\$(SC19\\_DIR\\_SCRATCH) option to be added along with optimization flags\n",
-    " - Rebuilding the application with this flag builds the final binary that can be run\n",
+    "Step 2 consists of a usual, albeit shorter, run of the instrumented binary. The run can be very short, though the parameters need to be representative of the actual run. After the binary has run, an output file (with file extension `.gcda`) is written to the directory specified during compilation.\n",
     "\n",
-    "**TASK**: Run the following steps to generate the final optimized binary using the following steps. Compare the performance of the Ofast binary with and without profile directed feedback. Run the program poisson2d with increasing values of xiter, yiter, ziter and compare the performance. "
+    "For Step 3, the binary is compiled once again, but this time using the `.gcda` profile just generated. The corresponding flag is `-fprofile-use`, which we set to `$(SC19_DIR_SCRATCH)` as well.\n",
+    "\n",
+    "In the `Makefile` at hand, we have already prepared these steps for you in the form of two targets.\n",
+    "\n",
+    "* `poisson2d_train`: Will compile the instrumented binary used to gather the profile (Step 1)\n",
+    "* `poisson2d_ref`: Will take a generated profile and compile a new, optimized binary\n",
+    "\n",
+    "Through Makefile dependencies, a profiling run is launched between these two targets.\n",
+    "\n",
+    "**TASK**: Edit the [Makefile](`Makefile`) and add the `-fprofile-*` flags to the `CFLAGS` of `poisson2d_train` and\n",
+    "`poisson2d_ref` as outlined in the file.\n",
+    "\n",
+    "After that, you may launch them with the following cells (`gen_profile` is a meta-target and uses `poisson2d_train` and `poisson2d_ref`). If you need to clean the generated profile, you may use `make clean_profile`."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 61,
+   "execution_count": 79,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "rm -rf /gpfs/wolf/trn003/scratch/archanaravindar//*gcda\n",
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec  -fprofile-generate=/gpfs/wolf/trn003/scratch/archanaravindar/ poisson2d.c  -o poisson2d_train  -lm \n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
-      "Job <24747> is submitted to default queue <batch>.\n",
+      "Job <24905> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
       "Calculate current execution.\n",
       "    0, 0.249490\n",
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec  -fprofile-use=/gpfs/wolf/trn003/scratch/archanaravindar/  poisson2d.c  -o poisson2d_ref  -lm  \n",
-      "cp poisson2d_ref poisson2d\t\n"
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_ref -lm \n",
+      "cp poisson2d_ref poisson2d\n"
      ]
     }
    ],
    "source": [
-    "!make -B runpdf"
+    "!make gen_profile"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "exercise": "Solutions"
-   },
+   "metadata": {},
    "source": [
-    "Measure Execution time."
+    "If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 62,
+   "execution_count": 80,
    "metadata": {},
    "outputs": [
     {
@@ -578,7 +596,7 @@
      "output_type": "stream",
      "text": [
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24748> is submitted to default queue <batch>.\n",
+      "Job <24906> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -593,27 +611,34 @@
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "2.30user 0.00system 0:02.30elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
-      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
+      "2.28user 0.01system 0:02.30elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n"
      ]
     }
    ],
    "source": [
-    "!make run "
+    "!make run"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {
-    "exercise": "Solutions"
+    "exercise": "solution"
    },
    "source": [
-    "Measure cycles, instructions."
+    "Great! It is! In our tests, this shaved off another 5%."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's also measure instructions and cycles:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 64,
+   "execution_count": 81,
    "metadata": {},
    "outputs": [
     {
@@ -621,7 +646,7 @@
      "output_type": "stream",
      "text": [
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
-      "Job <24749> is submitted to default queue <batch>.\n",
+      "Job <24907> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -639,16 +664,16 @@
       "\n",
       " Performance counter stats for './poisson2d':\n",
       "\n",
-      "        7907873140      cycles:u                                                    \n",
-      "       12253084984      instructions:u            #    1.55  insn per cycle                                            \n",
+      "        7925983538      cycles:u                                                    \n",
+      "       12253080719      instructions:u            #    1.55  insn per cycle                                            \n",
       "\n",
-      "       2.308084004 seconds time elapsed\n",
+      "       2.313471365 seconds time elapsed\n",
       "\n"
      ]
     }
    ],
    "source": [
-    "!make runstats"
+    "!make run_perf"
    ]
   },
   {
@@ -658,168 +683,29 @@
     "What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations)"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "exercise": "Solutions"
-   },
-   "source": [
-    "For the problem size `NX=NY=NITER=1000` you will see that profile directed feedback improves the performance further by 4-5%. "
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Part C: Compiler annotations/Remarks\n",
     "\n",
-    "Usually all compilers provide an option to emit annotations or remarks by the compiler. These remarks summarize the optimizations done in detail, the location in source where these optimizations were done. There exist options that also indicate optimizations that were missed and the reason why they could not be done. \n",
-    "\n",
-    "To generate compiler annotations using GCC, start with Makefile.gcc. Add the flag -fopt-info-all=`$(SC19_DIR_SCRATCH)/filename` to CFLAGS. All annotations are stored in opt-record in the $(SC19_DIR_SCRATCH) which can be viewed as a text file. Make target view displays the contents of the file on the output screen.\n",
+    "Most compilers provide an option to emit annotations or remarks. These remarks summarize in detail which optimizations were performed and where in the source they were applied. There are also options that report optimizations that were missed and the reason why they could not be applied.\n",
     "\n",
-    "Specifically, if you want to view only the missed options, use the option -fopt-info-missed instead of -fopt-info-all.\n",
+    "To generate compiler annotations using GCC, one uses `-fopt-info-all`. If you only want to see the missed optimizations, use the option `-fopt-info-missed` instead of `-fopt-info-all`. See also the [documentation of GCC regarding the flag](https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info).\n",
     "\n",
-    "**TASK**: Build the application using Makefile.gcc.record. This makefile generates the compiler annotations in $(SC19_DIR_SCRATCH)/opt-record. Using make file target view or vi, read the compiler annotations file to get an idea of the optimizations done.  "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 76,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec  -fopt-info-all=/gpfs/wolf/trn003/scratch/archanaravindar//opt-record poisson2d.c poisson2d_reference.o -o poisson2d_Ofast_record  -lm \n",
-      "cp poisson2d_Ofast_record poisson2d\n"
-     ]
-    }
-   ],
-   "source": [
-    "!make -B poisson2d_Ofast_record"
+    "**TASK**: Have a look at the `CFLAGS` of the `Makefile` target `poisson2d_Ofast_info`. Add the flag `-fopt-info-all` to the list of flags. This will print optimisation information to the terminal (GCC writes it to `stderr` by default). If you would rather write this information to a file, use – for example – `-fopt-info-all=$(SC19_DIR_SCRATCH)/filename`."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 73,
+   "execution_count": 82,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "poisson2d_reference.c:85:17: missed:   not inlinable: check_results/17 -> fprintf/20, function body not available\n",
-      "poisson2d_reference.c:70:31: missed:   not inlinable: poisson2d_reference/16 -> printf/18, function body not available\n",
-      "Unit growth for small function inlining: 145->145 (0%)\n",
-      "\n",
-      "Inlined 0 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_54 and *_57\n",
-      "consider run-time aliasing test between *_62 and *_67\n",
-      "consider run-time aliasing test between *_75 and *_78\n",
-      "consider run-time aliasing test between *_82 and *_87\n",
-      "poisson2d_reference.c:52:13: optimized: Loop 6 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:41:9: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d_reference.c:64:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:64:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:59:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:59:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:50:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_386, _378, _390);\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:45:90: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d_reference.c:43:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d_reference.c:33:6: note: vectorized 1 loops in function.\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_386, _378, _390);\n",
-      "poisson2d_reference.c:43:13: optimized: loop turned into non-loop; it never loops\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_386, _378, _390);\n",
-      "poisson2d_reference.c:70:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_134, error_144);\n",
-      "poisson2d_reference.c:81:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:81:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:83:35: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:83:35: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:76:5: note: vectorized 0 loops in function.\n",
-      "poisson2d_reference.c:85:17: missed: statement clobbers memory: fprintf (stderr.0_13, \"ERROR: A[%d][%d] = %f does not match %f (reference)\\n\", iy_62, ix_61, _63, _64);\n",
-      "poisson2d.c:57:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:51:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:47:20: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:157:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:156:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:155:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:154:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:145:9: missed:   not inlinable: main/32 -> check_results/38, function body not available\n",
-      "poisson2d.c:138:31: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:98:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:95:5: missed:   not inlinable: main/32 -> poisson2d_reference/37, function body not available\n",
-      "poisson2d.c:93:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:91:5: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:73:29: missed:   not inlinable: main/32 -> exp/34, function body not available\n",
-      "poisson2d.c:63:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:62:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:61:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:60:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "Unit growth for small function inlining: 234->234 (0%)\n",
-      "\n",
-      "Inlined 4 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_84 and *_87\n",
-      "consider run-time aliasing test between *_92 and *_97\n",
-      "consider run-time aliasing test between *_104 and *_107\n",
-      "consider run-time aliasing test between *_111 and *_115\n",
-      "poisson2d.c:120:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:85:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:103:25: missed: couldn't vectorize loop\n",
-      "poisson2d.c:103:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d.c:132:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:132:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:127:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:127:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:118:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_557, _553, _563);\n",
-      "poisson2d.c:107:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:107:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d.c:110:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d.c:83:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_542, 0, _545);\n",
-      "poisson2d.c:67:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:27: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d.c:69:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
-      "poisson2d.c:38:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_542, 0, _545);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_557, _553, _563);\n",
-      "poisson2d.c:110:13: optimized: loop turned into non-loop; it never loops\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _195 = strtol (_1, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _197 = strtol (_2, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _201 = strtol (_3, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _199 = strtol (_4, 0B, 10);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_155 = malloc (_7);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_157 = malloc (_7);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_159 = malloc (_7);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_161 = malloc (_7);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_542, 0, _545);\n",
-      "poisson2d.c:91:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_239, ny_221, nx_124);\n",
-      "poisson2d.c:93:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate reference solution and time with serial CPU execution.\"[0]);\n",
-      "poisson2d.c:95:5: missed: statement clobbers memory: poisson2d_reference (iter_max_239, 1.00000000000000008180305391403130954586231382563710212708e-5, Aref_225, Anew_227, nx_124, ny_221, rhs_236);\n",
-      "poisson2d.c:98:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_557, _553, _563);\n",
-      "poisson2d.c:138:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_245, error_242);\n",
-      "poisson2d.c:145:9: missed: statement clobbers memory: _118 = check_results (1, prephitmp_426, 1, _131, 1.00000000000000008180305391403130954586231382563710212708e-5, A_125, Aref_225, nx_124);\n",
-      "poisson2d.c:154:5: missed: statement clobbers memory: free (rhs_236);\n",
-      "poisson2d.c:155:5: missed: statement clobbers memory: free (Anew_227);\n",
-      "poisson2d.c:156:5: missed: statement clobbers memory: free (Aref_225);\n",
-      "poisson2d.c:157:5: missed: statement clobbers memory: free (A_125);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_146 = malloc (2000000);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_145 = malloc (2000000);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_144 = malloc (2000000);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_130 = malloc (2000000);\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all poisson2d.c poisson2d_reference.o -o poisson2d_Ofast_info  -lm\n",
       "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
       "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
       "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
@@ -849,843 +735,78 @@
       "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
       "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
       "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
-      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
-      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
-      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
-      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
-      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
-      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
-      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
-      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
-      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
-      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
-      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
-      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
-      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
-      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
-      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
-      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
-      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n",
-      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
-      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
-      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
-      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
-      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
-      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
-      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
-      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
-      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
-      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
-      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
-      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
-      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
-      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
-      "Unit growth for small function inlining: 207->207 (0%)\n",
-      "\n",
-      "Inlined 4 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_84 and *_87\n",
-      "consider run-time aliasing test between *_92 and *_97\n",
-      "consider run-time aliasing test between *_104 and *_107\n",
-      "consider run-time aliasing test between *_111 and *_115\n",
-      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
-      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
-      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
-      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
-      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
-      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
-      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
-      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
-      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
-      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
-      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
-      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
-      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
-      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
-      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
-      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
-      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
-      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n"
-     ]
-    }
-   ],
-   "source": [
-    "!cat \"$SC19_DIR_SCRATCH\"/opt-record"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**TASK**: \n",
-    "Modify Makefile.gcc.use to include the option -fopt-info-all=$(SC19_DIR_SCRATCH)/pgo-opt-record in CFLAGS.\n",
-    "Build the profile-directed feedback binary using the steps below; this generates the final optimized binary together with the annotations for the profile-directed feedback pass. Compare opt-record and pgo-opt-record and note the differences. Which file contains more annotations? How do the annotations in the two files differ from each other?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 74,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "rm -rf /gpfs/wolf/trn003/scratch/archanaravindar//*gcda\n",
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec  -fprofile-generate=/gpfs/wolf/trn003/scratch/archanaravindar/ poisson2d.c  -o poisson2d_train  -lm \n",
-      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
-      "Job <24750> is submitted to default queue <batch>.\n",
-      "<<Waiting for dispatch ...>>\n",
-      "<<Starting on login1>>\n",
-      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
-      "Calculate current execution.\n",
-      "    0, 0.249490\n",
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec  -fprofile-use=/gpfs/wolf/trn003/scratch/archanaravindar/ -fopt-info-all=/gpfs/wolf/trn003/scratch/archanaravindar//pgo-opt-record   poisson2d.c  -o poisson2d_ref_record  -lm \n",
-      "cp poisson2d_ref_record poisson2d\n"
-     ]
-    }
-   ],
-   "source": [
-    "!make runpdf.record"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 75,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "poisson2d_reference.c:85:17: missed:   not inlinable: check_results/17 -> fprintf/20, function body not available\n",
-      "poisson2d_reference.c:70:31: missed:   not inlinable: poisson2d_reference/16 -> printf/18, function body not available\n",
-      "Unit growth for small function inlining: 145->145 (0%)\n",
-      "\n",
-      "Inlined 0 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_54 and *_57\n",
-      "consider run-time aliasing test between *_62 and *_67\n",
-      "consider run-time aliasing test between *_75 and *_78\n",
-      "consider run-time aliasing test between *_82 and *_87\n",
-      "poisson2d_reference.c:52:13: optimized: Loop 6 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:41:9: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d_reference.c:64:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:64:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:59:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:59:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:50:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:45:90: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d_reference.c:43:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d_reference.c:33:6: note: vectorized 1 loops in function.\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:43:13: optimized: loop turned into non-loop; it never loops\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:70:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_134, error_144);\n",
-      "poisson2d_reference.c:64:9: note: considering unrolling loop 5 at BB 25\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:64:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:59:9: note: considering unrolling loop 4 at BB 23\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:59:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:50:9: note: considering unrolling loop 3 at BB 21\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:50:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 9 at BB 14\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:47:25: optimized: loop unrolled 3 times (header execution count 432180)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 7 at BB 11\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:41:9: note: considering unrolling loop 10 at BB 7\n",
-      "poisson2d_reference.c:33:6: note: considering unrolling loop 2 at BB 6\n",
-      "poisson2d_reference.c:37:25: note: considering unrolling loop 1 at BB 30\n",
-      "poisson2d_reference.c:81:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:81:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:83:35: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:83:35: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:76:5: note: vectorized 0 loops in function.\n",
-      "poisson2d_reference.c:85:17: missed: statement clobbers memory: fprintf (stderr.0_13, \"ERROR: A[%d][%d] = %f does not match %f (reference)\\n\", iy_62, ix_61, _63, _64);\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 3 at BB 5\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:81:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 1 at BB 8\n",
-      "poisson2d.c:57:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:51:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:47:20: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:157:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:156:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:155:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:154:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:145:9: missed:   not inlinable: main/32 -> check_results/38, function body not available\n",
-      "poisson2d.c:138:31: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:98:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:95:5: missed:   not inlinable: main/32 -> poisson2d_reference/37, function body not available\n",
-      "poisson2d.c:93:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:91:5: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:73:29: missed:   not inlinable: main/32 -> exp/34, function body not available\n",
-      "poisson2d.c:63:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:62:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:61:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:60:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "Unit growth for small function inlining: 234->234 (0%)\n",
-      "\n",
-      "Inlined 4 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_84 and *_87\n",
-      "consider run-time aliasing test between *_92 and *_97\n",
-      "consider run-time aliasing test between *_104 and *_107\n",
-      "consider run-time aliasing test between *_111 and *_115\n",
-      "poisson2d.c:120:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:85:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:103:25: missed: couldn't vectorize loop\n",
-      "poisson2d.c:103:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d.c:132:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:132:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:127:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:127:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:118:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:107:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:107:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d.c:110:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d.c:83:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:67:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:27: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d.c:69:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
-      "poisson2d.c:38:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:110:13: optimized: loop turned into non-loop; it never loops\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _195 = strtol (_1, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _197 = strtol (_2, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _201 = strtol (_3, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _199 = strtol (_4, 0B, 10);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_155 = malloc (_7);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_157 = malloc (_7);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_159 = malloc (_7);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_161 = malloc (_7);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:91:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_239, ny_221, nx_124);\n",
-      "poisson2d.c:93:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate reference solution and time with serial CPU execution.\"[0]);\n",
-      "poisson2d.c:95:5: missed: statement clobbers memory: poisson2d_reference (iter_max_239, 1.00000000000000008180305391403130954586231382563710212708e-5, Aref_225, Anew_227, nx_124, ny_221, rhs_236);\n",
-      "poisson2d.c:98:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:138:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_245, error_242);\n",
-      "poisson2d.c:145:9: missed: statement clobbers memory: _118 = check_results (1, ix_end_162, 1, _131, 1.00000000000000008180305391403130954586231382563710212708e-5, A_125, Aref_225, nx_124);\n",
-      "poisson2d.c:154:5: missed: statement clobbers memory: free (rhs_236);\n",
-      "poisson2d.c:155:5: missed: statement clobbers memory: free (Anew_227);\n",
-      "poisson2d.c:156:5: missed: statement clobbers memory: free (Aref_225);\n",
-      "poisson2d.c:157:5: missed: statement clobbers memory: free (A_125);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_146 = malloc (2000000);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_145 = malloc (2000000);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_144 = malloc (2000000);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_130 = malloc (2000000);\n",
-      "poisson2d.c:132:9: note: considering unrolling loop 7 at BB 47\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:132:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:127:9: note: considering unrolling loop 6 at BB 44\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:127:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:118:9: note: considering unrolling loop 5 at BB 40\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:118:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 13 at BB 27\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 9 at BB 24\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:107:9: note: considering unrolling loop 14 at BB 37\n",
-      "poisson2d.c:38:5: note: considering unrolling loop 4 at BB 35\n",
-      "poisson2d.c:103:25: note: considering unrolling loop 3 at BB 51\n",
-      "poisson2d.c:83:5: note: considering unrolling loop 2 at BB 18\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:83:5: optimized: loop unrolled 7 times (header execution count 99)\n",
-      "poisson2d.c:69:9: note: considering unrolling loop 11 at BB 9\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:69:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d.c:67:5: note: considering unrolling loop 1 at BB 14\n",
-      "poisson2d_reference.c:85:17: missed:   not inlinable: check_results/17 -> fprintf/20, function body not available\n",
-      "poisson2d_reference.c:70:31: missed:   not inlinable: poisson2d_reference/16 -> printf/18, function body not available\n",
-      "Unit growth for small function inlining: 145->145 (0%)\n",
-      "\n",
-      "Inlined 0 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_54 and *_57\n",
-      "consider run-time aliasing test between *_62 and *_67\n",
-      "consider run-time aliasing test between *_75 and *_78\n",
-      "consider run-time aliasing test between *_82 and *_87\n",
-      "poisson2d_reference.c:52:13: optimized: Loop 6 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:41:9: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d_reference.c:64:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:64:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:59:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:59:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:50:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:45:90: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d_reference.c:43:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d_reference.c:33:6: note: vectorized 1 loops in function.\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:43:13: optimized: loop turned into non-loop; it never loops\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:70:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_134, error_144);\n",
-      "poisson2d_reference.c:64:9: note: considering unrolling loop 5 at BB 25\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:64:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:59:9: note: considering unrolling loop 4 at BB 23\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:59:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:50:9: note: considering unrolling loop 3 at BB 21\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:50:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 9 at BB 14\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:47:25: optimized: loop unrolled 3 times (header execution count 432180)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 7 at BB 11\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:41:9: note: considering unrolling loop 10 at BB 7\n",
-      "poisson2d_reference.c:33:6: note: considering unrolling loop 2 at BB 6\n",
-      "poisson2d_reference.c:37:25: note: considering unrolling loop 1 at BB 30\n",
-      "poisson2d_reference.c:81:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:81:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:83:35: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:83:35: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:76:5: note: vectorized 0 loops in function.\n",
-      "poisson2d_reference.c:85:17: missed: statement clobbers memory: fprintf (stderr.0_13, \"ERROR: A[%d][%d] = %f does not match %f (reference)\\n\", iy_62, ix_61, _63, _64);\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 3 at BB 5\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:81:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 1 at BB 8\n",
-      "poisson2d.c:57:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:51:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:47:20: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:157:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:156:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:155:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:154:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:145:9: missed:   not inlinable: main/32 -> check_results/38, function body not available\n",
-      "poisson2d.c:138:31: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:98:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:95:5: missed:   not inlinable: main/32 -> poisson2d_reference/37, function body not available\n",
-      "poisson2d.c:93:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:91:5: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:73:29: missed:   not inlinable: main/32 -> exp/34, function body not available\n",
-      "poisson2d.c:63:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:62:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:61:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:60:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "Unit growth for small function inlining: 234->234 (0%)\n",
-      "\n",
-      "Inlined 4 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_84 and *_87\n",
-      "consider run-time aliasing test between *_92 and *_97\n",
-      "consider run-time aliasing test between *_104 and *_107\n",
-      "consider run-time aliasing test between *_111 and *_115\n",
-      "poisson2d.c:120:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:85:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:103:25: missed: couldn't vectorize loop\n",
-      "poisson2d.c:103:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d.c:132:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:132:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:127:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:127:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:118:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:107:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:107:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d.c:110:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d.c:83:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:67:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:27: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d.c:69:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
-      "poisson2d.c:38:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:110:13: optimized: loop turned into non-loop; it never loops\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _195 = strtol (_1, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _197 = strtol (_2, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _201 = strtol (_3, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _199 = strtol (_4, 0B, 10);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_155 = malloc (_7);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_157 = malloc (_7);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_159 = malloc (_7);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_161 = malloc (_7);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:91:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_239, ny_221, nx_124);\n",
-      "poisson2d.c:93:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate reference solution and time with serial CPU execution.\"[0]);\n",
-      "poisson2d.c:95:5: missed: statement clobbers memory: poisson2d_reference (iter_max_239, 1.00000000000000008180305391403130954586231382563710212708e-5, Aref_225, Anew_227, nx_124, ny_221, rhs_236);\n",
-      "poisson2d.c:98:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:138:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_245, error_242);\n",
-      "poisson2d.c:145:9: missed: statement clobbers memory: _118 = check_results (1, ix_end_162, 1, _131, 1.00000000000000008180305391403130954586231382563710212708e-5, A_125, Aref_225, nx_124);\n",
-      "poisson2d.c:154:5: missed: statement clobbers memory: free (rhs_236);\n",
-      "poisson2d.c:155:5: missed: statement clobbers memory: free (Anew_227);\n",
-      "poisson2d.c:156:5: missed: statement clobbers memory: free (Aref_225);\n",
-      "poisson2d.c:157:5: missed: statement clobbers memory: free (A_125);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_146 = malloc (2000000);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_145 = malloc (2000000);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_144 = malloc (2000000);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_130 = malloc (2000000);\n",
-      "poisson2d.c:132:9: note: considering unrolling loop 7 at BB 47\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:132:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:127:9: note: considering unrolling loop 6 at BB 44\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:127:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:118:9: note: considering unrolling loop 5 at BB 40\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:118:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 13 at BB 27\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 9 at BB 24\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:107:9: note: considering unrolling loop 14 at BB 37\n",
-      "poisson2d.c:38:5: note: considering unrolling loop 4 at BB 35\n",
-      "poisson2d.c:103:25: note: considering unrolling loop 3 at BB 51\n",
-      "poisson2d.c:83:5: note: considering unrolling loop 2 at BB 18\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:83:5: optimized: loop unrolled 7 times (header execution count 99)\n",
-      "poisson2d.c:69:9: note: considering unrolling loop 11 at BB 9\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:69:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d.c:67:5: note: considering unrolling loop 1 at BB 14\n",
-      "poisson2d_reference.c:85:17: missed:   not inlinable: check_results/17 -> fprintf/20, function body not available\n",
-      "poisson2d_reference.c:70:31: missed:   not inlinable: poisson2d_reference/16 -> printf/18, function body not available\n",
-      "Unit growth for small function inlining: 145->145 (0%)\n",
-      "\n",
-      "Inlined 0 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_54 and *_57\n",
-      "consider run-time aliasing test between *_62 and *_67\n",
-      "consider run-time aliasing test between *_75 and *_78\n",
-      "consider run-time aliasing test between *_82 and *_87\n",
-      "poisson2d_reference.c:52:13: optimized: Loop 6 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:41:9: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d_reference.c:64:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:64:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:59:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:59:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:50:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:45:90: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d_reference.c:43:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d_reference.c:33:6: note: vectorized 1 loops in function.\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:43:13: optimized: loop turned into non-loop; it never loops\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:70:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_134, error_144);\n",
-      "poisson2d_reference.c:64:9: note: considering unrolling loop 5 at BB 25\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:64:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:59:9: note: considering unrolling loop 4 at BB 23\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:59:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:50:9: note: considering unrolling loop 3 at BB 21\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:50:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 9 at BB 14\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:47:25: optimized: loop unrolled 3 times (header execution count 432180)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 7 at BB 11\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:41:9: note: considering unrolling loop 10 at BB 7\n",
-      "poisson2d_reference.c:33:6: note: considering unrolling loop 2 at BB 6\n",
-      "poisson2d_reference.c:37:25: note: considering unrolling loop 1 at BB 30\n",
-      "poisson2d_reference.c:81:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:81:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:83:35: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:83:35: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:76:5: note: vectorized 0 loops in function.\n",
-      "poisson2d_reference.c:85:17: missed: statement clobbers memory: fprintf (stderr.0_13, \"ERROR: A[%d][%d] = %f does not match %f (reference)\\n\", iy_62, ix_61, _63, _64);\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 3 at BB 5\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:81:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 1 at BB 8\n",
-      "poisson2d.c:57:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:51:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:47:20: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:157:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:156:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:155:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:154:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:145:9: missed:   not inlinable: main/32 -> check_results/38, function body not available\n",
-      "poisson2d.c:138:31: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:98:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:95:5: missed:   not inlinable: main/32 -> poisson2d_reference/37, function body not available\n",
-      "poisson2d.c:93:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:91:5: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:73:29: missed:   not inlinable: main/32 -> exp/34, function body not available\n",
-      "poisson2d.c:63:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:62:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:61:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:60:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "Unit growth for small function inlining: 234->234 (0%)\n",
-      "\n",
-      "Inlined 4 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_84 and *_87\n",
-      "consider run-time aliasing test between *_92 and *_97\n",
-      "consider run-time aliasing test between *_104 and *_107\n",
-      "consider run-time aliasing test between *_111 and *_115\n",
-      "poisson2d.c:120:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:85:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:103:25: missed: couldn't vectorize loop\n",
-      "poisson2d.c:103:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d.c:132:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:132:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:127:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:127:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:118:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:107:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:107:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d.c:110:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d.c:83:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:67:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:27: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d.c:69:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
-      "poisson2d.c:38:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:110:13: optimized: loop turned into non-loop; it never loops\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _195 = strtol (_1, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _197 = strtol (_2, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _201 = strtol (_3, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _199 = strtol (_4, 0B, 10);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_155 = malloc (_7);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_157 = malloc (_7);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_159 = malloc (_7);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_161 = malloc (_7);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:91:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_239, ny_221, nx_124);\n",
-      "poisson2d.c:93:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate reference solution and time with serial CPU execution.\"[0]);\n",
-      "poisson2d.c:95:5: missed: statement clobbers memory: poisson2d_reference (iter_max_239, 1.00000000000000008180305391403130954586231382563710212708e-5, Aref_225, Anew_227, nx_124, ny_221, rhs_236);\n",
-      "poisson2d.c:98:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:138:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_245, error_242);\n",
-      "poisson2d.c:145:9: missed: statement clobbers memory: _118 = check_results (1, ix_end_162, 1, _131, 1.00000000000000008180305391403130954586231382563710212708e-5, A_125, Aref_225, nx_124);\n",
-      "poisson2d.c:154:5: missed: statement clobbers memory: free (rhs_236);\n",
-      "poisson2d.c:155:5: missed: statement clobbers memory: free (Anew_227);\n",
-      "poisson2d.c:156:5: missed: statement clobbers memory: free (Aref_225);\n",
-      "poisson2d.c:157:5: missed: statement clobbers memory: free (A_125);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_146 = malloc (2000000);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_145 = malloc (2000000);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_144 = malloc (2000000);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_130 = malloc (2000000);\n",
-      "poisson2d.c:132:9: note: considering unrolling loop 7 at BB 47\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:132:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:127:9: note: considering unrolling loop 6 at BB 44\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:127:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:118:9: note: considering unrolling loop 5 at BB 40\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:118:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 13 at BB 27\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 9 at BB 24\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:107:9: note: considering unrolling loop 14 at BB 37\n",
-      "poisson2d.c:38:5: note: considering unrolling loop 4 at BB 35\n",
-      "poisson2d.c:103:25: note: considering unrolling loop 3 at BB 51\n",
-      "poisson2d.c:83:5: note: considering unrolling loop 2 at BB 18\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:83:5: optimized: loop unrolled 7 times (header execution count 99)\n",
-      "poisson2d.c:69:9: note: considering unrolling loop 11 at BB 9\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:69:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d.c:67:5: note: considering unrolling loop 1 at BB 14\n",
-      "poisson2d_reference.c:85:17: missed:   not inlinable: check_results/17 -> fprintf/20, function body not available\n",
-      "poisson2d_reference.c:70:31: missed:   not inlinable: poisson2d_reference/16 -> printf/18, function body not available\n",
-      "Unit growth for small function inlining: 145->145 (0%)\n",
-      "\n",
-      "Inlined 0 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_54 and *_57\n",
-      "consider run-time aliasing test between *_62 and *_67\n",
-      "consider run-time aliasing test between *_75 and *_78\n",
-      "consider run-time aliasing test between *_82 and *_87\n",
-      "poisson2d_reference.c:52:13: optimized: Loop 6 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:41:9: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d_reference.c:64:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:64:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:59:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:59:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d_reference.c:50:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:41:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:45:90: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d_reference.c:43:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d_reference.c:33:6: note: vectorized 1 loops in function.\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:43:13: optimized: loop turned into non-loop; it never loops\n",
-      "poisson2d_reference.c:33:6: missed: statement clobbers memory: __builtin_memcpy (_384, _376, _388);\n",
-      "poisson2d_reference.c:70:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_134, error_144);\n",
-      "poisson2d_reference.c:64:9: note: considering unrolling loop 5 at BB 25\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:64:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:59:9: note: considering unrolling loop 4 at BB 23\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:59:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:50:9: note: considering unrolling loop 3 at BB 21\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:50:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 9 at BB 14\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:47:25: optimized: loop unrolled 3 times (header execution count 432180)\n",
-      "poisson2d_reference.c:47:25: note: considering unrolling loop 7 at BB 11\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:41:9: note: considering unrolling loop 10 at BB 7\n",
-      "poisson2d_reference.c:33:6: note: considering unrolling loop 2 at BB 6\n",
-      "poisson2d_reference.c:37:25: note: considering unrolling loop 1 at BB 30\n",
-      "poisson2d_reference.c:81:9: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:81:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:83:35: missed: couldn't vectorize loop\n",
-      "poisson2d_reference.c:83:35: missed: not vectorized: control flow in loop.\n",
-      "poisson2d_reference.c:76:5: note: vectorized 0 loops in function.\n",
-      "poisson2d_reference.c:85:17: missed: statement clobbers memory: fprintf (stderr.0_13, \"ERROR: A[%d][%d] = %f does not match %f (reference)\\n\", iy_62, ix_61, _63, _64);\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 3 at BB 5\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d_reference.c:81:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d_reference.c:81:9: note: considering unrolling loop 1 at BB 8\n",
-      "poisson2d.c:57:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:51:14: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:47:20: optimized:   Inlining atoi/24 into main/32 (always_inline).\n",
-      "poisson2d.c:157:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:156:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:155:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:154:5: missed:   not inlinable: main/32 -> free/39, function body not available\n",
-      "poisson2d.c:145:9: missed:   not inlinable: main/32 -> check_results/38, function body not available\n",
-      "poisson2d.c:138:31: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:98:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:95:5: missed:   not inlinable: main/32 -> poisson2d_reference/37, function body not available\n",
-      "poisson2d.c:93:5: missed:   not inlinable: main/32 -> __builtin_puts/36, function body not available\n",
-      "poisson2d.c:91:5: missed:   not inlinable: main/32 -> printf/35, function body not available\n",
-      "poisson2d.c:73:29: missed:   not inlinable: main/32 -> exp/34, function body not available\n",
-      "poisson2d.c:63:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:62:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:61:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "poisson2d.c:60:41: missed:   not inlinable: main/32 -> malloc/33, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/32 -> strtol/40, function body not available\n",
-      "Unit growth for small function inlining: 234->234 (0%)\n",
-      "\n",
-      "Inlined 4 calls, eliminated 0 functions\n",
-      "\n",
-      "consider run-time aliasing test between *_84 and *_87\n",
-      "consider run-time aliasing test between *_92 and *_97\n",
-      "consider run-time aliasing test between *_104 and *_107\n",
-      "consider run-time aliasing test between *_111 and *_115\n",
-      "poisson2d.c:120:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:85:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
-      "poisson2d.c:103:25: missed: couldn't vectorize loop\n",
-      "poisson2d.c:103:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
-      "poisson2d.c:132:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:132:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:127:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:127:9: missed: Loop costings may not be worthwhile.\n",
-      "poisson2d.c:118:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:107:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:107:9: missed: not vectorized: control flow in loop.\n",
-      "poisson2d.c:110:13: optimized: loop vectorized using 16 byte vectors\n",
-      "poisson2d.c:83:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:67:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:27: missed: not vectorized: complicated access pattern.\n",
-      "poisson2d.c:69:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:73:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
-      "poisson2d.c:38:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:110:13: optimized: loop turned into non-loop; it never loops\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _195 = strtol (_1, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _197 = strtol (_2, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _201 = strtol (_3, 0B, 10);\n",
-      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _199 = strtol (_4, 0B, 10);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_155 = malloc (_7);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_157 = malloc (_7);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_159 = malloc (_7);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_161 = malloc (_7);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memset (_538, 0, _541);\n",
-      "poisson2d.c:91:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_239, ny_221, nx_124);\n",
-      "poisson2d.c:93:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate reference solution and time with serial CPU execution.\"[0]);\n",
-      "poisson2d.c:95:5: missed: statement clobbers memory: poisson2d_reference (iter_max_239, 1.00000000000000008180305391403130954586231382563710212708e-5, Aref_225, Anew_227, nx_124, ny_221, rhs_236);\n",
-      "poisson2d.c:98:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:38:5: missed: statement clobbers memory: __builtin_memcpy (_553, _549, _558);\n",
-      "poisson2d.c:138:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_245, error_242);\n",
-      "poisson2d.c:145:9: missed: statement clobbers memory: _118 = check_results (1, ix_end_162, 1, _131, 1.00000000000000008180305391403130954586231382563710212708e-5, A_125, Aref_225, nx_124);\n",
-      "poisson2d.c:154:5: missed: statement clobbers memory: free (rhs_236);\n",
-      "poisson2d.c:155:5: missed: statement clobbers memory: free (Anew_227);\n",
-      "poisson2d.c:156:5: missed: statement clobbers memory: free (Aref_225);\n",
-      "poisson2d.c:157:5: missed: statement clobbers memory: free (A_125);\n",
-      "poisson2d.c:60:41: missed: statement clobbers memory: A_146 = malloc (2000000);\n",
-      "poisson2d.c:61:41: missed: statement clobbers memory: Aref_145 = malloc (2000000);\n",
-      "poisson2d.c:62:41: missed: statement clobbers memory: Anew_144 = malloc (2000000);\n",
-      "poisson2d.c:63:41: missed: statement clobbers memory: rhs_130 = malloc (2000000);\n",
-      "poisson2d.c:132:9: note: considering unrolling loop 7 at BB 47\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:132:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:127:9: note: considering unrolling loop 6 at BB 44\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:127:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:118:9: note: considering unrolling loop 5 at BB 40\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:118:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 13 at BB 27\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
-      "poisson2d.c:114:25: note: considering unrolling loop 9 at BB 24\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:107:9: note: considering unrolling loop 14 at BB 37\n",
-      "poisson2d.c:38:5: note: considering unrolling loop 4 at BB 35\n",
-      "poisson2d.c:103:25: note: considering unrolling loop 3 at BB 51\n",
-      "poisson2d.c:83:5: note: considering unrolling loop 2 at BB 18\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:83:5: optimized: loop unrolled 7 times (header execution count 99)\n",
-      "poisson2d.c:69:9: note: considering unrolling loop 11 at BB 9\n",
-      "considering unrolling loop with constant number of iterations\n",
-      "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:69:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d.c:67:5: note: considering unrolling loop 1 at BB 14\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_Ofast_info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's compare this with the output during compilation when using profile-directed feedback from Task 1 B.\n",
+    "\n",
+    "**TASK**: \n",
+    "Adapt the `CFLAGS` of `poisson2d_ref_info` to include `-fopt-info-all` **and** the profile input of `-fprofile-use=…` here. *(Be advised: Long output!)*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 83,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
       "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
       "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
       "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
       "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "Increasing alignment of decl: __gcov0.main\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_D_00100_1_main/48 -> __gcov_exit/55, function body not available\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_I_00100_0_main/47 -> __gcov_init/54, function body not available\n",
       "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
       "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
       "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
@@ -1700,7 +821,7 @@
       "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
       "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
       "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
-      "Unit growth for small function inlining: 207->207 (0%)\n",
+      "Unit growth for small function inlining: 295->295 (0%)\n",
       "\n",
       "Inlined 4 calls, eliminated 0 functions\n",
       "\n",
@@ -1710,6 +831,8 @@
       "consider run-time aliasing test between *_111 and *_115\n",
       "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
       "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
       "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
       "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
       "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
@@ -1717,19 +840,19 @@
       "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
       "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
       "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:122:9: missed: not vectorized: control flow in loop.\n",
       "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
       "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
       "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
       "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:88:5: missed: not vectorized: control flow in loop.\n",
       "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
-      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:72:5: missed: not vectorized: control flow in loop.\n",
       "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
       "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
       "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
       "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
       "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
       "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
@@ -1738,48 +861,60 @@
       "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
       "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
       "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
-      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_337, ny_124, nx_286);\n",
       "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
-      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
-      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
-      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
-      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
-      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
-      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
-      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
-      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n",
-      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 47\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_316, error_118);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_127);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_311);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_122);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_129 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_132 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_140 = malloc (8000000);\n",
+      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 53\n",
       "considering unrolling loop with constant number of iterations\n",
       "considering unrolling loop with runtime-computable number of iterations\n",
       "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 44\n",
+      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 50\n",
       "considering unrolling loop with constant number of iterations\n",
       "considering unrolling loop with runtime-computable number of iterations\n",
       "poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
-      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 40\n",
+      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 47\n",
       "considering unrolling loop with constant number of iterations\n",
       "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:122:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
-      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 27\n",
+      "poisson2d.c:122:9: optimized: loop unrolled 3 times (header execution count 9800)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 33\n",
       "considering unrolling loop with constant number of iterations\n",
       "considering unrolling loop with runtime-computable number of iterations\n",
       "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
-      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 24\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 30\n",
       "considering unrolling loop with constant number of iterations\n",
       "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 37\n",
-      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 35\n",
-      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 51\n",
-      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 18\n",
+      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 42\n",
+      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 40\n",
+      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 60\n",
+      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 23\n",
       "considering unrolling loop with constant number of iterations\n",
       "considering unrolling loop with runtime-computable number of iterations\n",
-      "poisson2d.c:88:5: optimized: loop unrolled 7 times (header execution count 99)\n",
-      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 9\n",
+      "poisson2d.c:88:5: optimized: loop unrolled 3 times (header execution count 100)\n",
+      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 12\n",
       "considering unrolling loop with constant number of iterations\n",
       "considering unrolling loop with runtime-computable number of iterations\n",
       "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
-      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 14\n",
+      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 16\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_init (&*.LPBX0);\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_exit ();\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
+      "Job <24908> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "libgcov profiling error:/gpfs/wolf/trn003/scratch/aherten//#autofs#nccsopen-svm1_home#aherten#SC19-Tutorial#3-Optimizing_POWER#Handson#Task1#poisson2d.gcda:overwriting an existing profile data with a different timestamp\n",
+      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249490\n",
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c poisson2d_reference.o -o poisson2d_ref_info  -lm\n",
       "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
       "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
       "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
@@ -1882,20 +1017,21 @@
     }
    ],
    "source": [
-    "!cat  \"$SC19_DIR_SCRATCH\"/pgo-opt-record"
+    "!make poisson2d_ref_info"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "exercise": "solution"
-   },
+   "metadata": {},
    "source": [
-    "Comparing the annotations generated at plain Ofast optimization level and the one generated at Ofast + profile directed feedback, we observe that many more optimizations are possible due to profile information. For instance you will see annotations such as \n",
+    "Comparing the annotations generated with plain `-Ofast` against those generated with `-Ofast` plus profile-directed feedback, we observe that many more optimizations are possible due to the profile information.\n",
     "\n",
-    "`poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)`\n",
+    "For instance, you will see annotations such as\n",
+    "```\n",
+    "poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+    "```\n",
     "\n",
-    "The execution count indicates the dynamic execution count of the node at runtime. This information determines which paths are hotter and subsequently facilitate additional optimizations. \n"
+    "The execution count indicates how often the node is executed at runtime. This information determines which paths are hotter and subsequently facilitates additional optimizations."
    ]
   },
   {
@@ -1926,25 +1062,23 @@
     "\n",
     "### Overview\n",
     "\n",
-    "Study the difference of program execution time of different optimization levels with and without software prefetching.\n",
-    "\n",
-    "Verify the impact by measuring cache counters with and without prefetching.\n",
-    "\n",
-    "Learn how to modify contents of DSCR (data stream control register) using IBM XL compiler and study the impact with different values to DSCR. \n",
+    "* Study how program execution time differs between optimization levels, with and without software prefetching.\n",
+    "* Verify the impact by measuring cache counters with and without prefetching.\n",
+    "* Learn how to modify the contents of the DSCR (*Data Stream Control Register*) using the IBM XL compiler and study the impact of different DSCR values.\n",
     "\n",
-    "First, change directory to that of Task 2"
+    "But first, let's change to the directory of Task 2."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 37,
+   "execution_count": 85,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "/autofs/nccsopen-svm1_home/archanaravindar/SC19-Tutorial/Task2\n"
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task2\n"
      ]
     }
    ],
@@ -1956,7 +1090,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Part A: Building with SW Prefetch flag and Running Application."
+    "### Part A: Software Prefetching"
    ]
   },
   {
@@ -1965,22 +1099,39 @@
    "source": [
     "**TASK**: Look at the Makefile and work on the TODOs. \n",
     "\n",
-    "- First generate Ofast binary and note down the performance in terms of cycles, seconds and L3 Misses. \n",
-    "- Modify Makefile to add software prefetching option (`-fprefetch-loop-arrays`). Compare performance of Ofast binary with and without SW prefetching and note down the performance."
+    "- First, generate an `-Ofast`-optimized binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!\n",
+    "- Modify the `Makefile` to add the option for software prefetching (`-fprefetch-loop-arrays`). Compare the performance of `-Ofast` with and without software prefetching."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 38,
+   "execution_count": 97,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rm -f poisson2d poisson2d*.o\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make clean"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 88,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec   poisson2d.c  -o poisson2d  -lm\n",
+      "make: `poisson2d' is up to date.\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24714> is submitted to default queue <batch>.\n",
+      "Job <24911> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -1995,10 +1146,10 @@
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "2.40user 0.00system 0:02.40elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
-      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n",
+      "2.39user 0.01system 0:02.40elapsed 100%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
-      "Job <24715> is submitted to default queue <batch>.\n",
+      "Job <24912> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2016,15 +1167,33 @@
       "\n",
       " Performance counter stats for './poisson2d':\n",
       "\n",
-      "        8264018720      cycles:u                                                    \n",
-      "         483312952      r168a4:u                                                    \n",
+      "        8271503902      cycles:u                                                    \n",
+      "         481152478      r168a4:u                                                    \n",
       "\n",
-      "       2.412880428 seconds time elapsed\n",
-      "\n",
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm\n",
+      "       2.412224884 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 98,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm\n",
       "cp poisson2d_pref poisson2d\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24716> is submitted to default queue <batch>.\n",
+      "Job <24919> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2039,10 +1208,10 @@
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "1.93user 0.00system 0:01.93elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
-      "256inputs+0outputs (0major+481minor)pagefaults 0swaps\n",
+      "1.92user 0.00system 0:01.93elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
-      "Job <24717> is submitted to default queue <batch>.\n",
+      "Job <24920> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2060,45 +1229,39 @@
       "\n",
       " Performance counter stats for './poisson2d':\n",
       "\n",
-      "        6573061567      cycles:u                                                    \n",
-      "         459043186      r168a4:u                                                    \n",
+      "        6586609284      cycles:u                                                    \n",
+      "         459879452      r168a4:u                                                    \n",
       "\n",
-      "       1.920153470 seconds time elapsed\n",
+      "       1.925399505 seconds time elapsed\n",
       "\n"
      ]
     }
    ],
    "source": [
-    "!make CC=gcc poisson2d\n",
+    "!make poisson2d_pref CC=gcc\n",
     "!make run\n",
-    "!make l3missstats\n",
-    "!make CC=gcc poisson2d_pref\n",
-    "!make run\n",
-    "!make l3missstats\n"
+    "!make l3missstats"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**TASK**: Repeat the experiment with O3 flag. \n",
-    "\n",
-    "- First generate O3  binary and note down the performance in terms of cycles, seconds and L3 Misses. \n",
-    "- Modify Makefile to add software prefetching option (`-fprefetch-loop-arrays`). Compare performance of Ofast binary with and without SW prefetching and note down the performance."
+    "**TASK**: Repeat the experiment with the `-O3` flag. Have a look at the `Makefile` and the outlined TODO. There's a position to easily adapt `-Ofast`→`-O3`!"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 39,
+   "execution_count": 100,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "gcc -std=c99 -mcpu=power9 -O3    -DUSE_DOUBLE  -mvsx -maltivec   poisson2d.c  -o poisson2d  -lm\n",
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   poisson2d.c  -o poisson2d  -lm\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24718> is submitted to default queue <batch>.\n",
+      "Job <24923> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2116,7 +1279,7 @@
       "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
       "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
-      "Job <24719> is submitted to default queue <batch>.\n",
+      "Job <24924> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2134,15 +1297,32 @@
       "\n",
       " Performance counter stats for './poisson2d':\n",
       "\n",
-      "       16233775058      cycles:u                                                    \n",
-      "         632495173      r168a4:u                                                    \n",
+      "       16445764669      cycles:u                                                    \n",
+      "         645094089      r168a4:u                                                    \n",
       "\n",
-      "       4.729401619 seconds time elapsed\n",
-      "\n",
-      "gcc -std=c99 -mcpu=power9 -O3    -DUSE_DOUBLE  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm\n",
-      "cp poisson2d_pref poisson2d\n",
+      "       4.792567763 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 101,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24720> is submitted to default queue <batch>.\n",
+      "Job <24925> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2157,10 +1337,10 @@
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "4.72user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
-      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n",
+      "4.74user 0.00system 0:04.74elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
-      "Job <24721> is submitted to default queue <batch>.\n",
+      "Job <24926> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2178,19 +1358,16 @@
       "\n",
       " Performance counter stats for './poisson2d':\n",
       "\n",
-      "       16244358719      cycles:u                                                    \n",
-      "         619811301      r168a4:u                                                    \n",
+      "       16239159454      cycles:u                                                    \n",
+      "         631061431      r168a4:u                                                    \n",
       "\n",
-      "       4.732843397 seconds time elapsed\n",
+      "       4.730144897 seconds time elapsed\n",
       "\n"
      ]
     }
    ],
    "source": [
-    "!make CC=gcc -B poisson2d\n",
-    "!make run\n",
-    "!make l3missstats\n",
-    "!make CC=gcc -B poisson2d_pref\n",
+    "!make poisson2d_pref CC=gcc -B\n",
     "!make run\n",
     "!make l3missstats"
    ]
@@ -2199,7 +1376,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Do you notice the impact difference with optimization levels? At what optimization level does software prefetching help the most"
+    "Do you notice the impact difference with optimization levels? At what optimization level does software prefetching help the most?"
    ]
   },
   {
@@ -2208,7 +1385,7 @@
     "exercise": "solution"
    },
    "source": [
-    "Observing the results, we see that SW Prefetching seems to help at Ofast but not at O3. We can use the steps described in the the next section to verify that the compiler has not inserted any SW prefetch operations at O3 at all. That is because in the O3 binary the time is dominated by `__fmax` call which causes the compiler to come to the conclusion that whatever benefit we obtain by adding SW prefetch will be overshadowed by the penalty of fmax\n",
+    "Observing the results, we see that software prefetching seems to help at `-Ofast` but not at `-O3`. We can use the steps described in the next section to verify that the compiler has not inserted any software prefetch operations at `-O3` at all. That is because in the `-O3` binary the time is dominated by the `__fmax` call, which leads the compiler to conclude that whatever benefit we obtain by adding software prefetching will be overshadowed by the penalty of `fmax()`.\n",
     "GCC may add further loop optimizations such as unrolling upon invocation of `–fprefetch-loop-arrays`.\n"
    ]
   },
@@ -2218,7 +1395,7 @@
    "source": [
     "### Part B: Analysis of Instructions\n",
     "\n",
-    "Compilation of the Ofast binary with the software prefetching flag causes the compiler to generate the `dcb*'  instructions that prefetch memory values to L3."
+    "Compilation of the `-Ofast` binary with the software prefetching flag causes the compiler to generate the `dcb*` instructions that prefetch memory values to L3."
    ]
   },
   {
@@ -2226,33 +1403,32 @@
    "metadata": {},
    "source": [
     "**TASK**: \n",
-    "Run ` $(SC19_SUBMIT_CMD) objdump -lSd` on each binary file (O3, Ofast with prefetch/no prefetch)\n",
+    "Run `$(SC19_SUBMIT_CMD) objdump -lSd` on each binary file (`-O3`, `-Ofast` with prefetch/no prefetch).\n",
     "Look for instructions beginning with `dcb`\n",
     "At what optimization levels does the compiler generate software prefetching instructions?"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 42,
+   "execution_count": 114,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "gcc -std=c99 -mcpu=power9 -Ofast    -DUSE_DOUBLE  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm\n",
-      "cp poisson2d_pref poisson2d\n"
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n"
      ]
     }
    ],
    "source": [
     "!make CC=gcc -B poisson2d_pref\n",
-    "!objdump -lSd ./poisson2d > poisson2d.dis"
+    "!objdump -lSd ./poisson2d_pref > poisson2d.dis"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 43,
+   "execution_count": 116,
    "metadata": {},
    "outputs": [
     {
@@ -2285,22 +1461,24 @@
    "source": [
     "### Part C: Changing Values of DSCR via compiler flags\n",
     "\n",
-    "This task requires loading IBM XL compiler from the environment. \n",
+    "This task requires using the IBM XL compiler. It should be already in your environment.\n",
+    "\n",
+    "\n",
     "We saw the impact of software prefetching in the previous subsection. \n",
     "In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance. \n",
     "In this exercise we shall see some compiler options that can be used to modify the DSCR value which controls aggressiveness of prefetching. It can be also used to turn off hardware prefetching. \n",
     "\n",
-    "IBM XL compiler has an option -qprefetch=dscr=`<val>` that can be used for this purpose.\n",
-    "Compiling with -qprefetch=dscr=1 turns off the prefetcher.\n",
-    "One can give various values such as -qprefetch=dscr=4, -qprefetch=dscr=7 etc to control aggressiveness of prefetching.\n",
-    "For this exercise we use make CC=xlc_r to illustrate the performance impact.\n",
+    "The IBM XL compiler has an option `-qprefetch=dscr=<val>` that can be used for this purpose.\n",
+    "Compiling with `-qprefetch=dscr=1` turns off the prefetcher. One can give various values such as `-qprefetch=dscr=4`, `-qprefetch=dscr=7` etc. to control aggressiveness of prefetching.\n",
+    "\n",
+    "For this exercise we use `make CC=xlc_r` to illustrate the performance impact.\n",
     "    \n",
     "\n",
-    "**Task** Generate XL binary by compiling using the following commands. Add -qprefetch=dscr=1 and rebuild the application and note the performance; Compare the performance of the default binary with the binary compiled with -qprefetch=dscr=1. Which one is faster? \n",
+    "**Task** Generate an XL-compiled binary using the following cells. After you've generated a baseline, start editing the `Makefile`: add `-qprefetch=dscr=1` to the `CFLAGS`, rebuild the application, and note the performance. Which one is faster?\n",
     "\n",
-    "In general, applications benefit with the default settings of hardware DSCR register. (-qprefetch=dscr=0). However, certain applications also benefit with prefetching turned off. \n",
+    "In general, applications benefit from the default setting of the hardware DSCR register (`-qprefetch=dscr=0`). However, certain applications also benefit from prefetching turned off. \n",
     "\n",
-    "It is to be noted that DSCR values are highly sensitive to the application. One value that works well for Application A may not help Application B. \n"
+    "Note that the best DSCR value is highly sensitive to the application. A value that works well for Application A may not help Application B. "
    ]
   },
   {
@@ -2312,17 +1490,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": 117,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "xlc_r  -std=c99 -qarch=pwr9 -qtune=pwr9  -Ofast   -DUSE_DOUBLE -DINLINE_LIBS   poisson2d.c -o poisson2d  -lm\n",
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  poisson2d.c -o poisson2d  -lm\n",
       "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24725> is submitted to default queue <batch>.\n",
+      "Job <24927> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2337,7 +1515,7 @@
       "  700, 345.423208\n",
       "  800, 393.963155\n",
       "  900, 442.314962\n",
-      "2.27user 0.00system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "2.26user 0.00system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
       "256inputs+0outputs (0major+477minor)pagefaults 0swaps\n"
      ]
     }
@@ -2356,39 +1534,38 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 48,
+   "execution_count": 9,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "xlc_r -std=c99 -qarch=pwr9 -qtune=pwr9  -Ofast   -DUSE_DOUBLE -DINLINE_LIBS   -qprefetch=dscr=1 poisson2d.c   -o poisson2d_dscr  -lm\n",
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  -qprefetch=dscr=1 poisson2d.c -o poisson2d_dscr  -lm\n",
       "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
-      "cp poisson2d_dscr poisson2d\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
-      "Job <24726> is submitted to default queue <batch>.\n",
+      "Job <24929> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
       "Calculate current execution.\n",
       "    0, 0.249995\n",
-      "  100, 50.149062\n",
-      "  200, 99.849327\n",
-      "  300, 149.352369\n",
-      "  400, 198.659746\n",
-      "  500, 247.773000\n",
-      "  600, 296.693652\n",
-      "  700, 345.423208\n",
-      "  800, 393.963155\n",
-      "  900, 442.314962\n",
-      "4.63user 0.01system 0:04.64elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
-      "256inputs+0outputs (0major+476minor)pagefaults 0swaps\n"
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.58user 0.00system 0:04.59elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "0inputs+0outputs (0major+476minor)pagefaults 0swaps\n"
      ]
     }
    ],
    "source": [
-    "!make CC=xlc_r -B poisson2d_dscr\n",
+    "!make poisson2d_dscr CC=xlc_r -B\n",
     "!make run"
    ]
   },
@@ -2444,14 +1621,14 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 10,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "/autofs/nccsopen-svm1_home/archanaravindar/SC19-Tutorial/Task3/Solutions\n"
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task3\n"
      ]
     }
    ],
@@ -2465,80 +1642,49 @@
    "source": [
     "### Part A: Implement OpenMP Pragmas; Compilation\n",
     "\n",
-    "**Task**: Please add the correct OpenMP pragmas to poisson2d.c and compilations flags in the Makefile to enable OpenMP with GCC and XL compilers.\n",
+    "**Task**: Please add the correct OpenMP directives to `poisson2d.c` and compilation flags in the `Makefile` to enable OpenMP with the GCC and XL compilers.\n",
     "\n",
-    "* **pragmas**: Look at the TODOs in [`poisson2d.c`](/edit/Task3/poisson2d.c) to add OpenMP parallelism. The pragmas in question are `#pragma  omp parallel for`\n",
-    "* **Compilation**: Please add compilation flags enabling OpenMP in GCC and XL to Makefile.  For GCC, we need to add `-fopenmp` and the application needs to be linked with -lgomp. \n",
-    "For XL, we need to add -qsmp=omp to the list of compilation flags. \n",
+    "* **Directives**: Look at the TODOs in [`poisson2d.c`](poisson2d.c) to add OpenMP parallelism. The directives in question are `#pragma omp parallel for` (and once it's `#pragma omp parallel for reduction(max:error)` – can you guess where?)\n",
+    "* **Compilation**: Please add compilation flags enabling OpenMP in GCC and XL to the `Makefile`. For GCC, we need to add `-fopenmp` and the application needs to be linked with `-lgomp`. For XL, we need to add `-qsmp=omp` to the list of compilation flags. \n",
     "\n",
     "Afterwards, compile and run the application with the following commands."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 39,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "gcc -c -std=c99 -mcpu=power9  -Ofast  -DUSE_DOUBLE  -mvsx -maltivec    -fopenmp poisson2d_reference.c -o poisson2d_reference.o -lm -lgomp \n",
-      "gcc -std=c99 -mcpu=power9  -Ofast  -DUSE_DOUBLE  -mvsx -maltivec    -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm -lgomp\n",
-      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS   time ./poisson2d\n",
-      "Job <24666> is submitted to default queue <batch>.\n",
-      "<<Waiting for dispatch ...>>\n",
-      "<<Starting on login1>>\n",
-      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
-      "Calculate reference solution and time with serial CPU execution.\n",
-      "    0, 0.249995\n",
-      "  100, 0.248997\n",
-      "  200, 0.248007\n",
-      "  300, 0.247025\n",
-      "  400, 0.246050\n",
-      "  500, 0.245084\n",
-      "  600, 0.244124\n",
-      "  700, 0.243173\n",
-      "  800, 0.242228\n",
-      "  900, 0.241291\n",
-      "Calculate current execution.\n",
-      "    0, 0.249995\n",
-      "  100, 0.248997\n",
-      "  200, 0.248007\n",
-      "  300, 0.020992\n",
-      "  400, 0.021114\n",
-      "  500, 0.245084\n",
-      "  600, 0.244124\n",
-      "  700, 0.021475\n",
-      "  800, 0.242228\n",
-      "  900, 0.021712\n",
-      "1000x1000: Ref:   2.3959 s, This:   2.1408 s, speedup:     1.12\n",
-      "11.08user 0.00system 0:04.57elapsed 242%CPU (0avgtext+0avgdata 24896maxresident)k\n",
-      "256inputs+0outputs (0major+739minor)pagefaults 0swaps\n"
+      "gcc -c -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp   poisson2d_reference.c -o poisson2d_reference.o -lm\n",
+      "gcc -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp  poisson2d.c poisson2d_reference.o -o poisson2d  -lm \n"
      ]
     }
    ],
    "source": [
-    "!make CC=gcc -B run "
+    "!make poisson2d CC=gcc"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The command to submit a job to the batch system is prepared in an environment variable $SC19_SUBMIT_CMD; use it together with eval. In the following cell, it is shown how to invoke the application using the batch system. "
+    "The command to submit a job to the batch system is prepared in an environment variable `$SC19_SUBMIT_CMD`; use it together with `eval`. In the following cell, it is shown how to invoke the application using the batch system. "
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 40,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Job <24651> is submitted to default queue <batch>.\n",
+      "Job <24951> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -2554,17 +1700,17 @@
       "  800, 0.242228\n",
       "  900, 0.241291\n",
       "Calculate current execution.\n",
-      "    0, 0.249975\n",
+      "    0, 0.249995\n",
       "  100, 0.248997\n",
-      "  200, 0.020870\n",
-      "  300, 0.020992\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
       "  400, 0.246050\n",
       "  500, 0.245084\n",
       "  600, 0.244124\n",
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "1000x1000: Ref:   2.3963 s, This:   2.1526 s, speedup:     1.11\n"
+      "1000x1000: Ref:   4.7430 s, This:   3.9363 s, speedup:     1.20\n"
      ]
     }
    ],
@@ -2583,7 +1729,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 32,
    "metadata": {
     "exercise": "task"
    },
@@ -2592,21 +1738,23 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Job <24653> is submitted to default queue <batch>.\n",
+      "Job <24945> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
+      "\n",
+      "libgomp: Invalid value for environment variable OMP_NUM_THREADS\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
       "Calculate reference solution and time with serial CPU execution.\n",
       "    0, 0.249995\n",
-      "  100, 0.248997\n",
-      "  200, 0.248007\n",
-      "  300, 0.247025\n",
-      "  400, 0.246050\n",
-      "  500, 0.245084\n",
-      "  600, 0.244124\n",
-      "  700, 0.243173\n",
-      "  800, 0.242228\n",
-      "  900, 0.241291\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
       "Calculate current execution.\n",
       "    0, 0.249995\n",
       "  100, 0.248997\n",
@@ -2618,7 +1766,7 @@
       "  700, 0.243173\n",
       "  800, 0.242228\n",
       "  900, 0.241291\n",
-      "1000x1000: Ref:   2.3823 s, This:   2.7327 s, speedup:     0.87\n"
+      "1000x1000: Ref:   2.1046 s, This:   2.4171 s, speedup:     0.87\n"
      ]
     }
    ],
@@ -2628,7 +1776,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 41,
    "metadata": {
     "exercise": "solution"
    },
@@ -2639,7 +1787,7 @@
      "text": [
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
-      "1000x1000: Ref:   2.3780 s, This:   2.7215 s, speedup:     0.87\n"
+      "1000x1000: Ref:   4.7288 s, This:   4.9791 s, speedup:     0.95\n"
      ]
     }
    ],
@@ -2649,7 +1797,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 42,
    "metadata": {
     "exercise": "solution"
    },
@@ -2660,7 +1808,7 @@
      "text": [
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
-      "1000x1000: Ref:   2.3788 s, This:   1.3734 s, speedup:     1.73\n"
+      "1000x1000: Ref:   4.7125 s, This:   2.4914 s, speedup:     1.89\n"
      ]
     }
    ],
@@ -2670,7 +1818,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 35,
    "metadata": {
     "exercise": "solution"
    },
@@ -2681,7 +1829,7 @@
      "text": [
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
-      "1000x1000: Ref:   2.3779 s, This:   0.8743 s, speedup:     2.72\n"
+      "1000x1000: Ref:   2.1065 s, This:   1.3836 s, speedup:     1.52\n"
      ]
     }
    ],
@@ -2842,18 +1990,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 54,
+   "execution_count": 36,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Job <24463> is submitted to default queue <batch>.\n",
+      "Job <24949> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "0-3\n",
-      "Job <24464> is submitted to default queue <batch>.\n",
+      "Job <24950> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "4-7\n"
@@ -2869,7 +2017,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "There are various environment variables available within OpenMP (some specific to GCC) that hold across compilers to specify binding of threads to cores. See, for instance, the [OMP_PLACES environment Variable](https://www.openmp.org/spec-html/5.0/openmpse53.html). Examples are `OMP_PLACES` which is an OpenMP variable. We also have a GNU specific variable which can also be used to control affinity- `GOMP_CPU_AFFINITY`. Setting GOMP_CPU_AFFINITY is specific to GCC binaries but it internally serves the same function as setting OMP_PLACES. \n",
+    "OpenMP provides environment variables that hold across compilers to specify the binding of threads to cores; some additional variables are specific to GCC. See, for instance, the [OMP_PLACES environment Variable](https://www.openmp.org/spec-html/5.0/openmpse53.html). There is also a GNU-specific variable that can be used to control affinity: `GOMP_CPU_AFFINITY`. Setting `GOMP_CPU_AFFINITY` is specific to GCC binaries, but it internally serves the same function as setting `OMP_PLACES`. \n",
     "\n",
     "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
     "\n",
@@ -2880,7 +2028,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 38,
    "metadata": {
     "exercise": "task"
    },
@@ -2889,13 +2037,12 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "<<Waiting for dispatch ...>>\n",
-      "<<Starting on login1>>\n"
+      "/usr/bin/sh: OMP_PLACES}={X},{Y},{Z},{A}: command not found\n"
      ]
     }
    ],
    "source": [
-    "!eval OMP_DISPLAY_ENV=true OMP_PLACES}=\"{X},{Y},{Z},{A}\" OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
+    "!eval OMP_DISPLAY_ENV=true OMP_PLACES=\"{X},{Y},{Z},{A}\" OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
    ]
   },
   {
@@ -2924,7 +2071,7 @@
     "exercise": "solution"
    },
    "source": [
-    "Running with two different configurations 1) Binding all threads to the same core 2) Binding all threads to different cores, we see a higher speedup in case of binding all threads to different cores. The binding is set using `OMP_PLACES` in this case."
+    "Running with two different configurations 1) Binding all threads to the same core 2) Binding all threads to different cores, we see a higher speedup in case of binding all threads to different cores."
    ]
   },
   {
@@ -2933,12 +2080,12 @@
     "exercise": "solution"
    },
    "source": [
-    "Using `OMP_PLACES` for binding"
+    "Using `OMP_PLACES` for binding, and using some magical Python-Bash interplay:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 43,
    "metadata": {
     "exercise": "solution"
    },
@@ -2951,12 +2098,12 @@
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "  OMP_PLACES = '{0},{1},{2},{3}'\n",
-      "1000x1000: Ref:   2.3966 s, This:   2.1410 s, speedup:     1.12\n",
+      "1000x1000: Ref:   4.7315 s, This:   3.9090 s, speedup:     1.21\n",
       "Affinity: {0},{5},{9},{13}\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "  OMP_PLACES = '{0},{5},{9},{13}'\n",
-      "1000x1000: Ref:   2.3922 s, This:   0.7029 s, speedup:     3.40\n"
+      "1000x1000: Ref:   4.6485 s, This:   1.2829 s, speedup:     3.62\n"
      ]
     }
    ],
@@ -3009,27 +2156,29 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The same experiments can be repeated with the IBM XL compiler. \n",
+    "Great!\n",
+    "\n",
+    "If you still have time: The same experiments can be repeated with the IBM XL compiler. \n",
     "The corresponding compiler flag to enable OpenMP parallelism that needs to be used for XL is `-qsmp=omp`\n",
     "\n",
-    "**Task**: In the Makefile add the openMP flag and generate XL binaries with the openmp option and run the application with various number of threads and note the performance speedup."
+    "**Task**: In the Makefile, add the OpenMP flag and generate XL binaries with OpenMP enabled. Then run the application with various numbers of threads and note the performance speedup."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 44,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "xlc_r -c -std=c99 -qarch=pwr9 -qtune=pwr9  -O3 -qhot   -DUSE_DOUBLE -DINLINE_LIBS  -qsmp=omp -qsmp=omp poisson2d_reference.c -o poisson2d_reference.o -lm \n",
+      "xlc_r -c -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp    poisson2d_reference.c -o poisson2d_reference.o -lm \n",
       "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
-      "xlc_r -std=c99 -qarch=pwr9 -qtune=pwr9  -O3 -qhot   -DUSE_DOUBLE -DINLINE_LIBS  -qsmp=omp -qsmp=omp  poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
+      "xlc_r -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
       "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
       "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS   time ./poisson2d\n",
-      "Job <24677> is submitted to default queue <batch>.\n",
+      "Job <24956> is submitted to default queue <batch>.\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
       "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
@@ -3045,19 +2194,19 @@
       "  800, 393.963155\n",
       "  900, 442.314962\n",
       "Calculate current execution.\n",
-      "    0, 0.249975\n",
+      "    0, 0.249995\n",
       "  100, 50.149062\n",
       "  200, 99.849327\n",
       "  300, 149.352369\n",
       "  400, 198.659746\n",
       "  500, 247.773000\n",
       "  600, 296.693652\n",
-      "  700, 29.493030\n",
-      "  800, 33.799893\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
       "  900, 442.314962\n",
-      "1000x1000: Ref:   6.0360 s, This:   2.4149 s, speedup:     2.50\n",
-      "21.04user 6.65system 0:08.48elapsed 326%CPU (0avgtext+0avgdata 23040maxresident)k\n",
-      "256inputs+0outputs (0major+1099minor)pagefaults 0swaps\n"
+      "1000x1000: Ref:   5.6783 s, This:   2.6528 s, speedup:     2.14\n",
+      "21.56user 6.18system 0:08.37elapsed 331%CPU (0avgtext+0avgdata 23040maxresident)k\n",
+      "3200inputs+0outputs (2major+1098minor)pagefaults 0swaps\n"
      ]
     }
    ],
@@ -3212,7 +2361,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 32,
+   "execution_count": 45,
    "metadata": {
     "exercise": "solution"
    },
@@ -3223,7 +2372,7 @@
      "text": [
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
-      "1000x1000: Ref:   2.3086 s, This:   0.2739 s, speedup:     8.43\n"
+      "1000x1000: Ref:   2.1678 s, This:   0.2869 s, speedup:     7.56\n"
      ]
     }
    ],
@@ -3304,30 +2453,30 @@
     "\n",
     "Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.\n",
     "\n",
+    "We are mixing Python with Bash (`!`) here, so don't get confused: because of this, Bash environment variables need to be written with two dollar signs (`$$`).\n",
+    "\n",
     "What's your maximum speedup?"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 38,
+   "execution_count": null,
    "metadata": {
-    "exercise": "solution"
+    "exercise": "task"
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Affinity: {0},{1},{2},{3}\n",
-      "<<Waiting for dispatch ...>>\n",
-      "<<Starting on login1>>\n",
-      "  OMP_PLACES='{0},{1},{2},{3}' custom\n",
-      "1000x1000: Ref:   6.0379 s, This:   2.4154 s, speedup:     2.50\n",
-      "Affinity: {0},{5},{9},{13}\n",
+      "Affinity: {X},{Y},{Z},{A}\n",
       "<<Waiting for dispatch ...>>\n",
       "<<Starting on login1>>\n",
-      "  OMP_PLACES='{0},{5},{9},{13}' custom\n",
-      "1000x1000: Ref:   2.3051 s, This:   0.6816 s, speedup:     3.38\n"
+      "1587-117 The string for the OpenMP environment variable 'OMP_PLACES' contains unexpected or invalid text.  OpenMP environment variable ignored. \n",
+      "  OMP_PLACES='cores(44)'\n",
+      "1000x1000: Ref:   2.0988 s, This:   0.6556 s, speedup:     3.20\n",
+      "Affinity: {P},{Q},{R},{S}\n",
+      "<<Waiting for dispatch ...>>\n"
      ]
     }
    ],
@@ -3398,7 +2547,7 @@
    "source": [
     "# Survey<a name=\"survey\"></a>\n",
     "\n",
-    "Please rememeber to take some time and fill out the [survey](http://bit.ly/sc18-eval)."
+    "Please remember to take some time and fill out the [survey](http://bit.ly/sc19-eval)."
    ]
   }
  ],
diff --git a/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.html b/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.html
index 42729e6fe376f7a80104c0812d0c494ee943c9e7..53ef1e0297ef9b6c833894d108f5ce480ef23bec 100644
--- a/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.html
+++ b/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.html
@@ -13017,45 +13017,6 @@ ul.typeahead-list  > li > a.pull-right {
 .highlight .vm { color: #19177C } /* Name.Variable.Magic */
 .highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
     </style>
-<style type="text/css">
-    
-/* Temporary definitions which will become obsolete with Notebook release 5.0 */
-.ansi-black-fg { color: #3E424D; }
-.ansi-black-bg { background-color: #3E424D; }
-.ansi-black-intense-fg { color: #282C36; }
-.ansi-black-intense-bg { background-color: #282C36; }
-.ansi-red-fg { color: #E75C58; }
-.ansi-red-bg { background-color: #E75C58; }
-.ansi-red-intense-fg { color: #B22B31; }
-.ansi-red-intense-bg { background-color: #B22B31; }
-.ansi-green-fg { color: #00A250; }
-.ansi-green-bg { background-color: #00A250; }
-.ansi-green-intense-fg { color: #007427; }
-.ansi-green-intense-bg { background-color: #007427; }
-.ansi-yellow-fg { color: #DDB62B; }
-.ansi-yellow-bg { background-color: #DDB62B; }
-.ansi-yellow-intense-fg { color: #B27D12; }
-.ansi-yellow-intense-bg { background-color: #B27D12; }
-.ansi-blue-fg { color: #208FFB; }
-.ansi-blue-bg { background-color: #208FFB; }
-.ansi-blue-intense-fg { color: #0065CA; }
-.ansi-blue-intense-bg { background-color: #0065CA; }
-.ansi-magenta-fg { color: #D160C4; }
-.ansi-magenta-bg { background-color: #D160C4; }
-.ansi-magenta-intense-fg { color: #A03196; }
-.ansi-magenta-intense-bg { background-color: #A03196; }
-.ansi-cyan-fg { color: #60C6C8; }
-.ansi-cyan-bg { background-color: #60C6C8; }
-.ansi-cyan-intense-fg { color: #258F8F; }
-.ansi-cyan-intense-bg { background-color: #258F8F; }
-.ansi-white-fg { color: #C5C1B4; }
-.ansi-white-bg { background-color: #C5C1B4; }
-.ansi-white-intense-fg { color: #A1A6B2; }
-.ansi-white-intense-bg { background-color: #A1A6B2; }
-
-.ansi-bold { font-weight: bold; }
-
-    </style>
 
 
 <style type="text/css">
@@ -13089,7 +13050,7 @@ div#notebook {
 
 <!-- Loading mathjax macro -->
 <!-- Load mathjax -->
-    <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_HTML"></script>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS_HTML"></script>
     <!-- MathJax configuration -->
     <script type="text/x-mathjax-config">
     MathJax.Hub.Config({
@@ -13116,7 +13077,7 @@ div#notebook {
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h1 id="Hands-On-Performance-Optimization">Hands-On Performance Optimization<a class="anchor-link" href="#Hands-On-Performance-Optimization">&#182;</a></h1><p><em>Supercomputing 2018 Tutorial "Application Porting and Optimization on GPU-Accelerated POWER Architectures", November 12th 2018</em></p>
+<h1 id="Hands-On-Performance-Optimization">Hands-On Performance Optimization<a class="anchor-link" href="#Hands-On-Performance-Optimization">&#182;</a></h1><p><em>Supercomputing 2019 Tutorial "Application Porting and Optimization on GPU-Accelerated POWER Architectures", November 18th 2019</em></p>
 <hr>
 
 </div>
@@ -13128,7 +13089,7 @@ div#notebook {
 <p>As for the first task of this tutorial, also this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done using an SSH connection to Ascent (or any other POWER9 computer) in your terminal.</p>
 <h2 id="Jupyter-notebook-execution">Jupyter notebook execution<a class="anchor-link" href="#Jupyter-notebook-execution">&#182;</a></h2><p>When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizng the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. Check the <em>edit</em> menu above.</p>
 <p>You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.</p>
-<p>If you want you also can get a <a href="/terminals/1">terminal</a> in your browser.</p>
+<p>If you want you also can get a terminal in your browser; just open it via the »New Launcher« button (<code>+</code>).</p>
 <h2 id="Terminal-fallback">Terminal fallback<a class="anchor-link" href="#Terminal-fallback">&#182;</a></h2><p>The tasks are place in directories named <code>Task[1-3]</code>.</p>
 <p>Makefile targets are created to cover everything, from compile, to run and profile. Please take a look at the cells containing the make calls as a guide also for the non-interactive version of this description.</p>
 
@@ -13138,10 +13099,23 @@ div#notebook {
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Setup">Setup<a class="anchor-link" href="#Setup">&#182;</a></h2><p>This hands-on session requires of GCC 6.4.0. By loading the <code>sc18/handson2</code> module before invoking this Notebook, we took care of also loading GCC 6.4.0 into the environment.</p>
+<h2 id="Setup">Setup<a class="anchor-link" href="#Setup">&#182;</a></h2><p>We are using some very fresh compiler features and use GCC 9.2.0 because of that. It should already be in your environment. Let's check!</p>
 
 </div>
 </div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>gcc --version
+</pre></div>
+
+    </div>
+</div>
+</div>
+
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
@@ -13149,17 +13123,19 @@ div#notebook {
 <h2 id="Tasks">Tasks<a name="top" /><a class="anchor-link" href="#Tasks">&#182;</a></h2><p>This session comes with multiple tasks, each one to be found in the respective sub-directory <code>Task[1-3]</code>. In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.</p>
 <p>Please choose from the task below.</p>
 <ul>
-<li><p><a href="#task1">Task 1</a>: Compile Flags<br>
-Improve performance of the CPU Jacobi solver with compiler flags such as <code>Ofast</code> and profile-directed feedback (<a href="#solution0">Solution 1</a>)</p>
-</li>
-<li><p><a href="#task2">Task 2</a>: Software Prefetching<br>
-Improve performance of the CPU Jacobi solver with software prefetching (<a href="#solution1">Solution 2</a>)</p>
-</li>
-<li><p><a href="#task3">Task 3</a>: OpenMP<br>
-Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance (<a href="#solution2">Solution 3</a>)</p>
-</li>
-<li><p><a href="#survey">Suvery</a> Please remember to take the survey !</p>
-</li>
+<li><a href="#task1">Task 1</a>: <strong>Basic compiler optimization flags and compiler annotations</strong></li>
+</ul>
+<p>Improve performance of the CPU Jacobi solver with compiler flags such as <code>Ofast</code> and profile-directed feedback. Learn about compiler annotations.</p>
+<ul>
+<li><a href="#task2">Task 2</a>: <strong>Optimization via Prefetching controlled by compiler</strong></li>
+</ul>
+<p>Improve performance of the CPU Jacobi solver with software prefetching. Some compilers such as IBM XL define flags that can be used to modify the aggressiveness of the hardware prefetcher. Learn to modify the DSCR value through XL and study the impact on application performance.</p>
+<ul>
+<li><a href="#task3">Task 3</a>: <strong>Optimization via OpenMP controlled by compiler and the system</strong></li>
+</ul>
+<p>Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance.</p>
+<ul>
+<li><a href="#survey">Survey</a>: Please remember to take the survey!</li>
 </ul>
 <h3 id="Make-Targets-">Make Targets <a name="make" /><a class="anchor-link" href="#Make-Targets-">&#182;</a></h3><p>For all tasks we have defined the following make targets.</p>
 <ul>
@@ -13184,11 +13160,13 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Task-1:-Compile-Flags-">Task 1: Compile Flags <a name="task1" /><a class="anchor-link" href="#Task-1:-Compile-Flags-">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>The goal of this task is to understand different options available to optimize the performance of the CPU Jacobi solver</p>
+<h2 id="Task-1:-Basic-compiler-optimization-flags-and-compiler-annotations-">Task 1: Basic compiler optimization flags and compiler annotations <a name="task1" /><a class="anchor-link" href="#Task-1:-Basic-compiler-optimization-flags-and-compiler-annotations-">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>The goal of this task is to understand different options available to optimize the performance of the CPU Jacobi solver</p>
 <p>Your task is to:</p>
 <ul>
 <li>Optimize performance with <code>-Ofast</code> flag</li>
+<li>Verify the cause for performance improvement by viewing perf profiles of O3 and Ofast binaries </li>
 <li>Optimize performance with profile directed feedback </li>
+<li>Generate compiler annotations/remarks to understand the optimizations done by the compiler with and without profile directed feedback </li>
 </ul>
 <p>First, change the working directory to <code>Task1</code>.</p>
 
@@ -13211,7 +13189,8 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-A:--Ofast-vs.--O3">Part A: <code>-Ofast</code> vs. <code>-O3</code><a class="anchor-link" href="#Part-A:--Ofast-vs.--O3">&#182;</a></h3><p>We are to compare the performance of the binary being compiled with <code>-Ofast</code> optimization and with <code>-O3</code> optimization. Right now, the Makefile specifies <code>-O3</code> as the optimization flag. Compile the code using <code>make</code> and run it with <code>make run</code> in the next two cells.</p>
+<h3 id="Part-A:--Ofast-vs.--O3">Part A: <code>-Ofast</code> vs. <code>-O3</code><a class="anchor-link" href="#Part-A:--Ofast-vs.--O3">&#182;</a></h3><p>We are to compare the performance of the binary being compiled with <code>-Ofast</code> optimization and with <code>-O3</code> optimization. As in the previous task, we use a <code>Makefile</code> for compilation. The <code>Makefile</code> targets <code>poisson2d_O3</code> and <code>poisson2d_Ofast</code> are already prepared.</p>
+<p><strong>TASK</strong>: Add <code>-O3</code> as the optimization flag for the <code>poisson2d_O3</code> target by using the corresponding <code>CFLAGS</code> definition. There are notes relating to this Task 1 in the header of the <code>Makefile</code>. Compile the code using <code>make</code> as indicated below and run with the <code>Make</code> targets <code>run</code>, <code>run_perf</code> and <code>run_perf_recrep</code>.</p>
 
 </div>
 </div>
@@ -13221,7 +13200,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_O3
 </pre></div>
 
     </div>
@@ -13245,7 +13224,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>You can use the GNU <em>perf</em> tool to profile the application using the <code>perf</code> command (see below) and see the top time-consuming functions.</p>
+<p>Let's have a look at the output of the <code>Makefile</code> target <code>run_perf</code>. It invokes the Linux <em>perf</em> tool to print the number of instructions executed and the number of cycles the POWER9 processor took to execute the program. Feel free to add further counters to this call to <em>perf</em>.</p>
 
 </div>
 </div>
@@ -13255,10 +13234,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># perf record creates a perf.data file </span>
-<span class="o">!</span>perf record -o perf.O3.data -e cycles ./poisson2d
-<span class="c1"># perf report opens the perf.data file </span>
-<span class="o">!</span>perf report -i perf.O3.data <span class="p">|</span> cat
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf
 </pre></div>
 
     </div>
@@ -13269,7 +13245,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p><strong>TASK</strong>: Now change the optimization flag in the <a href="/edit/Task1/Makefile">Makefile</a> to <code>-Ofast</code> and repeat the steps in the following cell. In case you follow along non-interactive, call <code>make</code> and <code>make run</code> in your shell. (If you are in the Jupyter Notebook, you can actually click the link of the <a href="/edit/Task1/Makefile">Makefile</a>. In other cases, use <code>vim</code> which is installed on Ascent.)</p>
+<p>Next, we run the <code>Makefile</code> target <code>run_perf_recrep</code>, which prints the hottest routines of the application by using a combination of <code>perf record ./app</code> and <code>perf report</code>.</p>
 
 </div>
 </div>
@@ -13279,36 +13255,73 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make
+<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># run_perf_recrep displays the top hot routines </span>
+<span class="o">!</span>make run_perf_recrep
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p><strong>TASK</strong>: Now add the optimization flag <code>-Ofast</code> to the <code>CFLAGS</code> for target <code>poisson2d_Ofast</code>. Compile the program with the target <code>poisson2d_Ofast</code>, then run and analyse it as before with <code>run</code>, <code>run_perf</code> and <code>run_perf_recrep</code>.</p>
+<p>What difference do you see?</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_Ofast 
+<span class="o">!</span>make run
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Again, run a <code>perf</code>-instrumented version:</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># perf record creates a perf.data file </span>
-<span class="o">!</span>perf record -o perf.Ofast.data -e cycles ./poisson2d
-<span class="c1"># perf report opens the perf.data file </span>
-<span class="o">!</span>perf report -i perf.Ofast.data <span class="p">|</span> cat
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Generate the list of top routines in terms of hotness:</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf_recrep
 </pre></div>
 
     </div>
@@ -13327,7 +13340,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h4 id="Interpretation">Interpretation<a class="anchor-link" href="#Interpretation">&#182;</a></h4><p>Depending on the application requirement, if a high precision of results is not mandatory, the users can compile an application with <code>-Ofast</code> which enables <code>–ffast-math</code> option that implements the same math function in a relaxed manner very similar to how general mathematical expressions are implemented and avoids the overhead of calling a function from the math library. Comparing the files, you will see that the <code>-Ofast</code> binary natively implements the <code>fmax</code> function using instructions available in the hardware. The <code>-O3</code> binary makes a library call to compute <code>fmax</code> to follow a stricter <em>IEEE</em> requirement for accuracy.</p>
+<h4 id="Interpretation">Interpretation<a class="anchor-link" href="#Interpretation">&#182;</a></h4><p>If the application does not require high precision of results, one can compile it with <code>-Ofast</code>, which enables the <code>-ffast-math</code> option. With this option, math functions are implemented in a relaxed manner, much like general mathematical expressions, which avoids the overhead of calling a function from the math library. Comparing the profiles, you will see that the <code>-Ofast</code> binary natively implements the <code>fmax</code> function using instructions available in the hardware. The <code>-O3</code> binary makes a library call to compute <code>fmax</code> to follow a stricter <em>IEEE</em> requirement for accuracy.</p>
 
 </div>
 </div>
@@ -13335,17 +13348,26 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-B:-Profile-directed-Feedback">Part B: Profile-directed Feedback<a class="anchor-link" href="#Part-B:-Profile-directed-Feedback">&#182;</a></h3><p>For the first level of optimization we saw <code>Ofast</code> cut the execution time of the <code>O3</code> binary by almost half.</p>
-<p>We can optimize the performance further by using profile directed feedback optimization.</p>
-<p>To compile using profile directed feedback with the GCC compiler we need to do the following steps</p>
+<h3 id="Part-B:-Profile-directed-Feedback">Part B: Profile-directed Feedback<a class="anchor-link" href="#Part-B:-Profile-directed-Feedback">&#182;</a></h3><p>For the first level of optimization we see that <code>-Ofast</code> cuts the execution time of the <code>-O3</code> binary by almost half.</p>
+<p>We can optimize the performance further by using profile-directed feedback optimization.</p>
+<p>To compile using profile-directed feedback with the GCC compiler, we need to build the application in three stages:</p>
 <ol>
-<li>We need to first build a training binary using <code>-fprofile-generate</code>; this instructs the compiler to record hot path information </li>
-<li>Run the training binary with a smaller input size; you should see a <code>.gcda</code> file generated which stores hot path information for further optimization by the compiler </li>
-<li>build the final binary using <code>-fprofile-use</code> which uses the profile information in the <code>.gcda</code> file </li>
-<li>Compare the performance of the final binary with the original <code>Ofast</code> binary </li>
+<li>Instrument binary;</li>
+<li>Run binary with training, gather profile information;</li>
+<li>Use profile information to generate optimized binary.</li>
 </ol>
-<p><strong>TASK</strong>: First, search for <code>TODO1</code> in the <a href="/edit/Task1/Makefile">Makefile</a>. It defines an additional compilation flag for <code>gcc</code>. Insert <code>-fprofile-generate=FOLDER</code> there with FOLDER pointing to <code>$$SC18_DIR_SCRATCH</code>, your personal write-directory (the double dollar signs are intentional as they are used to escape in the GNU Make syntax).</p>
-<p>After editing, run the following two cells to train your program.</p>
+<p>Step 1 is achieved by compiling the binary with the correct flag – <code>-fprofile-generate</code>. In our case, we need to specify an output location, which should be <code>$(SC19_DIR_SCRATCH)</code>.</p>
+<p>Step 2 consists of a usual, albeit shorter, run of the instrumented binary. The run can be very short, though the parameters need to be representative of the actual run. After the binary has run, an output file (with file extension <code>.gcda</code>) is written to the directory specified during compilation.</p>
+<p>For Step 3, the binary is compiled once again, but this time using the <code>gcda</code> profile just generated. The corresponding flag is <code>-fprofile-use</code>, which we set to <code>$(SC19_DIR_SCRATCH)</code> as well.</p>
+<p>In our <code>Makefile</code>, we have already prepared these steps for you in the form of two targets.</p>
+<ul>
+<li><code>poisson2d_train</code>: Will compile the instrumented binary that is used to gather the profile</li>
+<li><code>poisson2d_ref</code>: Will take a generated profile and compile a new, optimized binary</li>
+</ul>
+<p>Through dependencies in the <code>Makefile</code>, a profile run is launched between these two targets.</p>
+<p><strong>TASK</strong>: Edit the <a href="`Makefile`">Makefile</a> and add the <code>-fprofile-*</code> flags to the <code>CFLAGS</code> of <code>poisson2d_train</code> and
+<code>poisson2d_ref</code> as outlined in the file.</p>
+<p>After that, you may launch them with the following cells (<code>gen_profile</code> is a meta-target and uses <code>poisson2d_train</code> and <code>poisson2d_ref</code>). If you need to clean the generated profile, you may use <code>make clean_profile</code>.</p>
 
 </div>
 </div>
@@ -13355,20 +13377,28 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_train
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make gen_profile
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_train
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run
 </pre></div>
 
     </div>
@@ -13379,9 +13409,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>Now, a <code>.gcda</code> file exists in the directory which can be used for an profile-accelerated subsequent run.</p>
-<p><strong>TASK</strong>: Edit the <a href="/edit/Task1/Makefile">Makefile</a> again, this time modifying <code>TODO2</code> to be equivalent to <code>-fprofile-use</code>. A directory is not needed as we copied the gcda file into the current directory.</p>
-<p>Run the following cells in order to build using the newly added flag and then run with the profile-accelerated version.</p>
+<p>Let's also measure instructions and cycles.</p>
 
 </div>
 </div>
@@ -13391,20 +13419,38 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_profile
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations).</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h3 id="Part-C:-Compiler-annotations/Remarks">Part C: Compiler annotations/Remarks<a class="anchor-link" href="#Part-C:-Compiler-annotations/Remarks">&#182;</a></h3><p>Most compilers provide an option to emit annotations or remarks. These remarks summarize in detail which optimizations were performed and where in the source they were applied. Some options also report optimizations that were missed and the reason why they could not be done.</p>
+<p>To generate compiler annotations using GCC, one uses <code>-fopt-info-all</code>. If you only want to see the missed optimizations, use the option <code>-fopt-info-missed</code> instead of <code>-fopt-info-all</code>. See also the <a href="https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info">documentation of GCC regarding the flag</a>.</p>
+<p><strong>TASK</strong>: Have a look at the <code>CFLAGS</code> of the <code>Makefile</code> target <code>poisson2d_Ofast_info</code>. Add the flag <code>-fopt-info-all</code> to the list of flags. This will print optimisation information to stderr, which also appears in the cell output. If you would rather print this information to a file, use, for example, <code>-fopt-info-all=(SC19_DIR_SCRATCH)/filename</code>.</p>
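+<p>A minimal sketch of the flag addition, using a target-specific variable; how <code>CFLAGS</code> is actually assembled in the tutorial's <code>Makefile</code> is an assumption here.</p>
+<pre><code># Sketch only: assumes the target picks up CFLAGS in its compile rule.
+# Report all optimization remarks, both applied and missed.
+poisson2d_Ofast_info: CFLAGS += -fopt-info-all
+# Alternative: report only the optimizations that were missed.
+# poisson2d_Ofast_info: CFLAGS += -fopt-info-missed</code></pre>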
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_profile
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_Ofast_info
 </pre></div>
 
     </div>
@@ -13415,7 +13461,34 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations)</p>
+<p>Let's compare this with the output during compilation when using profile-directed feedback from Task 1 B.</p>
+<p><strong>TASK</strong>: 
+Adapt the <code>CFLAGS</code> of <code>poisson2d_ref_info</code> to include <code>-fopt-info-all</code> <strong>and</strong> the profile input of <code>-fprofile-use=…</code> here. <em>(Be advised: Long output!)</em></p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_ref_info
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Comparing the annotations generated for a plain <code>-Ofast</code> optimization level with those generated for <code>-Ofast</code> with profile-directed feedback, we observe that many more optimizations are possible due to profile information.</p>
+<p>For instance you will see annotations such as</p>
+
+<pre><code>poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)</code></pre>
+<p>The execution count indicates how often the node was executed at runtime. This information determines which paths are hotter and subsequently facilitates additional optimizations.</p>
 
 </div>
 </div>
@@ -13443,8 +13516,12 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Task-2:-Software-Pretechting">Task 2:<a name="task2" /> Software Pretechting<a class="anchor-link" href="#Task-2:-Software-Pretechting">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>Study the difference of program execution time of different optimization levels with and without software prefetching.</p>
-<p>First, change directory to that of Task 2</p>
+<h2 id="Task-2:-Impact-of-Prefetching-on-Performance">Task 2:<a name="task2" /> Impact of Prefetching on Performance<a class="anchor-link" href="#Task-2:-Impact-of-Prefetching-on-Performance">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><ul>
+<li>Study the difference of program execution time of different optimization levels with and without software prefetching.</li>
+<li>Verify the impact by measuring cache counters with and without prefetching.</li>
+<li>Learn how to modify the contents of the DSCR (<em>Data Stream Control Register</em>) using the IBM XL compiler and study the impact of different DSCR values.</li>
+</ul>
+<p>But first, let's change directory to that of Task 2.</p>
 
 </div>
 </div>
@@ -13465,15 +13542,18 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-A:-Running">Part A: Running<a class="anchor-link" href="#Part-A:-Running">&#182;</a></h3>
+<h3 id="Part-A:-Software-Prefetching">Part A: Software Prefetching<a class="anchor-link" href="#Part-A:-Software-Prefetching">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>Look at the <a href="/edit/Task2/Makefile">Makefile</a> and work on the TODOs. Please implement compile flags as mentioned in the Makefile target name.</p>
-<p>Afterwards, compile each target with the following cells and submit them to the batch system. Follow along accordingly in the non-interactive version of this Notebook.</p>
+<p><strong>TASK</strong>: Look at the Makefile and work on the TODOs.</p>
+<ul>
+<li>First generate a <code>-Ofast</code>-optimised binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!</li>
+<li>Modify the <code>Makefile</code> to add the option for software prefetching (<code>-fprefetch-loop-arrays</code>). Compare the performance of <code>-Ofast</code> with and without software prefetching.</li>
+</ul>
 
 </div>
 </div>
@@ -13483,7 +13563,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_o3_pref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make clean
 </pre></div>
 
     </div>
@@ -13496,7 +13576,9 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_ofast_pref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d <span class="nv">CC</span><span class="o">=</span>gcc
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
 </pre></div>
 
     </div>
@@ -13509,31 +13591,66 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_o3_nopref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_pref <span class="nv">CC</span><span class="o">=</span>gcc
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p><strong>TASK</strong>: Repeat the experiment with the <code>-O3</code> flag. Have a look at the <code>Makefile</code> and the outlined TODO. There's a position to easily adapt <code>-Ofast</code>→<code>-O3</code>!</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_ofast_nopref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d <span class="nv">CC</span><span class="o">=</span>gcc -B
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_pref <span class="nv">CC</span><span class="o">=</span>gcc -B
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Do you notice how the impact differs between optimization levels? At what optimization level does software prefetching help the most?</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>Do you notice the impact difference with optimization levels? It's always important to carefully study the interplay of flags.</p>
+<h3 id="Part-B:-Analysis-of-Instructions">Part B: Analysis of Instructions<a class="anchor-link" href="#Part-B:-Analysis-of-Instructions">&#182;</a></h3><p>Compiling the <code>-Ofast</code> binary with the software prefetching flag causes the compiler to generate <code>dcb*</code> instructions, which prefetch memory values to L3.</p>
 
 </div>
 </div>
@@ -13541,8 +13658,10 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-B:-Analysis-of-Instructions">Part B: Analysis of Instructions<a class="anchor-link" href="#Part-B:-Analysis-of-Instructions">&#182;</a></h3><p>Compilation with the software prefetching flag causes the compiler to generate the <code>__dcbt</code> and <code>__dcbtst</code>  instructions that prefetch memory values to L3.</p>
-<p>Verify it using <code>objdump -lSd</code> on each file (<code>poisson2d_o3_pref</code>, <code>poisson2d_ofast_pref</code>, <code>poisson2d_o3_nopref</code>, <code>poisson2d_ofast_nopref</code>). You might want to grep for <code>dcb</code>.</p>
+<p><strong>TASK</strong>: 
+Run <code>$(SC19_SUBMIT_CMD) objdump -lSd</code> on each binary file (<code>-O3</code> and <code>-Ofast</code>, each with and without prefetching).
+Look for instructions beginning with <code>dcb</code>.
+At what optimization levels does the compiler generate software prefetching instructions?</p>
 
 </div>
 </div>
@@ -13552,7 +13671,60 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="ch">#!objdump -l…</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make <span class="nv">CC</span><span class="o">=</span>gcc -B poisson2d_pref
+<span class="o">!</span>objdump -lSd ./poisson2d_pref &gt; poisson2d.dis
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>grep dcb poisson2d.dis
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h3 id="Part-C:-Changing-Values-of-DSCR-via-compiler-flags">Part C: Changing Values of DSCR via compiler flags<a class="anchor-link" href="#Part-C:-Changing-Values-of-DSCR-via-compiler-flags">&#182;</a></h3><p>This task requires using the IBM XL compiler. It should already be in your environment.</p>
+<p>We saw the impact of software prefetching in the previous subsection.
+In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance.
+In this exercise we shall see some compiler options that can be used to modify the DSCR value, which controls the aggressiveness of prefetching. It can also be used to turn off hardware prefetching.</p>
+<p>The IBM XL compiler has an option <code>-qprefetch=dscr=&lt;val&gt;</code> that can be used for this purpose.
+Compiling with <code>-qprefetch=dscr=1</code> turns off the prefetcher. One can give various values such as <code>-qprefetch=dscr=4</code>, <code>-qprefetch=dscr=7</code> etc. to control the aggressiveness of prefetching.</p>
+<p>For this exercise we use <code>make CC=xlc_r</code> to illustrate the performance impact.</p>
+<p><strong>Task</strong>: Generate an XL-compiled binary by compiling using the following cells. After you've generated a baseline, start editing the <code>Makefile</code>: Add <code>-qprefetch=dscr=1</code> to the <code>CFLAGS</code>, rebuild the application, and note the performance. Which one is faster?</p>
+<p>In general, applications benefit from the default settings of the DSCR register (<code>-qprefetch=dscr=0</code>). However, certain applications also benefit from prefetching being turned off.</p>
+<p>Note that DSCR values are highly sensitive to the application. A value that works well for Application A may not help Application B.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Measure the performance of the application compiled with XL at the default DSCR value.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make <span class="nv">CC</span><span class="o">=</span>xlc_r -B poisson2d
+<span class="o">!</span>make run
 </pre></div>
 
     </div>
@@ -13563,7 +13735,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>If you feel up to the task, you can study the number of L3 cache misses using the corresponding performance counter, <code>PM_L3_MISS</code>. Either use your knowledge from Hands-On 1, or use the following call to <code>perf</code>, in which we already converted the named counter to a raw counter address.</p>
+<p>Measure the performance of the application compiled with XL with hardware prefetching turned off via the DSCR value.</p>
 
 </div>
 </div>
@@ -13573,14 +13745,22 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;poisson2d_ofast_nopref&quot;</span><span class="p">,</span> <span class="s2">&quot;poisson2d_ofast_pref&quot;</span><span class="p">]:</span>
-    <span class="o">!</span><span class="nb">eval</span> <span class="nv">$$</span>SC18_SUBMIT_CMD perf stat -e cycles,r168a4 ./<span class="nv">$f</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_dscr <span class="nv">CC</span><span class="o">=</span>xlc_r -B
+<span class="o">!</span>make run
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Does the hardware prefetcher help this application? How much impact do you see when you turn off the hardware prefetcher?</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
@@ -13588,6 +13768,7 @@ build <code>poisson2d</code> binary (default)</li>
 <h4 id="References">References<a class="anchor-link" href="#References">&#182;</a></h4><ol>
 <li><a href="https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html">https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html</a></li>
 <li><a href="https://www.gnu.org/software/gcc/projects/prefetch.html">https://www.gnu.org/software/gcc/projects/prefetch.html</a></li>
+<li><a href="https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0">https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0</a></li>
 </ol>
 
 </div>
@@ -13606,8 +13787,8 @@ build <code>poisson2d</code> binary (default)</li>
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h2 id="Task-3:-OpenMP">Task 3: OpenMP<a class="anchor-link" href="#Task-3:-OpenMP">&#182;</a></h2><p><a name="task3"></a></p>
-<h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-thread processes to certain cores.</p>
-<p>First, we need to change directory to that of Task3.</p>
+<h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the threads of the multi-threaded process to certain cores on the resulting application performance. We do this study for both the GCC and XL compilers in order to learn about the appropriate options that need to be used.
+First, we need to change directory to that of Task3. For Task 3 we modify poisson2d.c to invoke an exact copy of the main Jacobi loop, called <code>poisson2d_reference</code>. We parallelize only the main loop but not <code>poisson2d_reference</code>. The speedup is the performance gain seen in the main loop as compared to the reference loop.</p>
 
 </div>
 </div>
@@ -13628,13 +13809,12 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-A:-Implement-OpenMP-Pragmas;-Compilation">Part A: Implement OpenMP Pragmas; Compilation<a class="anchor-link" href="#Part-A:-Implement-OpenMP-Pragmas;-Compilation">&#182;</a></h3><p><strong>Task</strong>: Please add the correct OpenMP pragmas to the source code and compilations flags to enable OpenMP.</p>
+<h3 id="Part-A:-Implement-OpenMP-Pragmas;-Compilation">Part A: Implement OpenMP Pragmas; Compilation<a class="anchor-link" href="#Part-A:-Implement-OpenMP-Pragmas;-Compilation">&#182;</a></h3><p><strong>Task</strong>: Please add the correct OpenMP directives to poisson2d.c and the compilation flags in the Makefile to enable OpenMP with the GCC and XL compilers.</p>
 <ul>
-<li><strong>pragmas</strong>: Look at the TODOs in <a href="/edit/Task3/poisson2d.c"><code>poisson2d.c</code></a> to add OpenMP parallelism. The pragmas in question are <code>#pragma  omp parallel for</code></li>
-<li><strong>Compilation</strong>: Please add compilation flags enabling OpenMP in GCC to the <a href="/edit/Task3/Makefile">Makefile</a>. The flag in question is <code>-fopenmp</code>.</li>
+<li><strong>Directives</strong>: Look at the TODOs in <a href="poisson2d.c"><code>poisson2d.c</code></a> to add OpenMP parallelism. The pragmas in question are <code>#pragma omp parallel for</code> (and once it's <code>#pragma omp parallel for reduction(max:error)</code> – can you guess where?)</li>
+<li><strong>Compilation</strong>: Please add compilation flags enabling OpenMP in GCC and XL to the <code>Makefile</code>. For GCC, we need to add <code>-fopenmp</code> and the application needs to be linked with <code>-lgomp</code>. For XL, we need to add <code>-qsmp=omp</code> to the list of compilation flags.</li>
 </ul>
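+<p>One possible way to select the flags per compiler is sketched below. The <code>ifeq</code> switch on <code>CC</code> and the use of <code>LDLIBS</code> are assumptions for illustration; the tutorial's <code>Makefile</code> may be structured differently.</p>
+<pre><code># Sketch only: the conditional layout is hypothetical.
+ifeq ($(CC),gcc)
+# GCC: enable OpenMP directives and link against the GNU OpenMP runtime.
+CFLAGS += -fopenmp
+LDLIBS += -lgomp
+else ifeq ($(CC),xlc_r)
+# XL: enable OpenMP directives.
+CFLAGS += -qsmp=omp
+endif</code></pre>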
-<p>Edit the files with the links above if you are running the interactive version of the Notebook or navigate to <code>poisson2d.c</code> and <code>Makefile</code> yourself in case you run the non-interactive version.</p>
-<p>Afterwards, compile and run the application with the following cells. Non-interactive: Follow along accordingly in the shell.</p>
+<p>Afterwards, compile and run the application with the following commands.</p>
 
 </div>
 </div>
@@ -13644,20 +13824,28 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d <span class="nv">CC</span><span class="o">=</span>gcc
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>The command to submit a job to the batch system is prepared in an environment variable <code>$SC19_SUBMIT_CMD</code>; use it together with <code>eval</code>. In the following cell, it is shown how to invoke the application using the batch system.</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC19_SUBMIT_CMD</span> ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>
 </pre></div>
 
     </div>
@@ -13668,7 +13856,9 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>The command to submit a job to the batch system is prepared in an environment variable <code>$SC18_SUBMIT_CMD</code>; use it together with <code>eval</code>. In the following cell, it is shown how to increase the work of the application.</p>
+<p>In order to run the parallel application, we need to set the number of threads using <code>OMP_NUM_THREADS</code>.
+What is the best performance you can reach by setting the number of threads via <code>OMP_NUM_THREADS=N</code> with <code>N</code> being the number of threads? Feel free to play around with the command in the following cell, replacing <code>N</code> with a concrete value such as 1.<br>
+We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler of Ascent, from overlaying binding options. Also, we use <code>-c ALL_CPUS</code> to make all CPUs on the compute nodes available to you.</p>
 
 </div>
 </div>
@@ -13678,7 +13868,7 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC18_SUBMIT_CMD</span> ./poisson2d <span class="m">1000</span> <span class="m">1000</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>N <span class="nv">$SC19_SUBMIT_CMD</span> -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>
 </pre></div>
 
     </div>
@@ -13689,8 +13879,13 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>What is the best performance you can reach by setting the number of threads via <code>OMP_NUM_THREADS=N</code> with <code>N</code> being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.<br>
-We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler of Ascent, from overlaying binding options. Also, we use <code>-c ALL_CPUS</code> to make all CPUs on the compute nodes available to you.</p>
+<h3 id="Part-B:-Bindings">Part B: Bindings<a class="anchor-link" href="#Part-B:-Bindings">&#182;</a></h3><p>Different CPU architectures and models come with different configuration of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!</p>
+<p>There are applications which can be used to determine the configuration of the processor. Among those are:</p>
+<ul>
+<li><code>lscpu</code>: Can be used to determine the number of sockets, number of cores, and number of threads. It gives a very good overview and is available on most Linux systems.</li>
+<li><code>ppc64_cpu --smt</code>: Specifically for POWER, this tool can give information about the number of simultaneous threads running per core (<em>SMT</em>, Simultaneous Multi-Threading).</li>
+</ul>
+<p>Run <code>ppc64_cpu --smt</code> to find out about the threading configuration of Ascent!</p>
 
 </div>
 </div>
@@ -13700,7 +13895,7 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">1</span> <span class="nv">$SC18_SUBMIT_CMD</span> -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC19_SUBMIT_CMD</span> ppc64_cpu --smt
 </pre></div>
 
     </div>
@@ -13711,13 +13906,11 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-B:-Bindings">Part B: Bindings<a class="anchor-link" href="#Part-B:-Bindings">&#182;</a></h3><p>Different CPU architectures and models come with different configuration of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!</p>
-<p>There are applications which can be used to determine the configuration of the processor. Among those are:</p>
+<p>There are more sources of information available:</p>
 <ul>
-<li><code>lscpu</code>: Can be used to determine the number of sockets, number of cores, and numb of threads. It gives a very good overview and is available on most Linux systems.</li>
-<li><code>ppc64_cpu --smt</code>: Specifically for POWER, this tool can give information about the number of simulations threads running per core (<em>SMT</em>, Simulataion Multi-Threading).</li>
+<li><code>/proc/cpuinfo</code>: Holds information about virtual cores, including model and clock speed. Available on most Linux systems. Usually used together with <code>cat</code>.</li>
+<li><code>/sys/devices/system/cpu/cpu0/topology/thread_siblings_list</code>: Holds information about thread siblings for given CPU core (<code>cpu0</code> in this case). Use it to find out which thread is mapped to which core.</li>
 </ul>
-<p>Run <code>ppc64_cpu --smt</code> to find out about the threading configuration of Ascent!</p>
 
 </div>
 </div>
@@ -13727,7 +13920,8 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC18_SUBMIT_CMD</span> ppc64_cpu --smt
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">$$</span>SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
+<span class="o">!</span><span class="nv">$$</span>SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
 </pre></div>
 
     </div>
@@ -13738,11 +13932,47 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>There are more sources information available</p>
-<ul>
-<li><code>/proc/cpuinfo</code>: Holds information about virtual cores, including model and clock speed. Available on most Linux system. Usually used together with <code>cat</code></li>
-<li><code>/sys/devices/system/cpu/cpu0/topology/thread_siblings_list</code>: Holds information about thread siblings for given CPU core (<code>cpu0</code> in this case). Use it to find out which thread is mapped to which core.</li>
-</ul>
+<p>OpenMP provides environment variables, valid across compilers, to specify the binding of threads to cores. See, for instance, the <a href="https://www.openmp.org/spec-html/5.0/openmpse53.html">OMP_PLACES environment variable</a>. GCC additionally offers its own variable to control affinity: <code>GOMP_CPU_AFFINITY</code>. It only affects binaries built with GCC, but internally serves the same function as setting <code>OMP_PLACES</code>.</p>
+<p><strong>Task</strong>: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.</p>
+<p>Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.</p>
+<p>What's your maximum speedup?</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">OMP_PLACES</span><span class="o">=</span><span class="s2">&quot;{X},{Y},{Z},{A}&quot;</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">100</span> <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">GOMP_CPU_AFFINITY</span><span class="o">=</span><span class="s2">&quot;X,Y,Z,A&quot;</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">100</span> <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Great!</p>
+<p>If you still have time: The same experiments can be repeated with the IBM XL compiler. 
+The XL compiler flag that enables OpenMP parallelism is <code>-qsmp=omp</code>.</p>
+<p><strong>Task</strong>: Add the OpenMP flag in the Makefile, build the XL binary with OpenMP, run the application with various numbers of threads, and note the performance speedup.</p>
 
 </div>
 </div>
@@ -13752,8 +13982,7 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
-<span class="o">!</span>cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make <span class="nv">CC</span><span class="o">=</span>xlc_r -B run
 </pre></div>
 
     </div>
@@ -13764,9 +13993,31 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>There are various environment variables available within OpenMP (and GCC) to specify binding of threads to cores. See, for instance, the <a href="https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html">online documentation of GCC libgomp</a>. Examples are <code>OMP_PLACES</code> or <code>GOMP_CPU_AFFINITY</code>.</p>
+<p>Run the parallel application with a varying number of threads (<code>OMP_NUM_THREADS</code>) and note the performance improvement.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>N <span class="nv">$SC19_SUBMIT_CMD</span> -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Now we repeat the thread-binding exercise for the XL binary. <code>OMP_PLACES</code> is a standard OpenMP variable and therefore applies to the XL binary as well. <code>GOMP_CPU_AFFINITY</code> is specific to GCC binaries and cannot be used to set the binding here.</p>
 <p><strong>Task</strong>: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.</p>
 <p>Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.</p>
+<p>Note that we are mixing Python with Bash (<code>!</code>) here. Because of this, a Bash environment variable has to be referenced with a doubled dollar sign (<code>$$</code>).</p>
 <p>What's your maximum speedup?</p>
 
 </div>
@@ -13777,19 +14028,30 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="prompt input_prompt">In&nbsp;[&nbsp;]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">GOMP_CPU_AFFINITY</span><span class="o">=</span><span class="s2">&quot;X,Y,Z,A&quot;</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">100</span> <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">affinity</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;</span><span class="si">{X}</span><span class="s2">,</span><span class="si">{Y}</span><span class="s2">,</span><span class="si">{Z}</span><span class="s2">,</span><span class="si">{A}</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;</span><span class="si">{P}</span><span class="s2">,</span><span class="si">{Q}</span><span class="s2">,</span><span class="si">{R}</span><span class="s2">,</span><span class="si">{S}</span><span class="s2">&quot;</span><span class="p">]:</span>
+    <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Affinity: </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">affinity</span><span class="p">))</span>
+    <span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">OMP_PLACES</span><span class="o">=</span><span class="nv">$affinity</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>  <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
 </pre></div>
 
     </div>
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>As before, we see a higher speedup when we bind the threads to different cores rather than to a single core. This hands-on illustrates that, apart from compiler-level tuning, system-level tuning is equally important for obtaining performance improvements.</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h4 id="References">References<a class="anchor-link" href="#References">&#182;</a></h4><ol>
 <li><a href="https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html">https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html</a></li>
+<li><a href="https://www.openmp.org/spec-html/5.0/openmpse53.html">https://www.openmp.org/spec-html/5.0/openmpse53.html</a></li>
 </ol>
 
 </div>
@@ -13807,7 +14069,7 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h1 id="Survey">Survey<a name="survey" /><a class="anchor-link" href="#Survey">&#182;</a></h1><p>Please rememeber to take some time and fill out the <a href="http://bit.ly/sc18-eval">survey</a>.</p>
+<h1 id="Survey">Survey<a name="survey" /><a class="anchor-link" href="#Survey">&#182;</a></h1><p>Please remember to take some time and fill out the <a href="http://bit.ly/sc19-eval">survey</a>.</p>
 
 </div>
 </div>
diff --git a/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.ipynb b/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.ipynb
index a28d887db7b35f5dcc2385b79dc04e0701b7a25d..64fdf698aba6d2c4fbfee09fa68326d52e6e1853 100644
--- a/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.ipynb
+++ b/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.ipynb
@@ -5,7 +5,7 @@
    "metadata": {},
    "source": [
     "# Hands-On Performance Optimization\n",
-    "_Supercomputing 2018 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 12th 2018_\n",
+    "_Supercomputing 2019 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 18th 2019_\n",
     "\n",
     "---"
    ]
@@ -22,7 +22,7 @@
     "\n",
     "You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.\n",
     "\n",
-    "If you want you also can get a [terminal](/terminals/1) in your browser.\n",
+    "If you want, you can also get a terminal in your browser; just open it via the »New Launcher« button (`+`).\n",
     "\n",
     "## Terminal fallback\n",
     "\n",
@@ -37,7 +37,16 @@
    "source": [
     "## Setup\n",
     "\n",
-    "This hands-on session requires of GCC 6.4.0. By loading the `sc18/handson2` module before invoking this Notebook, we took care of also loading GCC 6.4.0 into the environment."
+    "We are using some very fresh compiler features and use GCC 9.2.0 because of that. It should already be in your environment. Let's check!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!gcc --version"
    ]
   },
   {
@@ -51,14 +60,16 @@
     "Please choose from the task below.\n",
     "\n",
     "\n",
-    "* [Task 1](#task1): Compile Flags  \n",
-    "Improve performance of the CPU Jacobi solver with compiler flags such as `Ofast` and profile-directed feedback ([Solution 1](#solution0))\n",
+    "* [Task 1](#task1): __Basic compiler optimization flags and compiler annotations__\n",
     "\n",
-    "* [Task 2](#task2): Software Prefetching  \n",
-    "Improve performance of the CPU Jacobi solver with software prefetching ([Solution 2](#solution1))\n",
+    "Improve performance of the CPU Jacobi solver with compiler flags such as `Ofast` and profile-directed feedback. Learn about compiler annotations.\n",
     "\n",
-    "* [Task 3](#task3): OpenMP  \n",
-    "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance ([Solution 3](#solution2))\n",
+    "* [Task 2](#task2): __Optimization via Prefetching controlled by compiler__\n",
+    "\n",
+    "Improve performance of the CPU Jacobi solver with software prefetching. Some compilers such as IBM XL define flags that can be used to modify the aggressiveness of the hardware prefetcher. Learn to modify the DSCR value through XL and study the impact on application performance. \n",
+    "* [Task 3](#task3): __Optimization via OpenMP controlled by compiler and the system__\n",
+    "\n",
+    "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance. \n",
     "  \n",
     "* [Suvery](#survey) Please remember to take the survey !\n",
     "    \n",
@@ -85,7 +96,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Task 1: Compile Flags <a name=\"task1\"></a>\n",
+    "## Task 1: Basic compiler optimization flags and compiler annotations <a name=\"task1\"></a>\n",
     "\n",
     "\n",
     "### Overview\n",
@@ -95,7 +106,10 @@
     "Your task is to:\n",
     "\n",
     "* Optimize performance with `-Ofast` flag\n",
+    "* Verify the cause for performance improvement by viewing perf profiles of O3 and Ofast binaries \n",
     "* Optimize performance with profile directed feedback \n",
+    "* Generate compiler annotations/remarks to understand the optimizations done by the compiler with and without profile directed feedback \n",
+    "\n",
     "\n",
     "First, change the working directory to `Task1`."
    ]
@@ -115,7 +129,9 @@
    "source": [
     "### Part A: `-Ofast` vs. `-O3`\n",
     "\n",
-    "We are to compare the performance of the binary being compiled with `-Ofast` optimization and with `-O3` optimization. Right now, the Makefile specifies `-O3` as the optimization flag. Compile the code using `make` and run it with `make run` in the next two cells."
+    "We are to compare the performance of the binary being compiled with `-Ofast` optimization and with `-O3` optimization. As in the previous task, we use a `Makefile` for compilation. The `Makefile` targets `poisson2d_O3` and `poisson2d_Ofast` are already prepared. \n",
+    "\n",
+    "**TASK**: Add `-O3` as the optimization flag for the `poisson2d_O3` target by using the corresponding `CFLAGS` definition. There are notes relating to this Task 1 in the header of the `Makefile`. Compile the code using `make` as indicated below and run it with the `Makefile` targets `run`, `run_perf` and `run_perf_recrep`. "
    ]
   },
   {
@@ -124,7 +140,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make"
+    "!make poisson2d_O3"
    ]
   },
   {
@@ -140,7 +156,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You can use the GNU _perf_ tool to profile the application using the `perf` command (see below) and see the top time-consuming functions."
+    "Let's have a look at the output of the `Makefile` target `run_perf`. It invokes the Linux _perf_ tool to print out the number of instructions executed and the number of cycles taken by POWER9 to execute the program. Feel free to add further counters to this call to _perf_."
    ]
   },
   {
@@ -149,17 +165,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# perf record creates a perf.data file \n",
-    "!perf record -o perf.O3.data -e cycles ./poisson2d\n",
-    "# perf report opens the perf.data file \n",
-    "!perf report -i perf.O3.data | cat"
+    "!make run_perf"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "**TASK**: Now change the optimization flag in the [Makefile](/edit/Task1/Makefile) to `-Ofast` and repeat the steps in the following cell. In case you follow along non-interactive, call `make` and `make run` in your shell. (If you are in the Jupyter Notebook, you can actually click the link of the [Makefile](/edit/Task1/Makefile). In other cases, use `vim` which is installed on Ascent.)"
+    "Next, we run the `Makefile` target `run_perf_recrep`, which prints the top routines of the application in terms of hotness by using a combination of `perf record ./app` and `perf report`. "
    ]
   },
   {
@@ -168,7 +181,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make"
+    "# run_perf_recrep displays the top hot routines \n",
+    "!make run_perf_recrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Now add the optimization flag `-Ofast` to the `CFLAGS` for target `poisson2d_Ofast`. Compile the program with the target `poisson2d_Ofast`, then run and analyse it as before with `run`, `run_perf` and `run_perf_recrep`.\n",
+    "\n",
+    "What difference do you see?"
    ]
   },
   {
@@ -177,19 +200,40 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "!make poisson2d_Ofast \n",
     "!make run"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Again, run a `perf`-instrumented version:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "# perf record creates a perf.data file \n",
-    "!perf record -o perf.Ofast.data -e cycles ./poisson2d\n",
-    "# perf report opens the perf.data file \n",
-    "!perf report -i perf.Ofast.data | cat"
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Generate the list of top routines in terms of hotness:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!make run_perf_recrep"
    ]
   },
   {
@@ -205,7 +249,7 @@
    "source": [
     "####  Interpretation\n",
     "\n",
-    "Depending on the application requirement, if a high precision of results is not mandatory, the users can compile an application with `-Ofast` which enables `–ffast-math` option that implements the same math function in a relaxed manner very similar to how general mathematical expressions are implemented and avoids the overhead of calling a function from the math library. Comparing the files, you will see that the `-Ofast` binary natively implements the `fmax` function using instructions available in the hardware. The `-O3` binary makes a library call to compute `fmax` to follow a stricter _IEEE_ requirement for accuracy."
+    "If the application does not require high-precision results, one can compile it with `-Ofast`, which enables the `-ffast-math` option. This implements math functions in a relaxed manner, much like ordinary mathematical expressions, and avoids the overhead of calling a function from the math library. Comparing the two profiles, you will see that the `-Ofast` binary natively implements the `fmax` function using instructions available in the hardware. The `-O3` binary makes a library call to compute `fmax` in order to follow the stricter _IEEE_ requirements for accuracy."
    ]
   },
   {
@@ -214,20 +258,34 @@
    "source": [
     "### Part B: Profile-directed Feedback\n",
     "\n",
-    "For the first level of optimization we saw `Ofast` cut the execution time of the `O3` binary by almost half.\n",
+    "For the first level of optimization we see that `-Ofast` cuts the execution time of the `-O3` binary by almost half.\n",
+    "\n",
+    "We can optimize the performance further by using profile-directed feedback optimization.\n",
+    "\n",
+    "To compile using profile-directed feedback with the GCC compiler we need to build the application in three stages:\n",
+    "\n",
+    "1. Instrument binary;\n",
+    "2. Run binary with training, gather profile information;\n",
+    "3. Use profile information to generate optimized binary.\n",
+    "\n",
+    "\n",
+    "Step 1 is achieved by compiling the binary with the correct flag – `-fprofile-generate`. In our case, we need to specify an output location, which should be `$(SC19_DIR_SCRATCH)`.\n",
+    "\n",
+    "Step 2 consists of a usual, albeit shorter, run of the instrumented binary. The run can be very short, though the parameters need to be representative of the actual run. After the binary has run, an output file (with file extension `.gcda`) is written to the directory specified during compilation.\n",
     "\n",
-    "We can optimize the performance further by using profile directed feedback optimization.\n",
+    "For Step 3, the binary is compiled once again, but this time using the `.gcda` profile just generated. The corresponding flag is `-fprofile-use`, which we set to `$(SC19_DIR_SCRATCH)` as well.\n",
     "\n",
-    "To compile using profile directed feedback with the GCC compiler we need to do the following steps\n",
+    "In our `Makefile` at hand, we prepared the steps already for you in the form of two targets.\n",
     "\n",
-    "1. We need to first build a training binary using `-fprofile-generate`; this instructs the compiler to record hot path information \n",
-    "2. Run the training binary with a smaller input size; you should see a `.gcda` file generated which stores hot path information for further optimization by the compiler \n",
-    "3. build the final binary using `-fprofile-use` which uses the profile information in the `.gcda` file \n",
-    "4. Compare the performance of the final binary with the original `Ofast` binary \n",
+    "* `poisson2d_train`: Will compile the binary with profile-directed feedback\n",
+    "* `poisson2d_ref`: Will take a generated profile and compile a new, optimized binary\n",
     "\n",
-    "**TASK**: First, search for `TODO1` in the [Makefile](/edit/Task1/Makefile). It defines an additional compilation flag for `gcc`. Insert `-fprofile-generate=FOLDER` there with FOLDER pointing to `$$SC18_DIR_SCRATCH`, your personal write-directory (the double dollar signs are intentional as they are used to escape in the GNU Make syntax).\n",
+    "Through dependencies, a profile run is launched between these two targets.\n",
     "\n",
-    "After editing, run the following two cells to train your program."
+    "**TASK**: Edit the [Makefile](`Makefile`) and add the `-fprofile-*` flags to the `CFLAGS` of `poisson2d_train` and\n",
+    "`poisson2d_ref` as outlined in the file.\n",
+    "\n",
+    "After that, you may launch them with the following cells (`gen_profile` is a meta-target and uses `poisson2d_train` and `poisson2d_ref`). If you need to clean the generated profile, you may use `make clean_profile`."
    ]
   },
   {
@@ -236,7 +294,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make poisson2d_train"
+    "!make gen_profile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!"
    ]
   },
   {
@@ -245,18 +310,43 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make run_train"
+    "!make run"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Now, a `.gcda` file exists in the directory which can be used for an profile-accelerated subsequent run.\n",
+    "Let's also measure instructions and cycles"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Compiler annotations/Remarks\n",
     "\n",
-    "**TASK**: Edit the [Makefile](/edit/Task1/Makefile) again, this time modifying `TODO2` to be equivalent to `-fprofile-use`. A directory is not needed as we copied the gcda file into the current directory.\n",
+    "Most compilers provide an option to emit annotations or remarks. These remarks summarize in detail the optimizations performed and the source locations where they were applied. Some options also report optimizations that were missed and the reason why they could not be done. \n",
     "\n",
-    "Run the following cells in order to build using the newly added flag and then run with the profile-accelerated version."
+    "To generate compiler annotations using GCC, one uses `-fopt-info-all`. If you only want to see the missed optimizations, use the option `-fopt-info-missed` instead of `-fopt-info-all`. See also the [documentation of GCC regarding the flag](https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info).\n",
+    "\n",
+    "**TASK**: Have a look at the `CFLAGS` of the `Makefile` target `poisson2d_Ofast_info`. Add the flag `-fopt-info-all` to the list of flags. This will print optimization information to stderr. If you would rather print this information to a file, use, for example, `-fopt-info-all=$(SC19_DIR_SCRATCH)/filename`."
    ]
   },
   {
@@ -265,7 +355,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make poisson2d_profile"
+    "!make poisson2d_Ofast_info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's compare this with the output during compilation when using profile-directed feedback from Task 1 B.\n",
+    "\n",
+    "**TASK**: \n",
+    "Adapt the `CFLAGS` of `poisson2d_ref_info` to include `-fopt-info-all` **and** the profile input of `-fprofile-use=…` here. *(Be advised: Long output!)*"
    ]
   },
   {
@@ -274,14 +374,21 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make run_profile"
+    "!make poisson2d_ref_info"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations)"
+    "Comparing the annotations generated at plain `-Ofast` with those generated at `-Ofast` plus profile-directed feedback, we observe that many more optimizations become possible thanks to the profile information.\n",
+    "\n",
+    "For instance you will see annotations such as\n",
+    "```\n",
+    "poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+    "```\n",
+    "\n",
+    "The execution count is the number of times the node was executed at runtime. This information determines which paths are hotter and subsequently facilitates additional optimizations."
    ]
   },
   {
@@ -307,14 +414,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Task 2:<a name=\"task2\"></a> Software Pretechting\n",
+    "## Task 2:<a name=\"task2\"></a> Impact of Prefetching on Performance\n",
     "\n",
     "\n",
     "### Overview\n",
     "\n",
-    "Study the difference of program execution time of different optimization levels with and without software prefetching.\n",
+    "* Study the difference of program execution time of different optimization levels with and without software prefetching.\n",
+    "* Verify the impact by measuring cache counters with and without prefetching.\n",
+    "* Learn how to modify contents of DSCR (*Data Stream Control Register*) using IBM XL compiler and study the impact with different values to DSCR. \n",
     "\n",
-    "First, change directory to that of Task 2"
+    "But first, let's change to the directory of Task 2."
    ]
   },
   {
@@ -330,16 +439,26 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Part A: Running"
+    "### Part A: Software Prefetching"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Look at the [Makefile](/edit/Task2/Makefile) and work on the TODOs. Please implement compile flags as mentioned in the Makefile target name.\n",
+    "**TASK**: Look at the Makefile and work on the TODOs. \n",
     "\n",
-    "Afterwards, compile each target with the following cells and submit them to the batch system. Follow along accordingly in the non-interactive version of this Notebook."
+    "- First generate a `-Ofast`-optimised binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!\n",
+    "- Modify the `Makefile` to add the option for software prefetching (`-fprefetch-loop-arrays`). Compare the performance of `-Ofast` with and without software prefetching."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!make clean"
    ]
   },
   {
@@ -348,7 +467,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make run_o3_pref"
+    "!make poisson2d CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
    ]
   },
   {
@@ -357,7 +478,16 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make run_ofast_pref"
+    "!make poisson2d_pref CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Repeat the experiment with the `-O3` flag. Have a look at the `Makefile` and the outlined TODO. There's a position to easily adapt `-Ofast`→`-O3`!"
    ]
   },
   {
@@ -366,7 +496,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make run_o3_nopref"
+    "!make poisson2d CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
    ]
   },
   {
@@ -375,14 +507,16 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make run_ofast_nopref"
+    "!make poisson2d_pref CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Do you notice the impact difference with optimization levels? It's always important to carefully study the interplay of flags."
+    "Do you notice how the impact differs between optimization levels? At what optimization level does software prefetching help the most?"
    ]
   },
   {
@@ -391,27 +525,86 @@
    "source": [
     "### Part B: Analysis of Instructions\n",
     "\n",
-    "Compilation with the software prefetching flag causes the compiler to generate the `__dcbt` and `__dcbtst`  instructions that prefetch memory values to L3.\n",
+    "Compilation of the `-Ofast` binary with the software prefetching flag causes the compiler to generate the `dcb*`  instructions that prefetch memory values to L3."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: \n",
+    "Run `$(SC19_SUBMIT_CMD) objdump -lSd` on each binary file (`-O3`, `-Ofast` with prefetch/no prefetch).\n",
+    "Look for instructions beginning with `dcb`.\n",
+    "At what optimization levels does the compiler generate software prefetching instructions?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!make CC=gcc -B poisson2d_pref\n",
+    "!objdump -lSd ./poisson2d_pref > poisson2d.dis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!grep dcb poisson2d.dis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Changing Values of DSCR via compiler flags\n",
+    "\n",
+    "This task requires using the IBM XL compiler. It should be already in your environment.\n",
+    "\n",
+    "\n",
+    "We saw the impact of software prefetching in the previous subsection. \n",
+    "In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance. \n",
+    "In this exercise we shall see some compiler options that can be used to modify the value of the DSCR (Data Stream Control Register), which controls the aggressiveness of prefetching. It can also be used to turn off hardware prefetching. \n",
+    "\n",
+    "The IBM XL compiler has an option `-qprefetch=dscr=<val>` that can be used for this purpose.\n",
+    "Compiling with `-qprefetch=dscr=1` turns off the prefetcher. One can give various values such as `-qprefetch=dscr=4`, `-qprefetch=dscr=7` etc. to control aggressiveness of prefetching.\n",
+    "\n",
+    "For this exercise we use `make CC=xlc_r` to illustrate the performance impact.\n",
+    "    \n",
+    "\n",
+    "**Task**: Generate an XL-compiled binary using the following cells. After you've generated a baseline, start editing the `Makefile`: add `-qprefetch=dscr=1` to the `CFLAGS`, rebuild the application, and note the performance. Which one is faster? \n",
     "\n",
-    "Verify it using `objdump -lSd` on each file (`poisson2d_o3_pref`, `poisson2d_ofast_pref`, `poisson2d_o3_nopref`, `poisson2d_ofast_nopref`). You might want to grep for `dcb`."
+    "In general, applications benefit from the default setting of the hardware DSCR register (`-qprefetch=dscr=0`). However, certain applications also benefit from prefetching being turned off. \n",
+    "\n",
+    "Note that the best DSCR value is highly application-dependent: a value that works well for Application A may not help Application B. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Measure the performance of the application compiled with XL at the default DSCR value."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "sc18": "task"
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
-    "#!objdump -l…"
+    "!make CC=xlc_r -B poisson2d\n",
+    "!make run"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "If you feel up to the task, you can study the number of L3 cache misses using the corresponding performance counter, `PM_L3_MISS`. Either use your knowledge from Hands-On 1, or use the following call to `perf`, in which we already converted the named counter to a raw counter address."
+    "Measure the performance of the application compiled with XL with the hardware prefetcher turned off (`-qprefetch=dscr=1`)."
    ]
   },
   {
@@ -420,8 +613,15 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "for f in [\"poisson2d_ofast_nopref\", \"poisson2d_ofast_pref\"]:\n",
-    "    !eval $$SC18_SUBMIT_CMD perf stat -e cycles,r168a4 ./$f\n"
+    "!make poisson2d_dscr CC=xlc_r -B\n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Does the hardware prefetcher help this application? How much impact do you see when you turn off the hardware prefetcher? "
    ]
   },
   {
@@ -431,7 +631,8 @@
     "#### References\n",
     "\n",
     "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
-    "2. https://www.gnu.org/software/gcc/projects/prefetch.html"
+    "2. https://www.gnu.org/software/gcc/projects/prefetch.html\n",
+    "3. https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0\n"
    ]
   },
   {
@@ -453,20 +654,14 @@
     "\n",
     "### Overview\n",
     "\n",
-    "We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-thread processes to certain cores.\n",
-    "\n",
-    "First, we need to change directory to that of Task3."
+    "We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-threaded process to certain cores on the resulting application performance. We do this study for both the GCC and XL compilers in order to learn about the appropriate options that need to be used.\n",
+    "First, we need to change directory to that of Task3. For Task 3 we modify `poisson2d.c` to invoke an exact copy of the main Jacobi loop, called `poisson2d_reference`. We parallelize only the main loop but not `poisson2d_reference`. The speedup is the performance gain seen in the main loop as compared to the reference loop."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "ExecuteTime": {
-     "end_time": "2018-11-07T13:47:57.724441Z",
-     "start_time": "2018-11-07T13:47:57.718745Z"
-    }
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
     "%cd ../Task3"
@@ -478,14 +673,12 @@
    "source": [
     "### Part A: Implement OpenMP Pragmas; Compilation\n",
     "\n",
-    "**Task**: Please add the correct OpenMP pragmas to the source code and compilations flags to enable OpenMP.\n",
-    "\n",
-    "* **pragmas**: Look at the TODOs in [`poisson2d.c`](/edit/Task3/poisson2d.c) to add OpenMP parallelism. The pragmas in question are `#pragma  omp parallel for`\n",
-    "* **Compilation**: Please add compilation flags enabling OpenMP in GCC to the [Makefile](/edit/Task3/Makefile). The flag in question is `-fopenmp`.\n",
+    "**Task**: Please add the correct OpenMP directives to `poisson2d.c` and compilation flags to the `Makefile` to enable OpenMP with the GCC and XL compilers.\n",
     "\n",
-    "Edit the files with the links above if you are running the interactive version of the Notebook or navigate to `poisson2d.c` and `Makefile` yourself in case you run the non-interactive version.\n",
+    "* **Directives**: Look at the TODOs in [`poisson2d.c`](poisson2d.c) to add OpenMP parallelism. The pragmas in question are `#pragma  omp parallel for` (and once it's `#pragma omp parallel for reduction(max:error)` – can you guess where?)\n",
+    "* **Compilation**: Please add compilation flags enabling OpenMP in GCC and XL to the `Makefile`. For GCC, we need to add `-fopenmp` and the application needs to be linked with `-lgomp`. For XL, we need to add `-qsmp=omp` to the list of compilation flags. \n",
     "\n",
-    "Afterwards, compile and run the application with the following cells. Non-interactive: Follow along accordingly in the shell."
+    "Afterwards, compile and run the application with the following commands."
    ]
   },
   {
@@ -494,23 +687,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!make poisson2d"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!make run"
+    "!make poisson2d CC=gcc"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The command to submit a job to the batch system is prepared in an environment variable `$SC18_SUBMIT_CMD`; use it together with `eval`. In the following cell, it is shown how to increase the work of the application."
+    "The command to submit a job to the batch system is prepared in an environment variable `$SC19_SUBMIT_CMD`; use it together with `eval`. In the following cell, it is shown how to invoke the application using the batch system. "
    ]
   },
   {
@@ -519,13 +703,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!eval $SC18_SUBMIT_CMD ./poisson2d 1000 1000"
+    "!eval $SC19_SUBMIT_CMD ./poisson2d 1000 1000 1000"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "In order to run the parallel application, we need to set the number of threads using `OMP_NUM_THREADS`.\n",
     "What is the best performance you can reach by setting the number of threads via `OMP_NUM_THREADS=N` with `N` being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.  \n",
     "We added `--bind none` to prevent `jsrun`, the scheduler of Ascent, from overlaying binding options. Also, we use `-c ALL_CPUS` to make all CPUs on the compute nodes available to you."
    ]
@@ -534,11 +719,11 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "sc18": "task"
+    "exercise": "task"
    },
    "outputs": [],
    "source": [
-    "!eval OMP_NUM_THREADS=1 $SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000"
+    "!eval OMP_NUM_THREADS=N $SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000"
    ]
   },
   {
@@ -563,7 +748,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!eval $SC18_SUBMIT_CMD ppc64_cpu --smt"
+    "!eval $SC19_SUBMIT_CMD ppc64_cpu --smt"
    ]
   },
   {
@@ -582,15 +767,15 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n",
-    "!cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n",
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "There are various environment variables available within OpenMP (and GCC) to specify binding of threads to cores. See, for instance, the [online documentation of GCC libgomp](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html). Examples are `OMP_PLACES` or `GOMP_CPU_AFFINITY`.\n",
+    "There are various environment variables to specify the binding of threads to cores. Some are defined by the OpenMP standard and hold across compilers, others are specific to GCC. See, for instance, the [OMP_PLACES environment variable](https://www.openmp.org/spec-html/5.0/openmpse53.html). There is also a GNU-specific variable to control affinity: `GOMP_CPU_AFFINITY`. Setting `GOMP_CPU_AFFINITY` only works for GCC-compiled binaries, but it serves the same function as setting `OMP_PLACES`. \n",
     "\n",
     "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
     "\n",
@@ -603,11 +788,96 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "sc18": "task"
+    "exercise": "task"
+   },
+   "outputs": [],
+   "source": [
+    "!eval OMP_DISPLAY_ENV=true OMP_PLACES=\"{X},{Y},{Z},{A}\" OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "exercise": "task"
    },
    "outputs": [],
    "source": [
-    "!eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=\"X,Y,Z,A\" OMP_NUM_THREADS=4 $$SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
+    "!eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=\"X,Y,Z,A\" OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Great!\n",
+    "\n",
+    "If you still have time: The same experiments can be repeated with the IBM XL compiler. \n",
+    "The corresponding compiler flag to enable OpenMP parallelism with XL is `-qsmp=omp`.\n",
+    "\n",
+    "**Task**: Add the OpenMP flag to the Makefile, generate XL binaries with OpenMP, run the application with various numbers of threads, and note the speedup."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!make CC=xlc_r -B run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the parallel application with a varying number of threads (`OMP_NUM_THREADS`) and note the performance improvement. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "exercise": "task"
+   },
+   "outputs": [],
+   "source": [
+    "!eval OMP_NUM_THREADS=N $SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we repeat the exercise of finding the right binding of threads for the XL binary. `OMP_PLACES` applies to the XL binary as well, as it is a standard OpenMP variable. `GOMP_CPU_AFFINITY` is specific to GCC binaries, so it cannot be used to set the binding here.\n",
+    "\n",
+    "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
+    "\n",
+    "Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.\n",
+    "\n",
+    "We are mixing Python with Bash (`!`) here, so don't get confused (because of this, if we want to use Bash environment variables, we need to double the dollar sign: `$$`).\n",
+    "\n",
+    "What's your maximum speedup?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "exercise": "task"
+   },
+   "outputs": [],
+   "source": [
+    "for affinity in [\"{X},{Y},{Z},{A}\", \"{P},{Q},{R},{S}\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true OMP_PLACES=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000  | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Likewise, we see a higher speedup when we bind the threads to different cores rather than to a single core. This hands-on illustrates that, apart from compiler-level tuning, system-level tuning is equally important to obtain performance improvements.\n"
    ]
   },
   {
@@ -615,7 +885,8 @@
    "metadata": {},
    "source": [
     "#### References\n",
-    "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html"
+    "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html\n",
+    "2. https://www.openmp.org/spec-html/5.0/openmpse53.html"
    ]
   },
   {
@@ -633,7 +904,7 @@
    "source": [
     "# Survey<a name=\"survey\"></a>\n",
     "\n",
-    "Please rememeber to take some time and fill out the [survey](http://bit.ly/sc18-eval)."
+    "Please remember to take some time and fill out the [survey](http://bit.ly/sc19-eval)."
    ]
   }
  ],
@@ -653,7 +924,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.7"
+   "version": "3.7.0"
   }
  },
  "nbformat": 4,
diff --git a/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.pdf b/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.pdf
index 00d2d949244f7bf9ddf9aca6f4e1db96b1adf874..c441f9df6b2b229cc5a4f31e855c472c5a6e884b 100644
Binary files a/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.pdf and b/3-Optimizing_POWER/Handson/HandsOnPerformanceOptimization.pdf differ
diff --git a/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.html b/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.html
index 8847efff6eb0957c55dfbcfa1cd3a201274f321b..639b1d1cdbea3815b2dd7690a5f6d6ac6ff35bf7 100644
--- a/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.html
+++ b/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.html
@@ -13017,45 +13017,6 @@ ul.typeahead-list  > li > a.pull-right {
 .highlight .vm { color: #19177C } /* Name.Variable.Magic */
 .highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
     </style>
-<style type="text/css">
-    
-/* Temporary definitions which will become obsolete with Notebook release 5.0 */
-.ansi-black-fg { color: #3E424D; }
-.ansi-black-bg { background-color: #3E424D; }
-.ansi-black-intense-fg { color: #282C36; }
-.ansi-black-intense-bg { background-color: #282C36; }
-.ansi-red-fg { color: #E75C58; }
-.ansi-red-bg { background-color: #E75C58; }
-.ansi-red-intense-fg { color: #B22B31; }
-.ansi-red-intense-bg { background-color: #B22B31; }
-.ansi-green-fg { color: #00A250; }
-.ansi-green-bg { background-color: #00A250; }
-.ansi-green-intense-fg { color: #007427; }
-.ansi-green-intense-bg { background-color: #007427; }
-.ansi-yellow-fg { color: #DDB62B; }
-.ansi-yellow-bg { background-color: #DDB62B; }
-.ansi-yellow-intense-fg { color: #B27D12; }
-.ansi-yellow-intense-bg { background-color: #B27D12; }
-.ansi-blue-fg { color: #208FFB; }
-.ansi-blue-bg { background-color: #208FFB; }
-.ansi-blue-intense-fg { color: #0065CA; }
-.ansi-blue-intense-bg { background-color: #0065CA; }
-.ansi-magenta-fg { color: #D160C4; }
-.ansi-magenta-bg { background-color: #D160C4; }
-.ansi-magenta-intense-fg { color: #A03196; }
-.ansi-magenta-intense-bg { background-color: #A03196; }
-.ansi-cyan-fg { color: #60C6C8; }
-.ansi-cyan-bg { background-color: #60C6C8; }
-.ansi-cyan-intense-fg { color: #258F8F; }
-.ansi-cyan-intense-bg { background-color: #258F8F; }
-.ansi-white-fg { color: #C5C1B4; }
-.ansi-white-bg { background-color: #C5C1B4; }
-.ansi-white-intense-fg { color: #A1A6B2; }
-.ansi-white-intense-bg { background-color: #A1A6B2; }
-
-.ansi-bold { font-weight: bold; }
-
-    </style>
 
 
 <style type="text/css">
@@ -13089,7 +13050,7 @@ div#notebook {
 
 <!-- Loading mathjax macro -->
 <!-- Load mathjax -->
-    <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_HTML"></script>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/latest.js?config=TeX-AMS_HTML"></script>
     <!-- MathJax configuration -->
     <script type="text/x-mathjax-config">
     MathJax.Hub.Config({
@@ -13116,7 +13077,7 @@ div#notebook {
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h1 id="Hands-On-Performance-Optimization">Hands-On Performance Optimization<a class="anchor-link" href="#Hands-On-Performance-Optimization">&#182;</a></h1><p><em>Supercomputing 2018 Tutorial "Application Porting and Optimization on GPU-Accelerated POWER Architectures", November 12th 2018</em></p>
+<h1 id="Hands-On-Performance-Optimization">Hands-On Performance Optimization<a class="anchor-link" href="#Hands-On-Performance-Optimization">&#182;</a></h1><p><em>Supercomputing 2019 Tutorial "Application Porting and Optimization on GPU-Accelerated POWER Architectures", November 18th 2019</em></p>
 <hr>
 
 </div>
@@ -13128,7 +13089,7 @@ div#notebook {
 <p>As for the first task of this tutorial, also this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done using an SSH connection to Ascent (or any other POWER9 computer) in your terminal.</p>
 <h2 id="Jupyter-notebook-execution">Jupyter notebook execution<a class="anchor-link" href="#Jupyter-notebook-execution">&#182;</a></h2><p>When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizng the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. Check the <em>edit</em> menu above.</p>
 <p>You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.</p>
-<p>If you want you also can get a <a href="/terminals/1">terminal</a> in your browser.</p>
+<p>If you want, you can also get a terminal in your browser; just open it via the »New Launcher« button (<code>+</code>).</p>
 <h2 id="Terminal-fallback">Terminal fallback<a class="anchor-link" href="#Terminal-fallback">&#182;</a></h2><p>The tasks are place in directories named <code>Task[1-3]</code>.</p>
 <p>Makefile targets are created to cover everything, from compile, to run and profile. Please take a look at the cells containing the make calls as a guide also for the non-interactive version of this description.</p>
 
@@ -13138,10 +13099,45 @@ div#notebook {
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Setup">Setup<a class="anchor-link" href="#Setup">&#182;</a></h2><p>This hands-on session requires of GCC 6.4.0. By loading the <code>sc18/handson2</code> module before invoking this Notebook, we took care of also loading GCC 6.4.0 into the environment.</p>
+<h2 id="Setup">Setup<a class="anchor-link" href="#Setup">&#182;</a></h2><p>We are using some very fresh compiler features and use GCC 9.2.0 because of that. It should already be in your environment. Let's check!</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[1]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>gcc --version
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>gcc (GCC) 9.2.0
+Copyright (C) 2019 Free Software Foundation, Inc.
+This is free software; see the source for copying conditions.  There is NO
+warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+
+</pre>
+</div>
+</div>
 
 </div>
 </div>
+
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
@@ -13149,17 +13145,19 @@ div#notebook {
 <h2 id="Tasks">Tasks<a name="top" /><a class="anchor-link" href="#Tasks">&#182;</a></h2><p>This session comes with multiple tasks, each one to be found in the respective sub-directory <code>Task[1-3]</code>. In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.</p>
 <p>Please choose from the task below.</p>
 <ul>
-<li><p><a href="#task1">Task 1</a>: Compile Flags<br>
-Improve performance of the CPU Jacobi solver with compiler flags such as <code>Ofast</code> and profile-directed feedback (<a href="#solution0">Solution 1</a>)</p>
-</li>
-<li><p><a href="#task2">Task 2</a>: Software Prefetching<br>
-Improve performance of the CPU Jacobi solver with software prefetching (<a href="#solution1">Solution 2</a>)</p>
-</li>
-<li><p><a href="#task3">Task 3</a>: OpenMP<br>
-Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance (<a href="#solution2">Solution 3</a>)</p>
-</li>
-<li><p><a href="#survey">Suvery</a> Please remember to take the survey !</p>
-</li>
+<li><a href="#task1">Task 1</a>: <strong>Basic compiler optimization flags and compiler annotations</strong></li>
+</ul>
+<p>Improve performance of the CPU Jacobi solver with compiler flags such as <code>Ofast</code> and profile-directed feedback. Learn about compiler annotations.</p>
+<ul>
+<li><a href="#task2">Task 2</a>: <strong>Optimization via Prefetching controlled by compiler</strong></li>
+</ul>
+<p>Improve performance of the CPU Jacobi solver with software prefetching. Some compilers such as IBM XL define flags that can be used to modify the aggressiveness of the hardware prefetcher. Learn to modify the DSCR value through XL and study the impact on application performance.</p>
+<ul>
+<li><a href="#task3">Task 3</a>: <strong>Optimization via OpenMP controlled by compiler and the system</strong></li>
+</ul>
+<p>Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance.</p>
+<ul>
+<li><a href="#survey">Survey</a> Please remember to take the survey!</li>
 </ul>
 <h3 id="Make-Targets-">Make Targets <a name="make" /><a class="anchor-link" href="#Make-Targets-">&#182;</a></h3><p>For all tasks we have defined the following make targets.</p>
 <ul>
@@ -13184,11 +13182,13 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Task-1:-Compile-Flags-">Task 1: Compile Flags <a name="task1" /><a class="anchor-link" href="#Task-1:-Compile-Flags-">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>The goal of this task is to understand different options available to optimize the performance of the CPU Jacobi solver</p>
+<h2 id="Task-1:-Basic-compiler-optimization-flags-and-compiler-annotations-">Task 1: Basic compiler optimization flags and compiler annotations <a name="task1" /><a class="anchor-link" href="#Task-1:-Basic-compiler-optimization-flags-and-compiler-annotations-">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>The goal of this task is to understand different options available to optimize the performance of the CPU Jacobi solver</p>
 <p>Your task is to:</p>
 <ul>
 <li>Optimize performance with <code>-Ofast</code> flag</li>
+<li>Verify the cause for performance improvement by viewing perf profiles of O3 and Ofast binaries </li>
 <li>Optimize performance with profile directed feedback </li>
+<li>Generate compiler annotations/remarks to understand the optimizations done by the compiler with and without profile directed feedback </li>
 </ul>
 <p>First, change the working directory to <code>Task1</code>.</p>
 
@@ -13197,7 +13197,7 @@ build <code>poisson2d</code> binary (default)</li>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[1]:</div>
+<div class="prompt input_prompt">In&nbsp;[4]:</div>
 <div class="inner_cell">
     <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cd</span> Task1
@@ -13217,7 +13217,7 @@ build <code>poisson2d</code> binary (default)</li>
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task1
+<pre>/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task1
 </pre>
 </div>
 </div>
@@ -13229,17 +13229,18 @@ build <code>poisson2d</code> binary (default)</li>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-A:--Ofast-vs.--O3">Part A: <code>-Ofast</code> vs. <code>-O3</code><a class="anchor-link" href="#Part-A:--Ofast-vs.--O3">&#182;</a></h3><p>We are to compare the performance of the binary being compiled with <code>-Ofast</code> optimization and with <code>-O3</code> optimization. Right now, the Makefile specifies <code>-O3</code> as the optimization flag. Compile the code using <code>make</code> and run it with <code>make run</code> in the next two cells.</p>
+<h3 id="Part-A:--Ofast-vs.--O3">Part A: <code>-Ofast</code> vs. <code>-O3</code><a class="anchor-link" href="#Part-A:--Ofast-vs.--O3">&#182;</a></h3><p>We are to compare the performance of the binary being compiled with <code>-Ofast</code> optimization and with <code>-O3</code> optimization. As in the previous task, we use a <code>Makefile</code> for compilation. The <code>Makefile</code> targets <code>poisson2d_O3</code> and <code>poisson2d_Ofast</code> are already prepared.</p>
+<p><strong>TASK</strong>: Add <code>-O3</code> as the optimization flag for the <code>poisson2d_O3</code> target by using the corresponding <code>CFLAGS</code> definition. There are notes relating to this Task 1 in the header of the <code>Makefile</code>. Compile the code using <code>make</code> as indicated below and run with the <code>Make</code> targets <code>run</code>, <code>run_perf</code> and <code>run_perf_recrep</code>.</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[2]:</div>
+<div class="prompt input_prompt">In&nbsp;[84]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_O3
 </pre></div>
 
     </div>
@@ -13256,8 +13257,8 @@ build <code>poisson2d</code> binary (default)</li>
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm
-/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm
+<pre>gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d_reference.c -o poisson2d_reference.o  -lm
+gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d.c poisson2d_reference.o -o poisson2d -lm
 </pre>
 </div>
 </div>
@@ -13268,7 +13269,7 @@ build <code>poisson2d</code> binary (default)</li>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[3]:</div>
+<div class="prompt input_prompt">In&nbsp;[73]:</div>
 <div class="inner_cell">
     <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run
@@ -13288,25 +13289,24 @@ build <code>poisson2d</code> binary (default)</li>
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d
-Job &lt;5033&gt; is submitted to default queue &lt;batch&gt;.
+<pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24897&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-1.13user 0.00system 0:01.15elapsed 97%CPU (0avgtext+0avgdata 10944maxresident)k
-2560inputs+0outputs (1major+264minor)pagefaults 0swaps
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k
+256inputs+0outputs (0major+480minor)pagefaults 0swaps
 </pre>
 </div>
 </div>
@@ -13318,20 +13318,17 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>You can use the GNU <em>perf</em> tool to profile the application using the <code>perf</code> command (see below) and see the top time-consuming functions.</p>
+<p>Let's have a look at the output of the <code>Makefile</code> target <code>run_perf</code>. It invokes the Linux <em>perf</em> tool to print the number of instructions executed and the number of cycles POWER9 took to execute the program. Feel free to add further counters to this call to <em>perf</em>.</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[4]:</div>
+<div class="prompt input_prompt">In&nbsp;[74]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># perf record creates a perf.data file </span>
-<span class="o">!</span>perf record -o perf.O3.data -e cycles ./poisson2d
-<span class="c1"># perf report opens the perf.data file </span>
-<span class="o">!</span>perf report -i perf.O3.data <span class="p">|</span> cat
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf
 </pre></div>
 
     </div>
@@ -13348,49 +13345,121 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+<pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d
+Job &lt;24898&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+
+ Performance counter stats for &#39;./poisson2d&#39;:
+
+       16264721613      cycles:u                                                    
+       28463907825      instructions:u            #    1.75  insn per cycle                                            
+
+       4.738444892 seconds time elapsed
+
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Next, we run the <code>Makefile</code> target <code>run_perf_recrep</code>, which lists the hottest routines of the application by combining <code>perf record ./app</code> and <code>perf report</code>.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[75]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># run_perf_recrep displays the top hot routines </span>
+<span class="o">!</span>make run_perf_recrep
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d
+Job &lt;24899&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-[ perf record: Woken up 1 times to write data ]
-[ perf record: Captured and wrote 0.172 MB perf.O3.data (4125 samples) ]
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+[ perf record: Woken up 3 times to write data ]
+[ perf record: Captured and wrote 0.739 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (19102 samples) ]
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio
+Job &lt;24900&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
-# Samples: 4K of event &#39;cycles:u&#39;
-# Event count (approx.): 3867635297
+# Samples: 19K of event &#39;cycles:u&#39;
+# Event count (approx.): 16254596654
 #
-# Overhead  Command    Shared Object      Symbol                                  
-# ........  .........  .................  ........................................
+# Overhead  Command    Shared Object  Symbol                                  
+# ........  .........  .............  ........................................
 #
-    72.02%  poisson2d  poisson2d          [.] 00000040.plt_call.fmax@@GLIBC_2.17
-    10.16%  poisson2d  poisson2d          [.] poisson2d_reference
-     9.99%  poisson2d  poisson2d          [.] main
-     4.69%  poisson2d  libc-2.17.so       [.] __memcpy_power7
-     2.23%  poisson2d  libm-2.17.so       [.] __fmaxf
-     0.75%  poisson2d  libm-2.17.so       [.] __exp_finite
-     0.07%  poisson2d  poisson2d          [.] 00000040.plt_call.memcpy@@GLIBC_2.17
-     0.02%  poisson2d  poisson2d          [.] check_results
-     0.02%  poisson2d  libm-2.17.so       [.] __GI___exp
-     0.01%  poisson2d  ld-2.17.so         [.] _dl_relocate_object
-     0.01%  poisson2d  [kernel.kallsyms]  [k] arch_local_irq_restore
-     0.00%  poisson2d  ld-2.17.so         [.] _dl_new_object
-     0.00%  poisson2d  ld-2.17.so         [.] _start
+    65.50%  poisson2d  poisson2d      [.] 00000038.plt_call.fmax@@GLIBC_2.17
+    21.21%  poisson2d  poisson2d      [.] main
+     9.18%  poisson2d  libc-2.17.so   [.] __memcpy_power7
+     3.28%  poisson2d  libm-2.17.so   [.] __fmaxf
+     0.74%  poisson2d  libm-2.17.so   [.] __exp_finite
+     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17
+     0.01%  poisson2d  libm-2.17.so   [.] __GI___exp
+     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253
+     0.01%  poisson2d  ld-2.17.so     [.] do_lookup_x
+     0.00%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x
+     0.00%  poisson2d  ld-2.17.so     [.] _dl_relocate_object
+     0.00%  poisson2d  ld-2.17.so     [.] strcmp
+     0.00%  poisson2d  ld-2.17.so     [.] _wordcopy_fwd_aligned
+     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start
+     0.00%  poisson2d  ld-2.17.so     [.] _start
 
 
 #
-# (Tip: Show user configuration overrides: perf config --user --list)
+# (Tip: Limit to show entries above 5% only: perf report --percent-limit 5)
 #
 </pre>
 </div>
@@ -13403,17 +13472,19 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p><strong>TASK</strong>: Now change the optimization flag in the <a href="/edit/Task1/Makefile">Makefile</a> to <code>-Ofast</code> and repeat the steps in the following cell. In case you follow along non-interactive, call <code>make</code> and <code>make run</code> in your shell. (If you are in the Jupyter Notebook, you can actually click the link of the <a href="/edit/Task1/Makefile">Makefile</a>. In other cases, use <code>vim</code> which is installed on Ascent.)</p>
+<p><strong>TASK</strong>: Now add the optimization flag <code>-Ofast</code> to the <code>CFLAGS</code> for target <code>poisson2d_Ofast</code>. Compile the program with the target <code>poisson2d_Ofast</code>, then run and analyse it as before with <code>run</code>, <code>run_perf</code> and <code>run_perf_recrep</code>.</p>
+<p>What difference do you see?</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[5]:</div>
+<div class="prompt input_prompt">In&nbsp;[76]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_Ofast 
+<span class="o">!</span>make run
 </pre></div>
 
     </div>
@@ -13430,8 +13501,25 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm
-/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm
+<pre>gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast   poisson2d.c poisson2d_reference.o -o poisson2d -lm
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24901&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+2.41user 0.00system 0:02.41elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k
+256inputs+0outputs (0major+480minor)pagefaults 0swaps
 </pre>
 </div>
 </div>
@@ -13439,13 +13527,21 @@ Calculate current execution.
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Again, run a <code>perf</code>-instrumented version:</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[6]:</div>
+<div class="prompt input_prompt">In&nbsp;[77]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf
 </pre></div>
 
     </div>
@@ -13462,25 +13558,30 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d
-Job &lt;5034&gt; is submitted to default queue &lt;batch&gt;.
+<pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d
+Job &lt;24902&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-0.51user 0.00system 0:00.52elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k
-256inputs+0outputs (0major+264minor)pagefaults 0swaps
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+
+ Performance counter stats for &#39;./poisson2d&#39;:
+
+        8258991976      cycles:u                                                    
+       12013091172      instructions:u            #    1.45  insn per cycle                                            
+
+       2.408703909 seconds time elapsed
+
 </pre>
 </div>
 </div>
@@ -13488,16 +13589,21 @@ Calculate current execution.
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Generate the list of top routines in terms of hotness:</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[7]:</div>
+<div class="prompt input_prompt">In&nbsp;[78]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># perf record creates a perf.data file </span>
-<span class="o">!</span>perf record -o perf.Ofast.data -e cycles ./poisson2d
-<span class="c1"># perf report opens the perf.data file </span>
-<span class="o">!</span>perf report -i perf.Ofast.data <span class="p">|</span> cat
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf_recrep
 </pre></div>
 
     </div>
@@ -13514,45 +13620,58 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+<pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d
+Job &lt;24903&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-[ perf record: Woken up 1 times to write data ]
-[ perf record: Captured and wrote 0.086 MB perf.Ofast.data (1889 samples) ]
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+[ perf record: Woken up 2 times to write data ]
+[ perf record: Captured and wrote 0.382 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (9728 samples) ]
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio
+Job &lt;24904&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
-# Samples: 1K of event &#39;cycles:u&#39;
-# Event count (approx.): 1765737747
+# Samples: 9K of event &#39;cycles:u&#39;
+# Event count (approx.): 8268811890
 #
-# Overhead  Command    Shared Object  Symbol                 
-# ........  .........  .............  .......................
+# Overhead  Command    Shared Object  Symbol                                  
+# ........  .........  .............  ........................................
 #
-    44.65%  poisson2d  poisson2d      [.] main
-    43.84%  poisson2d  poisson2d      [.] poisson2d_reference
-    10.28%  poisson2d  libc-2.17.so   [.] __memcpy_power7
-     1.12%  poisson2d  libm-2.17.so   [.] __exp_finite
-     0.05%  poisson2d  poisson2d      [.] check_results
-     0.03%  poisson2d  ld-2.17.so     [.] _dl_relocate_object
-     0.02%  poisson2d  libc-2.17.so   [.] __readdir64
-     0.01%  poisson2d  ld-2.17.so     [.] _dl_new_object
+    81.12%  poisson2d  poisson2d      [.] main
+    17.97%  poisson2d  libc-2.17.so   [.] __memcpy_power7
+     0.79%  poisson2d  libm-2.17.so   [.] __exp_finite
+     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17
+     0.02%  poisson2d  ld-2.17.so     [.] do_lookup_x
+     0.01%  poisson2d  libc-2.17.so   [.] vfprintf@@GLIBC_2.17
+     0.01%  poisson2d  libc-2.17.so   [.] _dl_addr
+     0.01%  poisson2d  ld-2.17.so     [.] _dl_relocate_object
+     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253
+     0.01%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x
+     0.01%  poisson2d  ld-2.17.so     [.] strcmp
+     0.00%  poisson2d  ld-2.17.so     [.] open_path
+     0.00%  poisson2d  ld-2.17.so     [.] init_tls
+     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start
      0.00%  poisson2d  ld-2.17.so     [.] _start
 
 
 #
-# (Tip: System-wide collection from all CPUs: perf record -a)
+# (Tip: For tracepoint events, try: perf report -s trace_fields)
 #
 </pre>
 </div>
@@ -13573,7 +13692,7 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h4 id="Interpretation">Interpretation<a class="anchor-link" href="#Interpretation">&#182;</a></h4><p>Depending on the application requirement, if a high precision of results is not mandatory, the users can compile an application with <code>-Ofast</code> which enables <code>–ffast-math</code> option that implements the same math function in a relaxed manner very similar to how general mathematical expressions are implemented and avoids the overhead of calling a function from the math library. Comparing the files, you will see that the <code>-Ofast</code> binary natively implements the <code>fmax</code> function using instructions available in the hardware. The <code>-O3</code> binary makes a library call to compute <code>fmax</code> to follow a stricter <em>IEEE</em> requirement for accuracy.</p>
+<h4 id="Interpretation">Interpretation<a class="anchor-link" href="#Interpretation">&#182;</a></h4><p>If the application does not require strict floating-point semantics, it can be compiled with <code>-Ofast</code>, which enables the <code>-ffast-math</code> option. With this option, the compiler may implement math functions in a relaxed manner, much like ordinary arithmetic expressions, and so avoids the overhead of calling a function from the math library. Comparing the two profiles, you will see that the <code>fmax</code> call, which dominates the <code>-O3</code> profile, no longer appears in the <code>-Ofast</code> profile: the <code>-Ofast</code> binary implements <code>fmax</code> natively using instructions available in the hardware, whereas the <code>-O3</code> binary makes a library call to compute <code>fmax</code> in order to follow the stricter <em>IEEE</em> requirements.</p>
 
 </div>
 </div>
@@ -13581,27 +13700,85 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-B:-Profile-directed-Feedback">Part B: Profile-directed Feedback<a class="anchor-link" href="#Part-B:-Profile-directed-Feedback">&#182;</a></h3><p>For the first level of optimization we saw <code>Ofast</code> cut the execution time of the <code>O3</code> binary by almost half.</p>
-<p>We can optimize the performance further by using profile directed feedback optimization.</p>
-<p>To compile using profile directed feedback with the GCC compiler we need to do the following steps</p>
+<h3 id="Part-B:-Profile-directed-Feedback">Part B: Profile-directed Feedback<a class="anchor-link" href="#Part-B:-Profile-directed-Feedback">&#182;</a></h3><p>For the first level of optimization, we saw that <code>-Ofast</code> cuts the execution time of the <code>-O3</code> binary almost in half.</p>
+<p>We can optimize the performance further by using profile-directed feedback optimization.</p>
+<p>To compile using profile-directed feedback with the GCC compiler, we need to build the application in three stages:</p>
 <ol>
-<li>We need to first build a training binary using <code>-fprofile-generate</code>; this instructs the compiler to record hot path information </li>
-<li>Run the training binary with a smaller input size; you should see a <code>.gcda</code> file generated which stores hot path information for further optimization by the compiler </li>
-<li>build the final binary using <code>-fprofile-use</code> which uses the profile information in the <code>.gcda</code> file </li>
-<li>Compare the performance of the final binary with the original <code>Ofast</code> binary </li>
+<li>Build an instrumented binary;</li>
+<li>Run the instrumented binary on a training input to gather profile information;</li>
+<li>Use the profile information to generate an optimized binary.</li>
 </ol>
-<p><strong>TASK</strong>: First, search for <code>TODO1</code> in the <a href="/edit/Task1/Makefile">Makefile</a>. It defines an additional compilation flag for <code>gcc</code>. Insert <code>-fprofile-generate=FOLDER</code> there with FOLDER pointing to <code>$$SC18_DIR_SCRATCH</code>, your personal write-directory (the double dollar signs are intentional as they are used to escape in the GNU Make syntax).</p>
-<p>After editing, run the following two cells to train your program.</p>
+<p>Step 1 is achieved by compiling the binary with the correct flag – <code>-fprofile-generate</code>. In our case, we need to specify an output location, which should be <code>$(SC19_DIR_SCRATCH)</code>.</p>
+<p>Step 2 consists of a usual, albeit shorter, run of the instrumented binary. The run can be very short, though the parameters need to be representative of the actual run. After the binary has run, an output file (with file extension <code>.gcda</code>) is written to the directory specified during compilation.</p>
+<p>For Step 3, the binary is compiled once again, but this time using the <code>gcda</code> profile just generated. The corresponding flag is <code>-fprofile-use</code>, which we set to <code>$(SC19_DIR_SCRATCH)</code> as well.</p>
+<p>In the <code>Makefile</code> at hand, we have already prepared these steps for you in the form of two targets.</p>
+<ul>
+<li><code>poisson2d_train</code>: Will compile the instrumented binary (using <code>-fprofile-generate</code>)</li>
+<li><code>poisson2d_ref</code>: Will take a generated profile and compile a new, optimized binary</li>
+</ul>
+<p>Through <code>Makefile</code> dependencies, a profile run is launched between these two targets.</p>
+<p><strong>TASK</strong>: Edit the <a href="`Makefile`">Makefile</a> and add the <code>-fprofile-*</code> flags to the <code>CFLAGS</code> of <code>poisson2d_train</code> and
+<code>poisson2d_ref</code> as outlined in the file.</p>
+<p>After that, you may launch them with the following cells (<code>gen_profile</code> is a meta-target and uses <code>poisson2d_train</code> and <code>poisson2d_ref</code>). If you need to clean the generated profile, you may use <code>make clean_profile</code>.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[79]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make gen_profile
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm 
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100
+Job &lt;24905&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh
+Calculate current execution.
+    0, 0.249490
+echo `date` &gt; /gpfs/wolf/trn003/scratch/aherten//.profile_generated
+gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_ref -lm 
+cp poisson2d_ref poisson2d
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[8]:</div>
+<div class="prompt input_prompt">In&nbsp;[80]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_train
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run
 </pre></div>
 
     </div>
@@ -13618,8 +13795,24 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  &#34;-fprofile-generate=$SC18_DIR_SCRATCH&#34; -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm
-/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  &#34;-fprofile-generate=$SC18_DIR_SCRATCH&#34; -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_train  -lm
+<pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24906&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+2.28user 0.01system 0:02.30elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k
+256inputs+0outputs (0major+479minor)pagefaults 0swaps
 </pre>
 </div>
 </div>
@@ -13627,13 +13820,29 @@ Calculate current execution.
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Great! It is! In our tests, this shaved off another 5%.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Let's also measure instructions and cycles:</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[9]:</div>
+<div class="prompt input_prompt">In&nbsp;[81]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_train
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_perf
 </pre></div>
 
     </div>
@@ -13650,20 +13859,30 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_train 200 64 64
-Job &lt;5035&gt; is submitted to default queue &lt;batch&gt;.
+<pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d
+Job &lt;24907&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 200 iterations on 64 x 64 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.248743
-  100, 0.124046
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.248743
-  100, 0.124046
-0.00user 0.00system 0:00.10elapsed 5%CPU (0avgtext+0avgdata 5248maxresident)k
-512inputs+0outputs (0major+115minor)pagefaults 0swaps
-mv $SC18_DIR_SCRATCH/*.gcda .
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+
+ Performance counter stats for &#39;./poisson2d&#39;:
+
+        7925983538      cycles:u                                                    
+       12253080719      instructions:u            #    1.55  insn per cycle                                            
+
+       2.313471365 seconds time elapsed
+
 </pre>
 </div>
 </div>
@@ -13675,19 +13894,27 @@ mv $SC18_DIR_SCRATCH/*.gcda .
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>Now, a <code>.gcda</code> file exists in the directory which can be used for an profile-accelerated subsequent run.</p>
-<p><strong>TASK</strong>: Edit the <a href="/edit/Task1/Makefile">Makefile</a> again, this time modifying <code>TODO2</code> to be equivalent to <code>-fprofile-use</code>. A directory is not needed as we copied the gcda file into the current directory.</p>
-<p>Run the following cells in order to build using the newly added flag and then run with the profile-accelerated version.</p>
+<p>What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations).</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h3 id="Part-C:-Compiler-annotations/Remarks">Part C: Compiler annotations/Remarks<a class="anchor-link" href="#Part-C:-Compiler-annotations/Remarks">&#182;</a></h3><p>Most compilers provide an option to emit annotations or remarks. These remarks summarize in detail which optimizations were performed and where in the source they were applied. There are also options that report optimizations that were missed and the reason why they could not be applied.</p>
+<p>To generate compiler annotations using GCC, one uses <code>-fopt-info-all</code>. If you only want to see the missed optimizations, use the option <code>-fopt-info-missed</code> instead of <code>-fopt-info-all</code>. See also the <a href="https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info">documentation of GCC regarding the flag</a>.</p>
+<p><strong>TASK</strong>: Have a look at the <code>CFLAGS</code> of the <code>Makefile</code> target <code>poisson2d_Ofast_info</code>. Add the flag <code>-fopt-info-all</code> to the list of flags. This will print optimisation information to stdout. If you would rather print this information to a file, use, for example, <code>-fopt-info-all=$(SC19_DIR_SCRATCH)/filename</code>.</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[10]:</div>
+<div class="prompt input_prompt">In&nbsp;[82]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_profile
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_Ofast_info
 </pre></div>
 
     </div>
@@ -13704,8 +13931,74 @@ mv $SC18_DIR_SCRATCH/*.gcda .
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  &#34;-fprofile-use&#34; -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm
-/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  &#34;-fprofile-use&#34; -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_profile  -lm
+<pre>gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all poisson2d.c poisson2d_reference.o -o poisson2d_Ofast_info  -lm
+poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:161:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:159:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:158:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:142:31: missed:   not inlinable: main/33 -&gt; printf/36, function body not available
+poisson2d.c:103:5: missed:   not inlinable: main/33 -&gt; __builtin_puts/37, function body not available
+poisson2d.c:96:5: missed:   not inlinable: main/33 -&gt; printf/36, function body not available
+poisson2d.c:78:29: missed:   not inlinable: main/33 -&gt; exp/35, function body not available
+poisson2d.c:68:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+poisson2d.c:67:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+poisson2d.c:65:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+Unit growth for small function inlining: 207-&gt;207 (0%)
+
+Inlined 4 calls, eliminated 0 functions
+
+consider run-time aliasing test between *_84 and *_87
+consider run-time aliasing test between *_92 and *_97
+consider run-time aliasing test between *_104 and *_107
+consider run-time aliasing test between *_111 and *_115
+poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.
+poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.
+poisson2d.c:108:25: missed: couldn&#39;t vectorize loop
+poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
+poisson2d.c:136:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:136:9: missed: Loop costings may not be worthwhile.
+poisson2d.c:131:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:131:9: missed: Loop costings may not be worthwhile.
+poisson2d.c:122:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);
+poisson2d.c:112:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:112:9: missed: not vectorized: control flow in loop.
+poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors
+poisson2d.c:88:5: missed: couldn&#39;t vectorize loop
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);
+poisson2d.c:72:5: missed: couldn&#39;t vectorize loop
+poisson2d.c:78:27: missed: not vectorized: complicated access pattern.
+poisson2d.c:74:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);
+poisson2d.c:43:5: note: vectorized 1 loops in function.
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);
+poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);
+poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);
+poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);
+poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);
+poisson2d.c:96:5: missed: statement clobbers memory: printf (&#34;Jacobi relaxation calculation: max %d iterations on %d x %d mesh\n&#34;, iter_max_130, ny_139, nx_195);
+poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&amp;&#34;Calculate current execution.&#34;[0]);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);
+poisson2d.c:142:31: missed: statement clobbers memory: printf (&#34;%5d, %0.6f\n&#34;, iter_237, error_219);
+poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);
+poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);
+poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);
+poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);
+poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);
+poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);
 </pre>
 </div>
 </div>
@@ -13713,13 +14006,23 @@ mv $SC18_DIR_SCRATCH/*.gcda .
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Let's compare this with the compilation output when using profile-directed feedback from Task 1, Part B.</p>
+<p><strong>TASK</strong>: 
+Adapt the <code>CFLAGS</code> of <code>poisson2d_ref_info</code> to include <code>-fopt-info-all</code> <strong>and</strong> the profile input of <code>-fprofile-use=…</code> here. <em>(Be advised: Long output!)</em></p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[11]:</div>
+<div class="prompt input_prompt">In&nbsp;[83]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_profile
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_ref_info
 </pre></div>
 
     </div>
@@ -13736,25 +14039,220 @@ mv $SC18_DIR_SCRATCH/*.gcda .
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_profile
-Job &lt;5036&gt; is submitted to default queue &lt;batch&gt;.
+<pre>gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm 
+poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).
+Increasing alignment of decl: __gcov0.main
+poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_D_00100_1_main/48 -&gt; __gcov_exit/55, function body not available
+poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_I_00100_0_main/47 -&gt; __gcov_init/54, function body not available
+poisson2d.c:161:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:159:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:158:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:142:31: missed:   not inlinable: main/33 -&gt; printf/36, function body not available
+poisson2d.c:103:5: missed:   not inlinable: main/33 -&gt; __builtin_puts/37, function body not available
+poisson2d.c:96:5: missed:   not inlinable: main/33 -&gt; printf/36, function body not available
+poisson2d.c:78:29: missed:   not inlinable: main/33 -&gt; exp/35, function body not available
+poisson2d.c:68:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+poisson2d.c:67:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+poisson2d.c:65:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+Unit growth for small function inlining: 295-&gt;295 (0%)
+
+Inlined 4 calls, eliminated 0 functions
+
+consider run-time aliasing test between *_84 and *_87
+consider run-time aliasing test between *_92 and *_97
+consider run-time aliasing test between *_104 and *_107
+consider run-time aliasing test between *_111 and *_115
+poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.
+poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);
+poisson2d.c:108:25: missed: couldn&#39;t vectorize loop
+poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
+poisson2d.c:136:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:136:9: missed: Loop costings may not be worthwhile.
+poisson2d.c:131:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:131:9: missed: Loop costings may not be worthwhile.
+poisson2d.c:122:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:122:9: missed: not vectorized: control flow in loop.
+poisson2d.c:112:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:112:9: missed: not vectorized: control flow in loop.
+poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors
+poisson2d.c:88:5: missed: couldn&#39;t vectorize loop
+poisson2d.c:88:5: missed: not vectorized: control flow in loop.
+poisson2d.c:72:5: missed: couldn&#39;t vectorize loop
+poisson2d.c:72:5: missed: not vectorized: control flow in loop.
+poisson2d.c:74:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);
+poisson2d.c:43:5: note: vectorized 1 loops in function.
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);
+poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);
+poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);
+poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);
+poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);
+poisson2d.c:96:5: missed: statement clobbers memory: printf (&#34;Jacobi relaxation calculation: max %d iterations on %d x %d mesh\n&#34;, iter_max_337, ny_124, nx_286);
+poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&amp;&#34;Calculate current execution.&#34;[0]);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);
+poisson2d.c:142:31: missed: statement clobbers memory: printf (&#34;%5d, %0.6f\n&#34;, iter_316, error_118);
+poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_127);
+poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_311);
+poisson2d.c:161:5: missed: statement clobbers memory: free (A_122);
+poisson2d.c:65:41: missed: statement clobbers memory: A_129 = malloc (8000000);
+poisson2d.c:67:41: missed: statement clobbers memory: Anew_132 = malloc (8000000);
+poisson2d.c:68:41: missed: statement clobbers memory: rhs_140 = malloc (8000000);
+poisson2d.c:136:9: note: considering unrolling loop 7 at BB 53
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)
+poisson2d.c:131:9: note: considering unrolling loop 6 at BB 50
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)
+poisson2d.c:122:9: note: considering unrolling loop 5 at BB 47
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:122:9: optimized: loop unrolled 3 times (header execution count 9800)
+poisson2d.c:118:25: note: considering unrolling loop 13 at BB 33
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)
+poisson2d.c:118:25: note: considering unrolling loop 9 at BB 30
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:112:9: note: considering unrolling loop 14 at BB 42
+poisson2d.c:43:5: note: considering unrolling loop 4 at BB 40
+poisson2d.c:108:25: note: considering unrolling loop 3 at BB 60
+poisson2d.c:88:5: note: considering unrolling loop 2 at BB 23
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:88:5: optimized: loop unrolled 3 times (header execution count 100)
+poisson2d.c:74:9: note: considering unrolling loop 11 at BB 12
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)
+poisson2d.c:72:5: note: considering unrolling loop 1 at BB 16
+poisson2d.c:164:1: missed: statement clobbers memory: __gcov_init (&amp;*.LPBX0);
+poisson2d.c:164:1: missed: statement clobbers memory: __gcov_exit ();
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100
+Job &lt;24908&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+libgcov profiling error:/gpfs/wolf/trn003/scratch/aherten//#autofs#nccsopen-svm1_home#aherten#SC19-Tutorial#3-Optimizing_POWER#Handson#Task1#poisson2d.gcda:overwriting an existing profile data with a different timestamp
+Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-0.47user 0.00system 0:00.48elapsed 98%CPU (0avgtext+0avgdata 10816maxresident)k
-256inputs+0outputs (0major+265minor)pagefaults 0swaps
+    0, 0.249490
+echo `date` &gt; /gpfs/wolf/trn003/scratch/aherten//.profile_generated
+gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c poisson2d_reference.o -o poisson2d_ref_info  -lm
+poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).
+poisson2d.c:161:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:159:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:158:5: missed:   not inlinable: main/33 -&gt; free/38, function body not available
+poisson2d.c:142:31: missed:   not inlinable: main/33 -&gt; printf/36, function body not available
+poisson2d.c:103:5: missed:   not inlinable: main/33 -&gt; __builtin_puts/37, function body not available
+poisson2d.c:96:5: missed:   not inlinable: main/33 -&gt; printf/36, function body not available
+poisson2d.c:78:29: missed:   not inlinable: main/33 -&gt; exp/35, function body not available
+poisson2d.c:68:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+poisson2d.c:67:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+poisson2d.c:65:41: missed:   not inlinable: main/33 -&gt; malloc/34, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -&gt; strtol/39, function body not available
+Unit growth for small function inlining: 207-&gt;207 (0%)
+
+Inlined 4 calls, eliminated 0 functions
+
+consider run-time aliasing test between *_84 and *_87
+consider run-time aliasing test between *_92 and *_97
+consider run-time aliasing test between *_104 and *_107
+consider run-time aliasing test between *_111 and *_115
+poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.
+poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.
+poisson2d.c:108:25: missed: couldn&#39;t vectorize loop
+poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized
+poisson2d.c:136:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:136:9: missed: Loop costings may not be worthwhile.
+poisson2d.c:131:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:131:9: missed: Loop costings may not be worthwhile.
+poisson2d.c:122:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);
+poisson2d.c:112:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:112:9: missed: not vectorized: control flow in loop.
+poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors
+poisson2d.c:88:5: missed: couldn&#39;t vectorize loop
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);
+poisson2d.c:72:5: missed: couldn&#39;t vectorize loop
+poisson2d.c:78:27: missed: not vectorized: complicated access pattern.
+poisson2d.c:74:9: missed: couldn&#39;t vectorize loop
+poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);
+poisson2d.c:43:5: note: vectorized 1 loops in function.
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);
+poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);
+/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);
+poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);
+poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);
+poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);
+poisson2d.c:96:5: missed: statement clobbers memory: printf (&#34;Jacobi relaxation calculation: max %d iterations on %d x %d mesh\n&#34;, iter_max_130, ny_139, nx_195);
+poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&amp;&#34;Calculate current execution.&#34;[0]);
+poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);
+poisson2d.c:142:31: missed: statement clobbers memory: printf (&#34;%5d, %0.6f\n&#34;, iter_237, error_219);
+poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);
+poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);
+poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);
+poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);
+poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);
+poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);
+poisson2d.c:136:9: note: considering unrolling loop 7 at BB 47
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)
+poisson2d.c:131:9: note: considering unrolling loop 6 at BB 44
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)
+poisson2d.c:122:9: note: considering unrolling loop 5 at BB 40
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:122:9: optimized: loop unrolled 7 times (header execution count 9701)
+poisson2d.c:118:25: note: considering unrolling loop 13 at BB 27
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)
+poisson2d.c:118:25: note: considering unrolling loop 9 at BB 24
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:112:9: note: considering unrolling loop 14 at BB 37
+poisson2d.c:43:5: note: considering unrolling loop 4 at BB 35
+poisson2d.c:108:25: note: considering unrolling loop 3 at BB 51
+poisson2d.c:88:5: note: considering unrolling loop 2 at BB 18
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:88:5: optimized: loop unrolled 7 times (header execution count 99)
+poisson2d.c:74:9: note: considering unrolling loop 11 at BB 9
+considering unrolling loop with constant number of iterations
+considering unrolling loop with runtime-computable number of iterations
+poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)
+poisson2d.c:72:5: note: considering unrolling loop 1 at BB 14
 </pre>
 </div>
 </div>
@@ -13766,7 +14264,11 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations)</p>
+<p>Comparing the annotations generated at plain <code>-Ofast</code> with those generated at <code>-Ofast</code> plus profile-directed feedback, we observe that the profile information enables many more optimizations.</p>
+<p>For instance, you will see annotations such as</p>
+
+<pre><code>poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)</code></pre>
+<p>The execution count is the number of times the node (here, the loop header) was executed at runtime. This information identifies the hotter paths and thereby facilitates additional optimizations.</p>
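+<p>To see at a glance which unrolled loops the profile considers hottest, the remarks can be ranked by their header execution count. This is a sketch under assumptions: the file name <code>unroll_remarks.txt</code> is hypothetical, and the sample lines are taken from the profile-guided build of <code>poisson2d.c</code>.</p>

```shell
#!/bin/sh
# Hypothetical file name; sample remarks from a build with -fopt-info-all and -fprofile-use.
cat > unroll_remarks.txt <<'EOF'
poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)
poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)
poisson2d.c:88:5: optimized: loop unrolled 7 times (header execution count 99)
poisson2d.c:112:9: note: considering unrolling loop 14 at BB 37
EOF

# Keep only the "loop unrolled" remarks and rewrite each as
# "<execution count> <file:line:column> x<unroll factor>", hottest loop first.
ranked=$(sed -n 's/^\([^:]*:[0-9]*:[0-9]*\): optimized: loop unrolled \([0-9]*\) times (header execution count \([0-9]*\))$/\3 \1 x\2/p' unroll_remarks.txt | sort -rn)
printf '%s\n' "$ranked"
```

+<p>The first line of the result names the loop with the highest execution count in the training run, which is usually the first place to look when judging whether the compiler's decisions match your expectations.</p>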
 
 </div>
 </div>
@@ -13794,15 +14296,19 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Task-2:-Software-Pretechting">Task 2:<a name="task2" /> Software Pretechting<a class="anchor-link" href="#Task-2:-Software-Pretechting">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>Study the difference of program execution time of different optimization levels with and without software prefetching.</p>
-<p>First, change directory to that of Task 2</p>
+<h2 id="Task-2:-Impact-of-Prefetching-on-Performance">Task 2:<a name="task2" /> Impact of Prefetching on Performance<a class="anchor-link" href="#Task-2:-Impact-of-Prefetching-on-Performance">&#182;</a></h2><h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><ul>
+<li>Study how program execution time differs between optimization levels, with and without software prefetching.</li>
+<li>Verify the impact by measuring cache counters with and without prefetching.</li>
+<li>Learn how to modify the contents of the DSCR (<em>Data Stream Control Register</em>) using the IBM XL compiler and study the impact of different DSCR values.</li>
+</ul>
+<p>But first, let's change to the directory of Task 2.</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[12]:</div>
+<div class="prompt input_prompt">In&nbsp;[85]:</div>
 <div class="inner_cell">
     <div class="input_area">
 <div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cd</span> ../Task2
@@ -13822,7 +14328,7 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task2
+<pre>/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task2
 </pre>
 </div>
 </div>
@@ -13834,25 +14340,28 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-A:-Running">Part A: Running<a class="anchor-link" href="#Part-A:-Running">&#182;</a></h3>
+<h3 id="Part-A:-Software-Prefetching">Part A: Software Prefetching<a class="anchor-link" href="#Part-A:-Software-Prefetching">&#182;</a></h3>
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>Look at the <a href="/edit/Task2/Makefile">Makefile</a> and work on the TODOs. Please implement compile flags as mentioned in the Makefile target name.</p>
-<p>Afterwards, compile each target with the following cells and submit them to the batch system. Follow along accordingly in the non-interactive version of this Notebook.</p>
+<p><strong>TASK</strong>: Look at the Makefile and work on the TODOs.</p>
+<ul>
+<li>First, generate a <code>-Ofast</code>-optimised binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!</li>
+<li>Modify the <code>Makefile</code> to add the option for software prefetching (<code>-fprefetch-loop-arrays</code>). Compare the performance of <code>-Ofast</code> with and without software prefetching.</li>
+</ul>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[13]:</div>
+<div class="prompt input_prompt">In&nbsp;[97]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_o3_pref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make clean
 </pre></div>
 
     </div>
@@ -13869,27 +14378,7 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm
-/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_pref  -lm
-bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_pref
-Job &lt;5037&gt; is submitted to default queue &lt;batch&gt;.
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
-&lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-1.12user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10880maxresident)k
-256inputs+0outputs (0major+265minor)pagefaults 0swaps
+<pre>rm -f poisson2d poisson2d*.o
 </pre>
 </div>
 </div>
@@ -13900,10 +14389,12 @@ Calculate current execution.
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[14]:</div>
+<div class="prompt input_prompt">In&nbsp;[88]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_ofast_pref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d <span class="nv">CC</span><span class="o">=</span>gcc
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
 </pre></div>
 
     </div>
@@ -13920,26 +14411,49 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_pref  -lm
-bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_pref
-Job &lt;5038&gt; is submitted to default queue &lt;batch&gt;.
+<pre>make: `poisson2d&#39; is up to date.
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24911&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+2.39user 0.01system 0:02.40elapsed 100%CPU (0avgtext+0avgdata 24256maxresident)k
+0inputs+0outputs (0major+480minor)pagefaults 0swaps
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d
+Job &lt;24912&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-0.77user 0.00system 0:00.77elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k
-256inputs+0outputs (0major+264minor)pagefaults 0swaps
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+
+ Performance counter stats for &#39;./poisson2d&#39;:
+
+        8271503902      cycles:u                                                    
+         481152478      r168a4:u                                                    
+
+       2.412224884 seconds time elapsed
+
 </pre>
 </div>
 </div>
@@ -13950,10 +14464,12 @@ Calculate current execution.
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[15]:</div>
+<div class="prompt input_prompt">In&nbsp;[98]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_o3_nopref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_pref <span class="nv">CC</span><span class="o">=</span>gcc
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
 </pre></div>
 
     </div>
@@ -13970,26 +14486,50 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_nopref  -lm
-bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_nopref
-Job &lt;5039&gt; is submitted to default queue &lt;batch&gt;.
+<pre>gcc -std=c99 -DUSE_DOUBLE -Ofast -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm
+cp poisson2d_pref poisson2d
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24919&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+1.92user 0.00system 0:01.93elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k
+256inputs+0outputs (0major+480minor)pagefaults 0swaps
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d
+Job &lt;24920&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-1.13user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10944maxresident)k
-256inputs+0outputs (0major+266minor)pagefaults 0swaps
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+
+ Performance counter stats for &#39;./poisson2d&#39;:
+
+        6586609284      cycles:u                                                    
+         459879452      r168a4:u                                                    
+
+       1.925399505 seconds time elapsed
+
 </pre>
 </div>
 </div>
@@ -13997,13 +14537,23 @@ Calculate current execution.
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p><strong>TASK</strong>: Repeat the experiment with the <code>-O3</code> flag. Have a look at the <code>Makefile</code> and the TODO marked in it: there is one place where you can easily switch <code>-Ofast</code>→<code>-O3</code>!</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[16]:</div>
+<div class="prompt input_prompt">In&nbsp;[100]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run_ofast_nopref
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d <span class="nv">CC</span><span class="o">=</span>gcc -B
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
 </pre></div>
 
     </div>
@@ -14020,59 +14570,65 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_nopref  -lm
-bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_nopref
-Job &lt;5040&gt; is submitted to default queue &lt;batch&gt;.
+<pre>gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   poisson2d.c  -o poisson2d  -lm
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24923&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-0.82user 0.00system 0:00.82elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k
-256inputs+0outputs (0major+265minor)pagefaults 0swaps
-</pre>
-</div>
-</div>
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k
+256inputs+0outputs (0major+479minor)pagefaults 0swaps
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d
+Job &lt;24924&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
 
-</div>
-</div>
+ Performance counter stats for &#39;./poisson2d&#39;:
 
-</div>
-<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div><div class="inner_cell">
-<div class="text_cell_render border-box-sizing rendered_html">
-<p>Do you notice the impact difference with optimization levels? It's always important to carefully study the interplay of flags.</p>
+       16445764669      cycles:u                                                    
+         645094089      r168a4:u                                                    
 
+       4.792567763 seconds time elapsed
+
+</pre>
 </div>
 </div>
-</div>
-<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div><div class="inner_cell">
-<div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-B:-Analysis-of-Instructions">Part B: Analysis of Instructions<a class="anchor-link" href="#Part-B:-Analysis-of-Instructions">&#182;</a></h3><p>Compilation with the software prefetching flag causes the compiler to generate the <code>__dcbt</code> and <code>__dcbtst</code>  instructions that prefetch memory values to L3.</p>
-<p>Verify it using <code>objdump -lSd</code> on each file (<code>poisson2d_o3_pref</code>, <code>poisson2d_ofast_pref</code>, <code>poisson2d_o3_nopref</code>, <code>poisson2d_ofast_nopref</code>). You might want to grep for <code>dcb</code>.</p>
 
 </div>
 </div>
+
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[19]:</div>
+<div class="prompt input_prompt">In&nbsp;[101]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;poisson2d_o3_pref&quot;</span><span class="p">,</span> <span class="s2">&quot;poisson2d_ofast_pref&quot;</span><span class="p">,</span> <span class="s2">&quot;poisson2d_o3_nopref&quot;</span><span class="p">,</span> <span class="s2">&quot;poisson2d_ofast_nopref&quot;</span><span class="p">]:</span>
-    <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;</span><span class="si">{}</span><span class="s2">:&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">f</span><span class="p">))</span>
-    <span class="n">objdump</span> <span class="o">-</span><span class="n">lSd</span> <span class="err">$</span><span class="n">f</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">dcb</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_pref <span class="nv">CC</span><span class="o">=</span>gcc -B
+<span class="o">!</span>make run
+<span class="o">!</span>make l3missstats
 </pre></div>
 
     </div>
@@ -14089,27 +14645,809 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>poisson2d_o3_pref:
-poisson2d_ofast_pref:
-    10000da0:	ec f1 00 7c 	dcbtst  0,r30
-    10000da4:	2c fa 00 7c 	dcbt    0,r31
-    10000da8:	2c 62 00 7c 	dcbt    0,r12
-    10000dac:	2c b2 00 7c 	dcbt    0,r22
-    10000dcc:	2c e2 00 7c 	dcbt    0,r28
-    10000dd0:	2c ea 00 7c 	dcbt    0,r29
-    100010b4:	2c 62 00 7c 	dcbt    0,r12
-    100010b8:	2c 5a 00 7c 	dcbt    0,r11
-    100010c4:	ec 19 00 7c 	dcbtst  0,r3
-    100010cc:	2c 22 00 7c 	dcbt    0,r4
-    100010d0:	2c ea 00 7c 	dcbt    0,r29
-    100010d4:	2c f2 00 7c 	dcbt    0,r30
-    100010dc:	2c fa 00 7c 	dcbt    0,r31
-poisson2d_o3_nopref:
-poisson2d_ofast_nopref:
-</pre>
-</div>
-</div>
-
+<pre>gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24925&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+4.74user 0.00system 0:04.74elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k
+0inputs+0outputs (0major+480minor)pagefaults 0swaps
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d
+Job &lt;24926&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+
+ Performance counter stats for &#39;./poisson2d&#39;:
+
+       16239159454      cycles:u                                                    
+         631061431      r168a4:u                                                    
+
+       4.730144897 seconds time elapsed
+
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Do you notice how the impact differs between optimization levels? At which optimization level does software prefetching help the most?</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Observing the results, we see that software prefetching helps at <code>-Ofast</code> (about 6.6 versus 8.3 billion cycles) but not at <code>-O3</code>. We can use the steps described in the next section to verify that the compiler has not inserted any software prefetch operations at <code>-O3</code> at all. The reason is that in the <code>-O3</code> binary the run time is dominated by calls to <code>__fmax</code>, which leads the compiler to conclude that whatever benefit software prefetching would bring is overshadowed by the cost of <code>fmax()</code>.
+Note that GCC may apply further loop optimizations, such as unrolling, when <code>-fprefetch-loop-arrays</code> is given.</p>
+
+</div>
+</div>
+</div>
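The comparison can be made quantitative from the `perf stat` output of the four runs above. The following is a minimal sketch and not part of the tutorial material: the dictionary layout and the helper name `ratio` are illustrative assumptions, while the counter values are copied from the outputs above (`r168a4` is the raw event requested by the `l3missstats` target).

```python
# Counter values copied from the four perf stat outputs above.
# The dictionary layout and helper name are illustrative assumptions.
runs = {
    ("Ofast", "nopref"): {"cycles": 8271503902, "r168a4": 481152478},
    ("Ofast", "pref"): {"cycles": 6586609284, "r168a4": 459879452},
    ("O3", "nopref"): {"cycles": 16445764669, "r168a4": 645094089},
    ("O3", "pref"): {"cycles": 16239159454, "r168a4": 631061431},
}


def ratio(level, counter):
    """Return the no-prefetch / prefetch ratio of one counter at one level."""
    return runs[(level, "nopref")][counter] / runs[(level, "pref")][counter]


for level in ("Ofast", "O3"):
    print(f"-{level}: cycles x{ratio(level, 'cycles'):.3f}, "
          f"r168a4 x{ratio(level, 'r168a4'):.3f}")
```

A ratio well above 1 means the build with `-fprefetch-loop-arrays` needed fewer cycles (or counted fewer events) than the build without it; a ratio close to 1 means the flag made little difference.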
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h3 id="Part-B:-Analysis-of-Instructions">Part B: Analysis of Instructions<a class="anchor-link" href="#Part-B:-Analysis-of-Instructions">&#182;</a></h3><p>Compilation of the <code>-Ofast</code> binary with the software prefetching flag causes the compiler to generate the <code>dcb*</code>  instructions that prefetch memory values to L3.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p><strong>TASK</strong>:
+Run <code>$(SC19_SUBMIT_CMD) objdump -lSd</code> on each binary file (<code>-O3</code>, <code>-Ofast</code> with prefetch/no prefetch).
+Look for instructions beginning with <code>dcb</code>.
+At which optimization levels does the compiler generate software prefetching instructions?</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[114]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make <span class="nv">CC</span><span class="o">=</span>gcc -B poisson2d_pref
+<span class="o">!</span>objdump -lSd ./poisson2d_pref &gt; poisson2d.dis
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>gcc -std=c99 -DUSE_DOUBLE -Ofast   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[116]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>grep dcb poisson2d.dis
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>    10000b28:	2c d2 00 7c 	dcbt    0,r26
+    10000b30:	2c ba 00 7c 	dcbt    0,r23
+    10000b38:	2c b2 00 7c 	dcbt    0,r22
+    10000b50:	2c d2 00 7c 	dcbt    0,r26
+    10000b58:	ec b9 00 7c 	dcbtst  0,r23
+    10000b80:	2c d2 00 7c 	dcbt    0,r26
+    10000e64:	2c 92 00 7c 	dcbt    0,r18
+    10000e68:	2c 9a 00 7c 	dcbt    0,r19
+    10000e6c:	2c a2 00 7c 	dcbt    0,r20
+    10000e70:	2c aa 00 7c 	dcbt    0,r21
+    10000e7c:	2c b2 00 7c 	dcbt    0,r22
+    10000e80:	2c d2 00 7c 	dcbt    0,r26
+    10000e94:	ec b9 00 7c 	dcbtst  0,r23
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
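Instead of reading the `grep` output by eye, the prefetch instructions can also be tallied per mnemonic. The following is a minimal sketch under the assumption that the disassembly uses the tab-separated `objdump` layout seen above (address, raw bytes, instruction); the function name `count_prefetch_ops` is hypothetical.

```python
from collections import Counter


def count_prefetch_ops(disassembly):
    """Count dcb* mnemonics in objdump-style disassembly text."""
    counts = Counter()
    for line in disassembly.splitlines():
        # objdump separates address, raw bytes, and instruction with tabs.
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # source lines, labels, and blank lines
        parts = fields[2].split()
        mnemonic = parts[0] if parts else ""
        if mnemonic.startswith("dcb"):
            counts[mnemonic] += 1
    return counts


# Three lines copied from the grep output above.
sample = (
    "    10000b28:\t2c d2 00 7c \tdcbt    0,r26\n"
    "    10000b58:\tec b9 00 7c \tdcbtst  0,r23\n"
    "    10000e64:\t2c 92 00 7c \tdcbt    0,r18\n"
)
print(count_prefetch_ops(sample))
```

To process a whole listing, pass the contents of `poisson2d.dis` to the function. The `grep` output above contains 11 `dcbt` and 2 `dcbtst` lines for the `-Ofast` build.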
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h3 id="Part-C:-Changing-Values-of-DSCR-via-compiler-flags">Part C: Changing Values of DSCR via compiler flags<a class="anchor-link" href="#Part-C:-Changing-Values-of-DSCR-via-compiler-flags">&#182;</a></h3><p>This task requires using the IBM XL compiler. It should be already in your environment.</p>
+<p>We saw the impact of software prefetching in the previous subsection.
+In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance.
+In this exercise we look at compiler options that modify the value of the Data Stream Control Register (DSCR), which controls the aggressiveness of hardware prefetching. The DSCR can also be used to turn off hardware prefetching entirely.</p>
+<p>The IBM XL compiler has an option <code>-qprefetch=dscr=&lt;val&gt;</code> that can be used for this purpose.
+Compiling with <code>-qprefetch=dscr=1</code> turns off the prefetcher. Other values, such as <code>-qprefetch=dscr=4</code> or <code>-qprefetch=dscr=7</code>, control the aggressiveness of prefetching.</p>
+<p>For this exercise we use <code>make CC=xlc_r</code> to illustrate the performance impact.</p>
+<p><strong>Task</strong>: Generate an XL-compiled binary using the following cells. After you have generated a baseline, edit the <code>Makefile</code>: add <code>-qprefetch=dscr=1</code> to the <code>CFLAGS</code>, rebuild the application, and note the performance. Which one is faster?</p>
+<p>In general, applications benefit from the default setting of the DSCR (<code>-qprefetch=dscr=0</code>). However, certain applications benefit from having prefetching turned off.</p>
+<p>Note that the best DSCR value depends strongly on the application: a value that works well for one application may not help another.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Measure the performance of the application compiled with XL at the default DSCR value.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[117]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make <span class="nv">CC</span><span class="o">=</span>xlc_r -B poisson2d
+<span class="o">!</span>make run
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  poisson2d.c -o poisson2d  -lm
+    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24927&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 50.149062
+  200, 99.849327
+  300, 149.352369
+  400, 198.659746
+  500, 247.773000
+  600, 296.693652
+  700, 345.423208
+  800, 393.963155
+  900, 442.314962
+2.26user 0.00system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k
+256inputs+0outputs (0major+477minor)pagefaults 0swaps
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Measure the performance of the application compiled with XL with the hardware prefetcher turned off (<code>-qprefetch=dscr=1</code>).</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[9]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d_dscr <span class="nv">CC</span><span class="o">=</span>xlc_r -B
+<span class="o">!</span>make run
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  -qprefetch=dscr=1 poisson2d.c -o poisson2d_dscr  -lm
+    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d
+Job &lt;24929&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+4.58user 0.00system 0:04.59elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k
+0inputs+0outputs (0major+476minor)pagefaults 0swaps
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Does the hardware prefetcher help this application? How much impact do you see when you turn it off?</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>The DSCR register controls the operation of the hardware prefetcher on POWER9. It can be modified from the command line with <code>ppc64_cpu --dscr=&lt;value&gt;</code>; however, this requires administrator privileges. IBM XL offers a compiler flag to set the value at compile time: <code>-qprefetch=dscr=1</code> turns off the prefetcher. In the results above, the run without the hardware prefetcher takes about twice as long as the run with default prefetching (4.58 s versus 2.26 s user time). We can conclude that hardware prefetching helps the Jacobi application.</p>
+
+</div>
+</div>
+</div>
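The slowdown factor can be read directly from the `time` summary lines of the two XL runs. The following is a minimal sketch; the helper name `user_seconds` is hypothetical, and the two strings are shortened copies of the summary lines printed above.

```python
import re


def user_seconds(time_line):
    """Extract the user CPU time in seconds from a `time` summary line."""
    match = re.search(r"([0-9.]+)user", time_line)
    if match is None:
        raise ValueError("no 'user' field found")
    return float(match.group(1))


# Summary lines copied (shortened) from the two XL runs above.
default_dscr = "2.26user 0.00system 0:02.27elapsed 99%CPU"
prefetch_off = "4.58user 0.00system 0:04.59elapsed 99%CPU"

slowdown = user_seconds(prefetch_off) / user_seconds(default_dscr)
print(f"slowdown without hardware prefetching: {slowdown:.2f}x")
```

Each configuration was run only once here, so the factor is indicative rather than a careful benchmark; repeating each run a few times would give a more reliable figure.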
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h4 id="References">References<a class="anchor-link" href="#References">&#182;</a></h4><ol>
+<li><a href="https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html">https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html</a></li>
+<li><a href="https://www.gnu.org/software/gcc/projects/prefetch.html">https://www.gnu.org/software/gcc/projects/prefetch.html</a></li>
+<li><a href="https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0">https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0</a></li>
+</ol>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p><a href="#top">Back to Top</a></p>
+<hr>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h2 id="Task-3:-OpenMP">Task 3: OpenMP<a class="anchor-link" href="#Task-3:-OpenMP">&#182;</a></h2><p><a name="task3"></a></p>
+<h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>We add OpenMP shared-memory parallelism to the application. We also study how binding the threads of the multi-threaded process to certain cores affects application performance. We do this study for both the GCC and XL compilers in order to learn which options need to be used.
+First, we need to change into the Task3 directory. For Task 3, <code>poisson2d.c</code> additionally invokes <code>poisson2d_reference</code>, an exact copy of the main Jacobi loop. We parallelize only the main loop, not <code>poisson2d_reference</code>. The reported speedup is the performance gain of the main loop relative to the reference loop.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[10]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cd</span> ../Task3
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task3
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h3 id="Part-A:-Implement-OpenMP-Pragmas;-Compilation">Part A: Implement OpenMP Pragmas; Compilation<a class="anchor-link" href="#Part-A:-Implement-OpenMP-Pragmas;-Compilation">&#182;</a></h3><p><strong>Task</strong>: Please add the correct OpenMP directives to <code>poisson2d.c</code> and the compilation flags to the <code>Makefile</code> to enable OpenMP with the GCC and XL compilers.</p>
+<ul>
+<li><strong>Directives</strong>: Look at the TODOs in <a href="poisson2d.c"><code>poisson2d.c</code></a> to add OpenMP parallelism. The pragma in question is <code>#pragma omp parallel for</code> (and, in one place, <code>#pragma omp parallel for reduction(max:error)</code> – can you guess where?)</li>
+<li><strong>Compilation</strong>: Please add compilation flags enabling OpenMP in GCC and XL to the <code>Makefile</code>. For GCC, we need to add <code>-fopenmp</code> and the application needs to be linked with <code>-lgomp</code>. For XL, we need to add <code>-qsmp=omp</code> to the list of compilation flags. </li>
+</ul>
+<p>Afterwards, compile and run the application with the following commands.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[39]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d <span class="nv">CC</span><span class="o">=</span>gcc
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>gcc -c -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp   poisson2d_reference.c -o poisson2d_reference.o -lm
+gcc -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp  poisson2d.c poisson2d_reference.o -o poisson2d  -lm 
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>The command to submit a job to the batch system is prepared in the environment variable <code>$SC19_SUBMIT_CMD</code>; use it together with <code>eval</code>. The following cell shows how to invoke the application through the batch system.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[40]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC19_SUBMIT_CMD</span> ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>Job &lt;24951&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
+Calculate reference solution and time with serial CPU execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+Calculate current execution.
+    0, 0.249995
+  100, 0.248997
+  200, 0.248007
+  300, 0.247025
+  400, 0.246050
+  500, 0.245084
+  600, 0.244124
+  700, 0.243173
+  800, 0.242228
+  900, 0.241291
+1000x1000: Ref:   4.7430 s, This:   3.9363 s, speedup:     1.20
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>In order to run the parallel application, we need to set the number of threads using <code>OMP_NUM_THREADS</code>.
+What is the best performance you can reach by setting the number of threads via <code>OMP_NUM_THREADS=N</code> with <code>N</code> being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.<br>
+We added <code>--bind none</code> to prevent <code>jsrun</code>, the job launcher of Ascent, from overlaying binding options. Also, we use <code>-c ALL_CPUS</code> to make all CPUs on the compute nodes available to you.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[41]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">1</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   4.7288 s, This:   4.9791 s, speedup:     0.95
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[42]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">2</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   4.7125 s, This:   2.4914 s, speedup:     1.89
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[35]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.1065 s, This:   1.3836 s, speedup:     1.52
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[21]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">8</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.3868 s, This:   0.5272 s, speedup:     4.53
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[22]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">10</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.3912 s, This:   0.4612 s, speedup:     5.18
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[23]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">20</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.3864 s, This:   0.4037 s, speedup:     5.91
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[24]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">40</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.3773 s, This:   0.3045 s, speedup:     7.81
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[25]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">80</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.3819 s, This:   0.3081 s, speedup:     7.73
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<h3 id="Part-B:-Bindings">Part B: Bindings<a class="anchor-link" href="#Part-B:-Bindings">&#182;</a></h3><p>Different CPU architectures and models come with different configurations of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!</p>
+<p>There are applications which can be used to determine the configuration of the processor. Among those are:</p>
+<ul>
+<li><code>lscpu</code>: Can be used to determine the number of sockets, number of cores, and number of threads. It gives a very good overview and is available on most Linux systems.</li>
+<li><code>ppc64_cpu --smt</code>: Specifically for POWER, this tool can give information about the number of simultaneous threads running per core (<em>SMT</em>, Simultaneous Multi-Threading).</li>
+</ul>
+<p>Run <code>ppc64_cpu --smt</code> to find out about the threading configuration of Ascent!</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[55]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC19_SUBMIT_CMD</span> ppc64_cpu --smt
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>Job &lt;24465&gt; is submitted to default queue &lt;batch&gt;.
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+SMT=4
+</pre>
+</div>
+</div>
+
 </div>
 </div>
 
@@ -14117,18 +15455,22 @@ poisson2d_ofast_nopref:
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>If you feel up to the task, you can study the number of L3 cache misses using the corresponding performance counter, <code>PM_L3_MISS</code>. Either use your knowledge from Hands-On 1, or use the following call to <code>perf</code>, in which we already converted the named counter to a raw counter address.</p>
+<p>There are more sources of information available:</p>
+<ul>
+<li><code>/proc/cpuinfo</code>: Holds information about virtual cores, including model and clock speed. Available on most Linux systems. Usually read with <code>cat</code>.</li>
+<li><code>/sys/devices/system/cpu/cpu0/topology/thread_siblings_list</code>: Holds information about the thread siblings of a given CPU (<code>cpu0</code> in this case). Use it to find out which thread is mapped to which core.</li>
+</ul>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[35]:</div>
+<div class="prompt input_prompt">In&nbsp;[36]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;poisson2d_ofast_nopref&quot;</span><span class="p">,</span> <span class="s2">&quot;poisson2d_ofast_pref&quot;</span><span class="p">]:</span>
-    <span class="o">!</span><span class="nb">eval</span> <span class="nv">$$</span>SC18_SUBMIT_CMD perf stat -e cycles,r168a4 ./<span class="nv">$f</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">$$</span>SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
+<span class="o">!</span><span class="nv">$$</span>SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
 </pre></div>
 
     </div>
@@ -14145,54 +15487,14 @@ poisson2d_ofast_nopref:
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>Job &lt;5048&gt; is submitted to default queue &lt;batch&gt;.
+<pre>Job &lt;24949&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-
- Performance counter stats for &#39;./poisson2d_ofast_nopref&#39;:
-
-        2829292169      cycles:u                                                    
-         136018637      r168a4:u                                                    
-
-       0.826136863 seconds time elapsed
-
-Job &lt;5049&gt; is submitted to default queue &lt;batch&gt;.
+0-3
+Job &lt;24950&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-
- Performance counter stats for &#39;./poisson2d_ofast_pref&#39;:
-
-        2654990243      cycles:u                                                    
-         128824827      r168a4:u                                                    
-
-       0.775593651 seconds time elapsed
-
+4-7
 </pre>
 </div>
 </div>
@@ -14204,10 +15506,10 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h4 id="References">References<a class="anchor-link" href="#References">&#182;</a></h4><ol>
-<li><a href="https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html">https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html</a></li>
-<li><a href="https://www.gnu.org/software/gcc/projects/prefetch.html">https://www.gnu.org/software/gcc/projects/prefetch.html</a></li>
-</ol>
+<p>OpenMP defines environment variables that hold across compilers to specify the binding of threads to cores. See, for instance, the <a href="https://www.openmp.org/spec-html/5.0/openmpse53.html">OMP_PLACES environment variable</a>. There is also a GNU-specific variable that controls affinity: <code>GOMP_CPU_AFFINITY</code>. It only takes effect for binaries built with GCC, but it serves the same function as <code>OMP_PLACES</code>.</p>
+<p><strong>Task</strong>: Run the OpenMP-enabled application from Part A with different binding configurations. Make sure to run at least a) binding all threads to a single core and b) binding threads to different cores.</p>
+<p>Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.</p>
+<p>What's your maximum speedup?</p>
 
 </div>
 </div>
@@ -14215,8 +15517,7 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p><a href="#top">Back to Top</a></p>
-<hr>
+<p>We run two configurations: 1) binding all threads to the same core, and 2) binding each thread to a different core. Binding the threads to different cores yields the higher speedup.</p>
 
 </div>
 </div>
@@ -14224,19 +15525,19 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h2 id="Task-3:-OpenMP">Task 3: OpenMP<a class="anchor-link" href="#Task-3:-OpenMP">&#182;</a></h2><p><a name="task3"></a></p>
-<h3 id="Overview">Overview<a class="anchor-link" href="#Overview">&#182;</a></h3><p>We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-thread processes to certain cores.</p>
-<p>First, we need to change directory to that of Task3.</p>
+<p>Using <code>OMP_PLACES</code> for binding, and using some magical Python-Bash interplay:</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[1]:</div>
+<div class="prompt input_prompt">In&nbsp;[43]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">cd</span> ../Task3
+<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">affinity</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;</span><span class="si">{0}</span><span class="s2">,</span><span class="si">{1}</span><span class="s2">,</span><span class="si">{2}</span><span class="s2">,</span><span class="si">{3}</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;</span><span class="si">{0}</span><span class="s2">,</span><span class="si">{5}</span><span class="s2">,</span><span class="si">{9}</span><span class="s2">,</span><span class="si">{13}</span><span class="s2">&quot;</span><span class="p">]:</span>
+    <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Affinity: </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">affinity</span><span class="p">))</span>
+    <span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">OMP_PLACES</span><span class="o">=</span><span class="nv">$affinity</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>  <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
 </pre></div>
 
     </div>
@@ -14253,7 +15554,16 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task3
+<pre>Affinity: {0},{1},{2},{3}
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+  OMP_PLACES = &#39;{0},{1},{2},{3}&#39;
+1000x1000: Ref:   4.7315 s, This:   3.9090 s, speedup:     1.21
+Affinity: {0},{5},{9},{13}
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+  OMP_PLACES = &#39;{0},{5},{9},{13}&#39;
+1000x1000: Ref:   4.6485 s, This:   1.2829 s, speedup:     3.62
 </pre>
 </div>
 </div>
@@ -14265,23 +15575,19 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-A:-Implement-OpenMP-Pragmas;-Compilation">Part A: Implement OpenMP Pragmas; Compilation<a class="anchor-link" href="#Part-A:-Implement-OpenMP-Pragmas;-Compilation">&#182;</a></h3><p><strong>Task</strong>: Please add the correct OpenMP pragmas to the source code and compilations flags to enable OpenMP.</p>
-<ul>
-<li><strong>pragmas</strong>: Look at the TODOs in <a href="/edit/Task3/poisson2d.c"><code>poisson2d.c</code></a> to add OpenMP parallelism. The pragmas in question are <code>#pragma  omp parallel for</code></li>
-<li><strong>Compilation</strong>: Please add compilation flags enabling OpenMP in GCC to the <a href="/edit/Task3/Makefile">Makefile</a>. The flag in question is <code>-fopenmp</code>.</li>
-</ul>
-<p>Edit the files with the links above if you are running the interactive version of the Notebook or navigate to <code>poisson2d.c</code> and <code>Makefile</code> yourself in case you run the non-interactive version.</p>
-<p>Afterwards, compile and run the application with the following cells. Non-interactive: Follow along accordingly in the shell.</p>
+<p>Here, we carry out the same experiment using <code>GOMP_CPU_AFFINITY</code>, which results in the same <code>OMP_PLACES</code> setting. Again we run two configurations: 1) binding all threads to the same core, and 2) binding each thread to a different core. Binding the threads to different cores again yields the higher speedup.</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[37]:</div>
+<div class="prompt input_prompt">In&nbsp;[44]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make poisson2d
+<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">affinity</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;0,1,2,3&quot;</span><span class="p">,</span> <span class="s2">&quot;0,5,9,13&quot;</span><span class="p">]:</span>
+    <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Affinity: </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">affinity</span><span class="p">))</span>
+    <span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">GOMP_CPU_AFFINITY</span><span class="o">=</span><span class="nv">$affinity</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
 </pre></div>
 
     </div>
@@ -14298,8 +15604,16 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  poisson2d_reference.c -o poisson2d_reference.o  -lm
-/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm
+<pre>Affinity: 0,1,2,3
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+  OMP_PLACES = &#39;{0},{1},{2},{3}&#39;
+1000x1000: Ref:   2.3964 s, This:   2.1361 s, speedup:     1.12
+Affinity: 0,5,9,13
+&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+  OMP_PLACES = &#39;{0},{5},{9},{13}&#39;
+1000x1000: Ref:   2.3925 s, This:   0.7030 s, speedup:     3.40
 </pre>
 </div>
 </div>
@@ -14307,13 +15621,24 @@ Calculate current execution.
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Great!</p>
+<p>If you still have time: The same experiments can be repeated with the IBM XL compiler.
+The XL compiler flag that enables OpenMP parallelism is <code>-qsmp=omp</code>.</p>
+<p><strong>Task</strong>: Add the OpenMP flag to the Makefile and build the XL binaries with OpenMP. Then run the application with various numbers of threads and note the speedup.</p>
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[40]:</div>
+<div class="prompt input_prompt">In&nbsp;[44]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make run
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>make <span class="nv">CC</span><span class="o">=</span>xlc_r -B run
 </pre></div>
 
     </div>
@@ -14330,26 +15655,40 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d
-Job &lt;5052&gt; is submitted to default queue &lt;batch&gt;.
+<pre>xlc_r -c -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp    poisson2d_reference.c -o poisson2d_reference.o -lm 
+    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.
+xlc_r -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp   poisson2d.c poisson2d_reference.o -o poisson2d -lm
+    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.
+bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS   time ./poisson2d
+Job &lt;24956&gt; is submitted to default queue &lt;batch&gt;.
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh
+Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh
 Calculate reference solution and time with serial CPU execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
+    0, 0.249995
+  100, 50.149062
+  200, 99.849327
+  300, 149.352369
+  400, 198.659746
+  500, 247.773000
+  600, 296.693652
+  700, 345.423208
+  800, 393.963155
+  900, 442.314962
 Calculate current execution.
-    0, 0.249980
-  100, 0.246028
-  200, 0.242198
-  300, 0.238487
-  400, 0.234887
-500x500: Ref:   0.2571 s, This:   0.2946 s, speedup:     0.87
-1.48user 0.00system 0:00.56elapsed 263%CPU (0avgtext+0avgdata 9664maxresident)k
-0inputs+0outputs (0major+273minor)pagefaults 0swaps
+    0, 0.249995
+  100, 50.149062
+  200, 99.849327
+  300, 149.352369
+  400, 198.659746
+  500, 247.773000
+  600, 296.693652
+  700, 345.423208
+  800, 393.963155
+  900, 442.314962
+1000x1000: Ref:   5.6783 s, This:   2.6528 s, speedup:     2.14
+21.56user 6.18system 0:08.37elapsed 331%CPU (0avgtext+0avgdata 23040maxresident)k
+3200inputs+0outputs (2major+1098minor)pagefaults 0swaps
 </pre>
 </div>
 </div>
@@ -14361,17 +15700,25 @@ Calculate current execution.
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>The command to submit a job to the batch system is prepared in an environment variable <code>$SC18_SUBMIT_CMD</code>; use it together with <code>eval</code>. In the following cell, it is shown how to increase the work of the application.</p>
+<p>Run the parallel application with a varying number of threads (<code>OMP_NUM_THREADS</code>) and note the performance improvement.</p>
+
+</div>
+</div>
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Just as with the GCC binary, the speedup grows with the number of threads up to a certain point, beyond which the benefit tapers off.</p>
 
 </div>
 </div>
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[3]:</div>
+<div class="prompt input_prompt">In&nbsp;[28]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC18_SUBMIT_CMD</span> ./poisson2d <span class="m">1000</span> <span class="m">1000</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">1</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
 </pre></div>
 
     </div>
@@ -14388,33 +15735,9 @@ Calculate current execution.
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>Job &lt;5344&gt; is submitted to default queue &lt;batch&gt;.
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-Jacobi relaxation calculation: max 1000 iterations on 1000 x 100 mesh
-Calculate reference solution and time with serial CPU execution.
-    0, 0.249743
-  100, 0.210080
-  200, 0.184635
-  300, 0.166526
-  400, 0.152783
-  500, 0.141890
-  600, 0.132978
-  700, 0.125511
-  800, 0.119142
-  900, 0.113632
-Calculate current execution.
-    0, 0.249743
-  100, 0.210080
-  200, 0.184635
-  300, 0.166526
-  400, 0.152783
-  500, 0.141890
-  600, 0.132978
-  700, 0.125511
-  800, 0.119142
-  900, 0.113632
-1000x100: Ref:   1.9872 s, This:   0.2385 s, speedup:     8.33
+1000x1000: Ref:   2.2561 s, This:   2.6432 s, speedup:     0.85
 </pre>
 </div>
 </div>
@@ -14423,23 +15746,45 @@ Calculate current execution.
 </div>
 
 </div>
-<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div><div class="inner_cell">
-<div class="text_cell_render border-box-sizing rendered_html">
-<p>What is the best performance you can reach by setting the number of threads via <code>OMP_NUM_THREADS=N</code> with <code>N</code> being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.<br>
-We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler of Ascent, from overlaying binding options. Also, we use <code>-c ALL_CPUS</code> to make all CPUs on the compute nodes available to you.</p>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[29]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">2</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.3071 s, This:   1.5343 s, speedup:     1.50
+</pre>
+</div>
+</div>
 
 </div>
 </div>
+
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[23]:</div>
+<div class="prompt input_prompt">In&nbsp;[30]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">omp_num</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">40</span><span class="p">]:</span>
-    <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Threads: </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">omp_num</span><span class="p">))</span>
-    <span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="nv">$omp_num</span> <span class="nv">$$</span>SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
 </pre></div>
 
     </div>
@@ -14456,34 +15801,42 @@ We added <code>--bind none</code> to prevent <code>jsrun</code>, the scheduler o
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>Threads: 1
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
-&lt;&lt;Starting on login1&gt;&gt;
-1000x1000: Ref:   2.3037 s, This:   2.8420 s, speedup:     0.81
-Threads: 2
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
-&lt;&lt;Starting on login1&gt;&gt;
-1000x1000: Ref:   2.2998 s, This:   1.4320 s, speedup:     1.61
-Threads: 4
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
-&lt;&lt;Starting on login1&gt;&gt;
-1000x1000: Ref:   2.3135 s, This:   0.7168 s, speedup:     3.23
-Threads: 8
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
-&lt;&lt;Starting on login1&gt;&gt;
-1000x1000: Ref:   2.3145 s, This:   0.5278 s, speedup:     4.39
-Threads: 10
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
-&lt;&lt;Starting on login1&gt;&gt;
-1000x1000: Ref:   2.3153 s, This:   0.4848 s, speedup:     4.78
-Threads: 20
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-1000x1000: Ref:   2.3190 s, This:   0.2016 s, speedup:    11.50
-Threads: 40
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
+1000x1000: Ref:   2.2617 s, This:   0.6936 s, speedup:     3.26
+</pre>
+</div>
+</div>
+
+</div>
+</div>
+
+</div>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[31]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">8</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-1000x1000: Ref:   2.3243 s, This:   0.3057 s, speedup:     7.60
+1000x1000: Ref:   2.2728 s, This:   0.3402 s, speedup:     6.68
 </pre>
 </div>
 </div>
@@ -14492,26 +15845,45 @@ Threads: 40
 </div>
 
 </div>
-<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div><div class="inner_cell">
-<div class="text_cell_render border-box-sizing rendered_html">
-<h3 id="Part-B:-Bindings">Part B: Bindings<a class="anchor-link" href="#Part-B:-Bindings">&#182;</a></h3><p>Different CPU architectures and models come with different configuration of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!</p>
-<p>There are applications which can be used to determine the configuration of the processor. Among those are:</p>
-<ul>
-<li><code>lscpu</code>: Can be used to determine the number of sockets, number of cores, and numb of threads. It gives a very good overview and is available on most Linux systems.</li>
-<li><code>ppc64_cpu --smt</code>: Specifically for POWER, this tool can give information about the number of simulations threads running per core (<em>SMT</em>, Simulataion Multi-Threading).</li>
-</ul>
-<p>Run <code>ppc64_cpu --smt</code> to find out about the threading configuration of Ascent!</p>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[45]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">10</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
+
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.1678 s, This:   0.2869 s, speedup:     7.56
+</pre>
+</div>
+</div>
 
 </div>
 </div>
+
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[48]:</div>
+<div class="prompt input_prompt">In&nbsp;[33]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nb">eval</span> <span class="nv">$SC18_SUBMIT_CMD</span> ppc64_cpu --smt
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">20</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
 </pre></div>
 
     </div>
@@ -14528,10 +15900,9 @@ Threads: 40
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>Job &lt;5076&gt; is submitted to default queue &lt;batch&gt;.
-&lt;&lt;Waiting for dispatch ...&gt;&gt;
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-SMT=4
+1000x1000: Ref:   2.2813 s, This:   0.1452 s, speedup:    15.71
 </pre>
 </div>
 </div>
@@ -14540,25 +15911,45 @@ SMT=4
 </div>
 
 </div>
-<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
-</div><div class="inner_cell">
-<div class="text_cell_render border-box-sizing rendered_html">
-<p>There are more sources information available</p>
-<ul>
-<li><code>/proc/cpuinfo</code>: Holds information about virtual cores, including model and clock speed. Available on most Linux system. Usually used together with <code>cat</code></li>
-<li><code>/sys/devices/system/cpu/cpu0/topology/thread_siblings_list</code>: Holds information about thread siblings for given CPU core (<code>cpu0</code> in this case). Use it to find out which thread is mapped to which core.</li>
-</ul>
+<div class="cell border-box-sizing code_cell rendered">
+<div class="input">
+<div class="prompt input_prompt">In&nbsp;[34]:</div>
+<div class="inner_cell">
+    <div class="input_area">
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">40</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
+</pre></div>
+
+    </div>
+</div>
+</div>
 
+<div class="output_wrapper">
+<div class="output">
+
+
+<div class="output_area">
+
+    <div class="prompt"></div>
+
+
+<div class="output_subarea output_stream output_stdout output_text">
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.3284 s, This:   0.0981 s, speedup:    23.75
+</pre>
 </div>
 </div>
+
+</div>
+</div>
+
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[49]:</div>
+<div class="prompt input_prompt">In&nbsp;[35]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span>cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
-<span class="o">!</span>cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
+<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">80</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span> <span class="p">|</span> grep speedup 
 </pre></div>
 
     </div>
@@ -14575,8 +15966,9 @@ SMT=4
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>0-3
-4-7
+<pre>&lt;&lt;Waiting for dispatch ...&gt;&gt;
+&lt;&lt;Starting on login1&gt;&gt;
+1000x1000: Ref:   2.2918 s, This:   0.1439 s, speedup:    15.92
 </pre>
 </div>
 </div>
@@ -14588,9 +15980,10 @@ SMT=4
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<p>There are various environment variables available within OpenMP (and GCC) to specify binding of threads to cores. See, for instance, the <a href="https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html">online documentation of GCC libgomp</a>. Examples are <code>OMP_PLACES</code> or <code>GOMP_CPU_AFFINITY</code>.</p>
+<p>Now we repeat the thread-binding exercise for the XL binary. <code>OMP_PLACES</code> is a standard OpenMP environment variable, so it also applies to the XL binary. <code>GOMP_CPU_AFFINITY</code> is specific to GCC's OpenMP runtime (libgomp) and therefore cannot be used to set the binding here.</p>
 <p><strong>Task</strong>: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.</p>
 <p>Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.</p>
+<p>Note that the cell mixes Python with Bash (lines starting with <code>!</code>). Because of this, a Bash environment variable must be written with a doubled dollar sign (<code>$$</code>), while a single <code>$</code> inserts the value of a Python variable.</p>
 <p>What's your maximum speedup?</p>
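The two binding configurations asked for in the task can be written as <code>OMP_PLACES</code> strings by hand, or generated. The following is a minimal sketch, not part of the tutorial material: the helper names are hypothetical, and it assumes SMT=4 with hardware threads numbered consecutively per core (0-3 on the first core, 4-7 on the second, and so on). Check this against <code>ppc64_cpu --smt</code> and <code>thread_siblings_list</code> on your machine.

```python
def places_same_core(core, n_threads, smt=4):
    """Places for n_threads hardware threads that all sit on one core.

    Assumes hardware threads are numbered consecutively per core.
    """
    if n_threads > smt:
        raise ValueError("a core only has {} hardware threads".format(smt))
    first = core * smt
    return ",".join("{%d}" % (first + i) for i in range(n_threads))


def places_one_per_core(n_threads, smt=4):
    """Places that put each thread on the first hardware thread of its own core."""
    return ",".join("{%d}" % (core * smt) for core in range(n_threads))


print(places_same_core(0, 4))    # all four threads share core 0
print(places_one_per_core(4))    # four threads on four different cores
```

Under the same numbering assumption, the list <code>{0},{5},{9},{13}</code> used in the cell below also places the four threads on four different cores; it simply does not pick the first hardware thread of each core.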
 
 </div>
@@ -14598,12 +15991,12 @@ SMT=4
 </div>
 <div class="cell border-box-sizing code_cell rendered">
 <div class="input">
-<div class="prompt input_prompt">In&nbsp;[24]:</div>
+<div class="prompt input_prompt">In&nbsp;[36]:</div>
 <div class="inner_cell">
     <div class="input_area">
-<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">affinity</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;0,1,2,3&quot;</span><span class="p">,</span> <span class="s2">&quot;0,5,9,13&quot;</span><span class="p">]:</span>
+<div class=" highlight hl-ipython3"><pre><span></span><span class="k">for</span> <span class="n">affinity</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">&quot;</span><span class="si">{0}</span><span class="s2">,</span><span class="si">{1}</span><span class="s2">,</span><span class="si">{2}</span><span class="s2">,</span><span class="si">{3}</span><span class="s2">&quot;</span><span class="p">,</span> <span class="s2">&quot;</span><span class="si">{0}</span><span class="s2">,</span><span class="si">{5}</span><span class="s2">,</span><span class="si">{9}</span><span class="s2">,</span><span class="si">{13}</span><span class="s2">&quot;</span><span class="p">]:</span>
     <span class="nb">print</span><span class="p">(</span><span class="s2">&quot;Affinity: </span><span class="si">{}</span><span class="s2">&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">affinity</span><span class="p">))</span>
-    <span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">GOMP_CPU_AFFINITY</span><span class="o">=</span><span class="nv">$affinity</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">100</span> <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
+    <span class="o">!</span><span class="nb">eval</span> <span class="nv">OMP_DISPLAY_ENV</span><span class="o">=</span><span class="nb">true</span> <span class="nv">OMP_PLACES</span><span class="o">=</span><span class="nv">$affinity</span> <span class="nv">OMP_NUM_THREADS</span><span class="o">=</span><span class="m">4</span> <span class="nv">$$</span>SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d <span class="m">1000</span> <span class="m">1000</span> <span class="m">1000</span>  <span class="p">|</span> grep <span class="s2">&quot;OMP_PLACES\|speedup&quot;</span>
 </pre></div>
 
     </div>
@@ -14620,16 +16013,16 @@ SMT=4
 
 
 <div class="output_subarea output_stream output_stdout output_text">
-<pre>Affinity: 0,1,2,3
+<pre>Affinity: {0},{1},{2},{3}
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-  OMP_PLACES = &#39;{0},{1},{2},{3}&#39;
-1000x100: Ref:   1.9854 s, This:   0.2326 s, speedup:     8.53
-Affinity: 0,5,9,13
+  OMP_PLACES=&#39;{0},{1},{2},{3}&#39; custom
+1000x1000: Ref:   5.9792 s, This:   2.4122 s, speedup:     2.48
+Affinity: {0},{5},{9},{13}
 &lt;&lt;Waiting for dispatch ...&gt;&gt;
 &lt;&lt;Starting on login1&gt;&gt;
-  OMP_PLACES = &#39;{0},{5},{9},{13}&#39;
-1000x100: Ref:   1.9828 s, This:   0.0833 s, speedup:    23.80
+  OMP_PLACES=&#39;{0},{5},{9},{13}&#39; custom
+1000x1000: Ref:   2.3101 s, This:   0.6884 s, speedup:     3.36
 </pre>
 </div>
 </div>
@@ -14637,12 +16030,21 @@ Affinity: 0,5,9,13
 </div>
 </div>
 
+</div>
+<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
+</div><div class="inner_cell">
+<div class="text_cell_render border-box-sizing rendered_html">
+<p>Likewise, we see a higher speedup when the threads are bound to different cores rather than to a single core. This hands-on illustrates that, apart from compiler-level tuning, system-level tuning is equally important for obtaining performance improvements.</p>
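To compare runs more systematically, the summary line printed by <code>poisson2d</code> can be parsed and turned into a parallel efficiency (speedup divided by thread count). This is a small sketch with hypothetical helper names; the regular expression is an assumption based only on the output format visible in this Notebook.

```python
import re

# Matches lines such as:
# "1000x1000: Ref:   2.2617 s, This:   0.6936 s, speedup:     3.26"
RESULT_RE = re.compile(
    r"Ref:\s*([\d.]+) s, This:\s*([\d.]+) s, speedup:\s*([\d.]+)"
)


def parse_result(line):
    """Return (reference time, current time, speedup) from a summary line."""
    match = RESULT_RE.search(line)
    if match is None:
        raise ValueError("not a poisson2d summary line: {!r}".format(line))
    ref, this, speedup = (float(group) for group in match.groups())
    return ref, this, speedup


def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling that was reached."""
    return speedup / threads


line = "1000x1000: Ref:   2.2617 s, This:   0.6936 s, speedup:     3.26"
ref, this, speedup = parse_result(line)
print("efficiency with 4 threads: {:.2f}".format(parallel_efficiency(speedup, 4)))
```

An efficiency close to 1 means the threads scale almost ideally; a value that drops as threads are added suggests that threads are sharing core resources or that the binding is not optimal.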
+
+</div>
+</div>
 </div>
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
 <h4 id="References">References<a class="anchor-link" href="#References">&#182;</a></h4><ol>
 <li><a href="https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html">https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html</a></li>
+<li><a href="https://www.openmp.org/spec-html/5.0/openmpse53.html">https://www.openmp.org/spec-html/5.0/openmpse53.html</a></li>
 </ol>
 
 </div>
@@ -14660,7 +16062,7 @@ Affinity: 0,5,9,13
 <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
 </div><div class="inner_cell">
 <div class="text_cell_render border-box-sizing rendered_html">
-<h1 id="Survey">Survey<a name="survey" /><a class="anchor-link" href="#Survey">&#182;</a></h1><p>Please rememeber to take some time and fill out the <a href="http://bit.ly/sc18-eval">survey</a>.</p>
+<h1 id="Survey">Survey<a name="survey" /><a class="anchor-link" href="#Survey">&#182;</a></h1><p>Please remember to take some time and fill out the <a href="http://bit.ly/sc19-eval">survey</a>.</p>
 
 </div>
 </div>
diff --git a/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.ipynb b/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.ipynb
index e6a4bae373575fc8c12ea313f38d5c7f431b582c..5208f9bd748fa633251b20a7cfba86bcd94f7acf 100644
--- a/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.ipynb
+++ b/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.ipynb
@@ -1 +1,2416 @@
-{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Hands-On Performance Optimization\n", "_Supercomputing 2018 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 12th 2018_\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["As for the first task of this tutorial, also this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done using an SSH connection to Ascent (or any other POWER9 computer) in your terminal.\n", "\n", "## Jupyter notebook execution\n", "\n", "When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizng the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n", "\n", "You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.\n", "\n", "If you want you also can get a [terminal](/terminals/1) in your browser.\n", "\n", "## Terminal fallback\n", "\n", "The tasks are place in directories named `Task[1-3]`.\n", "\n", "Makefile targets are created to cover everything, from compile, to run and profile. Please take a look at the cells containing the make calls as a guide also for the non-interactive version of this description."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Setup\n", "\n", "This hands-on session requires of GCC 6.4.0. By loading the `sc18/handson2` module before invoking this Notebook, we took care of also loading GCC 6.4.0 into the environment."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Tasks<a name=\"top\"></a>\n", "\n", "This session comes with multiple tasks, each one to be found in the respective sub-directory `Task[1-3]`. 
In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n", "\n", "Please choose from the task below.\n", "\n", "\n", "* [Task 1](#task1): Compile Flags  \n", "Improve performance of the CPU Jacobi solver with compiler flags such as `Ofast` and profile-directed feedback ([Solution 1](#solution0))\n", "\n", "* [Task 2](#task2): Software Prefetching  \n", "Improve performance of the CPU Jacobi solver with software prefetching ([Solution 2](#solution1))\n", "\n", "* [Task 3](#task3): OpenMP  \n", "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance ([Solution 3](#solution2))\n", "  \n", "* [Suvery](#survey) Please remember to take the survey !\n", "    \n", "### Make Targets <a name=\"make\"></a>\n", "\n", "For all tasks we have defined the following make targets. \n", "\n", "* __poisson2d__:  \n", "  build `poisson2d` binary (default)\n", "* __run__:  \n", "   run `poisson2d` with default parameters\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Task 1: Compile Flags <a name=\"task1\"></a>\n", "\n", "\n", "### Overview\n", "\n", "The goal of this task is to understand different options available to optimize the performance of the CPU Jacobi solver  \n", "\n", "Your task is to:\n", "\n", "* Optimize performance with `-Ofast` flag\n", "* Optimize performance with profile directed feedback \n", "\n", "First, change the working directory to `Task1`."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task1\n"]}], "source": ["%cd Task1"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part A: `-Ofast` vs. 
`-O3`\n", "\n", "We are to compare the performance of the binary being compiled with `-Ofast` optimization and with `-O3` optimization. Right now, the Makefile specifies `-O3` as the optimization flag. Compile the code using `make` and run it with `make run` in the next two cells."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -O3 -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"]}], "source": ["!make"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n", "Job <5033> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "1.13user 0.00system 0:01.15elapsed 97%CPU (0avgtext+0avgdata 10944maxresident)k\n", "2560inputs+0outputs (1major+264minor)pagefaults 0swaps\n"]}], "source": ["!make run"]}, {"cell_type": "markdown", "metadata": {}, "source": ["You can use the GNU _perf_ tool to profile the application using the `perf` command (see below) and see the top time-consuming functions."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Jacobi relaxation calculation: max 500 
iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "[ perf record: Woken up 1 times to write data ]\n", "[ perf record: Captured and wrote 0.172 MB perf.O3.data (4125 samples) ]\n", "# To display the perf.data header info, please use --header/--header-only options.\n", "#\n", "#\n", "# Total Lost Samples: 0\n", "#\n", "# Samples: 4K of event 'cycles:u'\n", "# Event count (approx.): 3867635297\n", "#\n", "# Overhead  Command    Shared Object      Symbol                                  \n", "# ........  .........  .................  ........................................\n", "#\n", "    72.02%  poisson2d  poisson2d          [.] 00000040.plt_call.fmax@@GLIBC_2.17\n", "    10.16%  poisson2d  poisson2d          [.] poisson2d_reference\n", "     9.99%  poisson2d  poisson2d          [.] main\n", "     4.69%  poisson2d  libc-2.17.so       [.] __memcpy_power7\n", "     2.23%  poisson2d  libm-2.17.so       [.] __fmaxf\n", "     0.75%  poisson2d  libm-2.17.so       [.] __exp_finite\n", "     0.07%  poisson2d  poisson2d          [.] 00000040.plt_call.memcpy@@GLIBC_2.17\n", "     0.02%  poisson2d  poisson2d          [.] check_results\n", "     0.02%  poisson2d  libm-2.17.so       [.] __GI___exp\n", "     0.01%  poisson2d  ld-2.17.so         [.] _dl_relocate_object\n", "     0.01%  poisson2d  [kernel.kallsyms]  [k] arch_local_irq_restore\n", "     0.00%  poisson2d  ld-2.17.so         [.] _dl_new_object\n", "     0.00%  poisson2d  ld-2.17.so         [.] 
_start\n", "\n", "\n", "#\n", "# (Tip: Show user configuration overrides: perf config --user --list)\n", "#\n"]}], "source": ["# perf record creates a perf.data file \n", "!perf record -o perf.O3.data -e cycles ./poisson2d\n", "# perf report opens the perf.data file \n", "!perf report -i perf.O3.data | cat"]}, {"cell_type": "markdown", "metadata": {}, "source": ["**TASK**: Now change the optimization flag in the [Makefile](/edit/Task1/Makefile) to `-Ofast` and repeat the steps in the following cell. In case you follow along non-interactive, call `make` and `make run` in your shell. (If you are in the Jupyter Notebook, you can actually click the link of the [Makefile](/edit/Task1/Makefile). In other cases, use `vim` which is installed on Ascent.)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"]}], "source": ["!make"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n", "Job <5034> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.51user 0.00system 0:00.52elapsed 99%CPU (0avgtext+0avgdata 
10816maxresident)k\n", "256inputs+0outputs (0major+264minor)pagefaults 0swaps\n"]}], "source": ["!make run"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "[ perf record: Woken up 1 times to write data ]\n", "[ perf record: Captured and wrote 0.086 MB perf.Ofast.data (1889 samples) ]\n", "# To display the perf.data header info, please use --header/--header-only options.\n", "#\n", "#\n", "# Total Lost Samples: 0\n", "#\n", "# Samples: 1K of event 'cycles:u'\n", "# Event count (approx.): 1765737747\n", "#\n", "# Overhead  Command    Shared Object  Symbol                 \n", "# ........  .........  .............  .......................\n", "#\n", "    44.65%  poisson2d  poisson2d      [.] main\n", "    43.84%  poisson2d  poisson2d      [.] poisson2d_reference\n", "    10.28%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n", "     1.12%  poisson2d  libm-2.17.so   [.] __exp_finite\n", "     0.05%  poisson2d  poisson2d      [.] check_results\n", "     0.03%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n", "     0.02%  poisson2d  libc-2.17.so   [.] __readdir64\n", "     0.01%  poisson2d  ld-2.17.so     [.] _dl_new_object\n", "     0.00%  poisson2d  ld-2.17.so     [.] 
_start\n", "\n", "\n", "#\n", "# (Tip: System-wide collection from all CPUs: perf record -a)\n", "#\n"]}], "source": ["# perf record creates a perf.data file \n", "!perf record -o perf.Ofast.data -e cycles ./poisson2d\n", "# perf report opens the perf.data file \n", "!perf report -i perf.Ofast.data | cat"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If `perf` is unavailable to you on other machines, you can also study the disassembly with `objdump`: `objdump -lSd ./poisson2d > poisson2d.dis` (feel free to experiment with this in the Notebook as well, just prefix the command with a `!` to execute it.)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["####  Interpretation\n", "\n", "Depending on the application requirement, if a high precision of results is not mandatory, the users can compile an application with `-Ofast` which enables `\u2013ffast-math` option that implements the same math function in a relaxed manner very similar to how general mathematical expressions are implemented and avoids the overhead of calling a function from the math library. Comparing the files, you will see that the `-Ofast` binary natively implements the `fmax` function using instructions available in the hardware. The `-O3` binary makes a library call to compute `fmax` to follow a stricter _IEEE_ requirement for accuracy."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part B: Profile-directed Feedback\n", "\n", "For the first level of optimization we saw `Ofast` cut the execution time of the `O3` binary by almost half.\n", "\n", "We can optimize the performance further by using profile directed feedback optimization.\n", "\n", "To compile using profile directed feedback with the GCC compiler we need to do the following steps\n", "\n", "1. We need to first build a training binary using `-fprofile-generate`; this instructs the compiler to record hot path information \n", "2. 
Run the training binary with a smaller input size; you should see a `.gcda` file generated which stores hot path information for further optimization by the compiler \n", "3. build the final binary using `-fprofile-use` which uses the profile information in the `.gcda` file \n", "4. Compare the performance of the final binary with the original `Ofast` binary \n", "\n", "**TASK**: First, search for `TODO1` in the [Makefile](/edit/Task1/Makefile). It defines an additional compilation flag for `gcc`. Insert `-fprofile-generate=FOLDER` there with FOLDER pointing to `$$SC18_DIR_SCRATCH`, your personal write-directory (the double dollar signs are intentional as they are used to escape in the GNU Make syntax).\n", "\n", "After editing, run the following two cells to train your program."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-generate=$SC18_DIR_SCRATCH\" -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-generate=$SC18_DIR_SCRATCH\" -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_train  -lm\n"]}], "source": ["!make poisson2d_train"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_train 200 64 64\n", "Job <5035> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 200 iterations on 64 x 64 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.248743\n", "  100, 0.124046\n", "Calculate current execution.\n", "    0, 0.248743\n", "  100, 0.124046\n", "0.00user 0.00system 0:00.10elapsed 
5%CPU (0avgtext+0avgdata 5248maxresident)k\n", "512inputs+0outputs (0major+115minor)pagefaults 0swaps\n", "mv $SC18_DIR_SCRATCH/*.gcda .\n"]}], "source": ["!make run_train"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, a `.gcda` file exists in the directory which can be used for an profile-accelerated subsequent run.\n", "\n", "**TASK**: Edit the [Makefile](/edit/Task1/Makefile) again, this time modifying `TODO2` to be equivalent to `-fprofile-use`. A directory is not needed as we copied the gcda file into the current directory.\n", "\n", "Run the following cells in order to build using the newly added flag and then run with the profile-accelerated version."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-use\" -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE  -mvsx -maltivec  \"-fprofile-use\" -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_profile  -lm\n"]}], "source": ["!make poisson2d_profile"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_profile\n", "Job <5036> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.47user 0.00system 0:00.48elapsed 98%CPU (0avgtext+0avgdata 
10816maxresident)k\n", "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"]}], "source": ["!make run_profile"]}, {"cell_type": "markdown", "metadata": {}, "source": ["What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### References\n", "\n", "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n", "2. https://perf.wiki.kernel.org/index.php/Tutorial"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Task 2:<a name=\"task2\"></a> Software Pretechting\n", "\n", "\n", "### Overview\n", "\n", "Study the difference of program execution time of different optimization levels with and without software prefetching.\n", "\n", "First, change directory to that of Task 2"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task2\n"]}], "source": ["%cd ../Task2"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part A: Running"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Look at the [Makefile](/edit/Task2/Makefile) and work on the TODOs. Please implement compile flags as mentioned in the Makefile target name.\n", "\n", "Afterwards, compile each target with the following cells and submit them to the batch system. 
Follow along accordingly in the non-interactive version of this Notebook."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_pref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_pref\n", "Job <5037> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "1.12user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10880maxresident)k\n", "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"]}], "source": ["!make run_o3_pref"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fprefetch-loop-arrays -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_pref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_pref\n", "Job <5038> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU 
execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.77user 0.00system 0:00.77elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k\n", "256inputs+0outputs (0major+264minor)pagefaults 0swaps\n"]}], "source": ["!make run_ofast_pref"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -O3 -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_o3_nopref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_o3_nopref\n", "Job <5039> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "1.13user 0.00system 0:01.13elapsed 99%CPU (0avgtext+0avgdata 10944maxresident)k\n", "256inputs+0outputs (0major+266minor)pagefaults 0swaps\n"]}], "source": ["!make run_o3_nopref"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE  -mvsx -maltivec  -Ofast -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d_ofast_nopref  -lm\n", "bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d_ofast_nopref\n", "Job <5040> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", 
"<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "0.82user 0.00system 0:00.82elapsed 99%CPU (0avgtext+0avgdata 10816maxresident)k\n", "256inputs+0outputs (0major+265minor)pagefaults 0swaps\n"]}], "source": ["!make run_ofast_nopref"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Do you notice the impact difference with optimization levels? It's always important to carefully study the interplay of flags."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part B: Analysis of Instructions\n", "\n", "Compilation with the software prefetching flag causes the compiler to generate the `__dcbt` and `__dcbtst`  instructions that prefetch memory values to L3.\n", "\n", "Verify it using `objdump -lSd` on each file (`poisson2d_o3_pref`, `poisson2d_ofast_pref`, `poisson2d_o3_nopref`, `poisson2d_ofast_nopref`). 
You might want to grep for `dcb`."]}, {"cell_type": "code", "execution_count": 19, "metadata": {"sc18": "solution"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["poisson2d_o3_pref:\n", "poisson2d_ofast_pref:\n", "    10000da0:\tec f1 00 7c \tdcbtst  0,r30\n", "    10000da4:\t2c fa 00 7c \tdcbt    0,r31\n", "    10000da8:\t2c 62 00 7c \tdcbt    0,r12\n", "    10000dac:\t2c b2 00 7c \tdcbt    0,r22\n", "    10000dcc:\t2c e2 00 7c \tdcbt    0,r28\n", "    10000dd0:\t2c ea 00 7c \tdcbt    0,r29\n", "    100010b4:\t2c 62 00 7c \tdcbt    0,r12\n", "    100010b8:\t2c 5a 00 7c \tdcbt    0,r11\n", "    100010c4:\tec 19 00 7c \tdcbtst  0,r3\n", "    100010cc:\t2c 22 00 7c \tdcbt    0,r4\n", "    100010d0:\t2c ea 00 7c \tdcbt    0,r29\n", "    100010d4:\t2c f2 00 7c \tdcbt    0,r30\n", "    100010dc:\t2c fa 00 7c \tdcbt    0,r31\n", "poisson2d_o3_nopref:\n", "poisson2d_ofast_nopref:\n"]}], "source": ["for f in [\"poisson2d_o3_pref\", \"poisson2d_ofast_pref\", \"poisson2d_o3_nopref\", \"poisson2d_ofast_nopref\"]:\n", "    print(\"{}:\".format(f))\n", "    objdump -lSd $f |\u00a0grep dcb"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you feel up to the task, you can study the number of L3 cache misses using the corresponding performance counter, `PM_L3_MISS`. 
Either use your knowledge from Hands-On 1, or use the following call to `perf`, in which we already converted the named counter to a raw counter address."]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Job <5048> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "\n", " Performance counter stats for './poisson2d_ofast_nopref':\n", "\n", "        2829292169      cycles:u                                                    \n", "         136018637      r168a4:u                                                    \n", "\n", "       0.826136863 seconds time elapsed\n", "\n", "Job <5049> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "\n", " Performance counter stats for './poisson2d_ofast_pref':\n", "\n", "        2654990243      cycles:u                                                    \n", "         128824827      r168a4:u                                                    \n", "\n", "       0.775593651 seconds time elapsed\n", "\n"]}], "source": ["for f in [\"poisson2d_ofast_nopref\", \"poisson2d_ofast_pref\"]:\n", 
"    !eval $$SC18_SUBMIT_CMD perf stat -e cycles,r168a4 ./$f\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### References\n", "\n", "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n", "2. https://www.gnu.org/software/gcc/projects/prefetch.html"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Task 3: OpenMP\n", "<a name=\"task3\"></a>\n", "\n", "\n", "### Overview\n", "\n", "We add OpenMP shared-memory parallelism to the application. Also, we study the effect of binding the multi-thread processes to certain cores.\n", "\n", "First, we need to change directory to that of Task3."]}, {"cell_type": "code", "execution_count": 1, "metadata": {"ExecuteTime": {"end_time": "2018-11-07T13:47:57.724441Z", "start_time": "2018-11-07T13:47:57.718745Z"}}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/autofs/nccsopen-svm1_home/aherten/SC18-Tutorial/3-Optimizing_POWER/Handson/Task3\n"]}], "source": ["%cd ../Task3"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part A: Implement OpenMP Pragmas; Compilation\n", "\n", "**Task**: Please add the correct OpenMP pragmas to the source code and compilations flags to enable OpenMP.\n", "\n", "* **pragmas**: Look at the TODOs in [`poisson2d.c`](/edit/Task3/poisson2d.c) to add OpenMP parallelism. The pragmas in question are `#pragma  omp parallel for`\n", "* **Compilation**: Please add compilation flags enabling OpenMP in GCC to the [Makefile](/edit/Task3/Makefile). The flag in question is `-fopenmp`.\n", "\n", "Edit the files with the links above if you are running the interactive version of the Notebook or navigate to `poisson2d.c` and `Makefile` yourself in case you run the non-interactive version.\n", "\n", "Afterwards, compile and run the application with the following cells. 
Non-interactive: Follow along accordingly in the shell."]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["/sw/ascent/gcc/6.4.0/bin/gcc -c -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  poisson2d_reference.c -o poisson2d_reference.o  -lm\n", "/sw/ascent/gcc/6.4.0/bin/gcc -std=c99 -mcpu=power9 -Ofast -DUSE_DOUBLE -mvsx -maltivec  -fopenmp poisson2d.c poisson2d_reference.o -o poisson2d  -lm\n"]}], "source": ["!make poisson2d"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["bsub -W 60 -nnodes 1 -Is -P GEN111 jsrun -n 1 -c 1 -g ALL_GPUS time ./poisson2d\n", "Job <5052> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 500 iterations on 500 x 500 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "Calculate current execution.\n", "    0, 0.249980\n", "  100, 0.246028\n", "  200, 0.242198\n", "  300, 0.238487\n", "  400, 0.234887\n", "500x500: Ref:   0.2571 s, This:   0.2946 s, speedup:     0.87\n", "1.48user 0.00system 0:00.56elapsed 263%CPU (0avgtext+0avgdata 9664maxresident)k\n", "0inputs+0outputs (0major+273minor)pagefaults 0swaps\n"]}], "source": ["!make run"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The command to submit a job to the batch system is prepared in an environment variable `$SC18_SUBMIT_CMD`; use it together with `eval`. 
In the following cell, it is shown how to increase the work of the application."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Job <5344> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation calculation: max 1000 iterations on 1000 x 100 mesh\n", "Calculate reference solution and time with serial CPU execution.\n", "    0, 0.249743\n", "  100, 0.210080\n", "  200, 0.184635\n", "  300, 0.166526\n", "  400, 0.152783\n", "  500, 0.141890\n", "  600, 0.132978\n", "  700, 0.125511\n", "  800, 0.119142\n", "  900, 0.113632\n", "Calculate current execution.\n", "    0, 0.249743\n", "  100, 0.210080\n", "  200, 0.184635\n", "  300, 0.166526\n", "  400, 0.152783\n", "  500, 0.141890\n", "  600, 0.132978\n", "  700, 0.125511\n", "  800, 0.119142\n", "  900, 0.113632\n", "1000x100: Ref:   1.9872 s, This:   0.2385 s, speedup:     8.33\n"]}], "source": ["!eval $SC18_SUBMIT_CMD ./poisson2d 1000 1000"]}, {"cell_type": "markdown", "metadata": {}, "source": ["What is the best performance you can reach by setting the number of threads via `OMP_NUM_THREADS=N` with `N` being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.  \n", "We added `--bind none` to prevent `jsrun`, the scheduler of Ascent, from overlaying binding options. 
Also, we use `-c ALL_CPUS` to make all CPUs on the compute nodes available to you."]}, {"cell_type": "code", "execution_count": 23, "metadata": {"sc18": "solution"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Threads: 1\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3037 s, This:   2.8420 s, speedup:     0.81\n", "Threads: 2\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.2998 s, This:   1.4320 s, speedup:     1.61\n", "Threads: 4\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3135 s, This:   0.7168 s, speedup:     3.23\n", "Threads: 8\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3145 s, This:   0.5278 s, speedup:     4.39\n", "Threads: 10\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3153 s, This:   0.4848 s, speedup:     4.78\n", "Threads: 20\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3190 s, This:   0.2016 s, speedup:    11.50\n", "Threads: 40\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "1000x1000: Ref:   2.3243 s, This:   0.3057 s, speedup:     7.60\n"]}], "source": ["for omp_num in [1, 2, 4, 8, 10, 20, 40]:\n", "    print(\"Threads: {}\".format(omp_num))\n", "    !eval OMP_NUM_THREADS=$omp_num $$SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 | grep speedup"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Part B: Bindings\n", "\n", "Different CPU architectures and models come with different configuration of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!\n", "\n", "There are applications which can be used to determine the configuration of the processor. Among those are:\n", "\n", "* `lscpu`: Can be used to determine the number of sockets, number of cores, and numb of threads. 
It gives a very good overview and is available on most Linux systems.\n", "* `ppc64_cpu --smt`: Specifically for POWER, this tool can give information about the number of simulations threads running per core (*SMT*, Simulataion Multi-Threading).\n", "\n", "Run `ppc64_cpu --smt` to find out about the threading configuration of Ascent!"]}, {"cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Job <5076> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "SMT=4\n"]}], "source": ["!eval $SC18_SUBMIT_CMD ppc64_cpu --smt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are more sources information available\n", "\n", "* `/proc/cpuinfo`: Holds information about virtual cores, including model and clock speed. Available on most Linux system. Usually used together with `cat`\n", "* `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`: Holds information about thread siblings for given CPU core (`cpu0` in this case). Use it to find out which thread is mapped to which core."]}, {"cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0-3\n", "4-7\n"]}], "source": ["!cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n", "!cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are various environment variables available within OpenMP (and GCC) to specify binding of threads to cores. See, for instance, the [online documentation of GCC libgomp](https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html). Examples are `OMP_PLACES` or `GOMP_CPU_AFFINITY`.\n", "\n", "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. 
Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n", "\n", "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n", "\n", "What's your maximum speedup?"]}, {"cell_type": "code", "execution_count": 24, "metadata": {"sc18": "solution"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Affinity: 0,1,2,3\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "  OMP_PLACES = '{0},{1},{2},{3}'\n", "1000x100: Ref:   1.9854 s, This:   0.2326 s, speedup:     8.53\n", "Affinity: 0,5,9,13\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "  OMP_PLACES = '{0},{5},{9},{13}'\n", "1000x100: Ref:   1.9828 s, This:   0.0833 s, speedup:    23.80\n"]}], "source": ["for affinity in [\"0,1,2,3\", \"0,5,9,13\"]:\n", "    print(\"Affinity: {}\".format(affinity))\n", "    !eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=$affinity OMP_NUM_THREADS=4 $$SC18_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep \"OMP_PLACES\\|speedup\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### References\n", "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[Back to Top](#top)\n", "\n", "---"]}, {"cell_type": "markdown", "metadata": {}, "source": ["# Survey<a name=\"survey\"></a>\n", "\n", "Please rememeber to take some time and fill out the [survey](http://bit.ly/sc18-eval)."]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7"}}, "nbformat": 4, "nbformat_minor": 2}
\ No newline at end of file
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Hands-On Performance Optimization\n",
+    "_Supercomputing 2019 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 18th 2019_\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Like the first task of this tutorial, this task is primarily designed to be executed as an interactive Jupyter Notebook. However, everything can also be done in your terminal over an SSH connection to Ascent (or any other POWER9 computer).\n",
+    "\n",
+    "## Jupyter notebook execution\n",
+    "\n",
+    "When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizing the code, the output will be replaced. You can, however, duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n",
+    "\n",
+    "You will always find links to a file browser of the corresponding task subdirectory, as well as direct links to the source files you need to edit and to the profiling output you need to open locally.\n",
+    "\n",
+    "If you want, you can also get a terminal in your browser; just open it via the \u00bbNew Launcher\u00ab button (`+`).\n",
+    "\n",
+    "## Terminal fallback\n",
+    "\n",
+    "The tasks are placed in directories named `Task[1-3]`.\n",
+    "\n",
+    "Makefile targets cover everything from compiling to running and profiling. The cells containing the `make` calls also serve as a guide for the non-interactive version of this description."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "We rely on some very recent compiler features and therefore use GCC 9.2.0. It should already be in your environment. Let's check!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc (GCC) 9.2.0\n",
+      "Copyright (C) 2019 Free Software Foundation, Inc.\n",
+      "This is free software; see the source for copying conditions.  There is NO\n",
+      "warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!gcc --version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Tasks<a name=\"top\"></a>\n",
+    "\n",
+    "This session comes with multiple tasks, each one to be found in the respective sub-directory `Task[1-3]`. In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n",
+    "\n",
+    "Please choose from the tasks below.\n",
+    "\n",
+    "\n",
+    "* [Task 1](#task1): __Basic compiler optimization flags and compiler annotations__\n",
+    "\n",
+    "Improve performance of the CPU Jacobi solver with compiler flags such as `Ofast` and profile-directed feedback. Learn about compiler annotations.\n",
+    "\n",
+    "* [Task 2](#task2): __Optimization via Prefetching controlled by compiler__\n",
+    "\n",
+    "Improve performance of the CPU Jacobi solver with software prefetching. Some compilers such as IBM XL define flags that can be used to modify the aggressiveness of the hardware prefetcher. Learn to modify the DSCR value through XL and study the impact on application performance. \n",
+    "* [Task 3](#task3): __Optimization via OpenMP controlled by compiler and the system__\n",
+    "\n",
+    "Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance. \n",
+    "  \n",
+    "* [Survey](#survey): Please remember to take the survey!\n",
+    "    \n",
+    "### Make Targets <a name=\"make\"></a>\n",
+    "\n",
+    "For all tasks we have defined the following make targets. \n",
+    "\n",
+    "* __poisson2d__:  \n",
+    "  build `poisson2d` binary (default)\n",
+    "* __run__:  \n",
+    "   run `poisson2d` with default parameters\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 1: Basic compiler optimization flags and compiler annotations <a name=\"task1\"></a>\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "The goal of this task is to understand the different options available for optimizing the performance of the CPU Jacobi solver.\n",
+    "\n",
+    "Your task is to:\n",
+    "\n",
+    "* Optimize performance with the `-Ofast` flag\n",
+    "* Verify the cause of the performance improvement by viewing perf profiles of the `-O3` and `-Ofast` binaries\n",
+    "* Optimize performance with profile-directed feedback\n",
+    "* Generate compiler annotations/remarks to understand the optimizations done by the compiler with and without profile-directed feedback\n",
+    "\n",
+    "\n",
+    "First, change the working directory to `Task1`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task1\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd Task1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: `-Ofast` vs. `-O3`\n",
+    "\n",
+    "We compare the performance of the binary compiled with `-Ofast` against the one compiled with `-O3`. As in the previous task, we use a `Makefile` for compilation. The `Makefile` targets `poisson2d_O3` and `poisson2d_Ofast` are already prepared.\n",
+    "\n",
+    "**TASK**: Add `-O3` as the optimization flag for the `poisson2d_O3` target by using the corresponding `CFLAGS` definition. There are notes relating to Task 1 in the header of the `Makefile`. Compile the code using `make` as indicated below and run it with the `make` targets `run`, `run_perf` and `run_perf_recrep`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 84,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -c -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d_reference.c -o poisson2d_reference.o  -lm\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -O3   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_O3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 73,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24897> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's have a look at the output of the `Makefile` target `run_perf`. It invokes the Linux _perf_ tool to print the number of instructions executed and the number of cycles the POWER9 processor took to execute the program. Feel free to add further counters to this call to _perf_."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 74,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24898> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16264721613      cycles:u                                                    \n",
+      "       28463907825      instructions:u            #    1.75  insn per cycle                                            \n",
+      "\n",
+      "       4.738444892 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
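+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The instructions-per-cycle (IPC) value that _perf_ prints is simply the ratio of the two counters. As a minimal sketch, the following Python snippet recomputes it from the counter lines of the run above. The helper name `parse_perf_stat` is hypothetical and not part of _perf_ or of this tutorial's `Makefile`; the numbers are copied from our run and will differ on your system.\n",
+    "\n",
+    "```python\n",
+    "def parse_perf_stat(lines):\n",
+    "    # Collect 'count event' pairs from perf stat output lines.\n",
+    "    counters = {}\n",
+    "    for line in lines:\n",
+    "        tokens = line.split()\n",
+    "        # Counter lines start with an integer followed by the event name.\n",
+    "        if len(tokens) >= 2 and tokens[0].isdigit():\n",
+    "            counters[tokens[1]] = int(tokens[0])\n",
+    "    return counters\n",
+    "\n",
+    "# Counter lines copied from the perf stat output of the run above.\n",
+    "sample = [\n",
+    "    '       16264721613      cycles:u',\n",
+    "    '       28463907825      instructions:u            #    1.75  insn per cycle',\n",
+    "    '       4.738444892 seconds time elapsed',\n",
+    "]\n",
+    "\n",
+    "counters = parse_perf_stat(sample)\n",
+    "ipc = counters['instructions:u'] / counters['cycles:u']\n",
+    "print(f'IPC = {ipc:.2f}')\n",
+    "```\n",
+    "\n",
+    "The same helper should also pick up additional events passed via `-e`, as long as _perf_ prints them in the same `count event` layout."
+   ]
+  },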
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we run the `Makefile` target `run_perf_recrep`, which prints the hottest routines of the application by combining `perf record ./app` and `perf report`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 75,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24899> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "[ perf record: Woken up 3 times to write data ]\n",
+      "[ perf record: Captured and wrote 0.739 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (19102 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24900> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "# To display the perf.data header info, please use --header/--header-only options.\n",
+      "#\n",
+      "#\n",
+      "# Total Lost Samples: 0\n",
+      "#\n",
+      "# Samples: 19K of event 'cycles:u'\n",
+      "# Event count (approx.): 16254596654\n",
+      "#\n",
+      "# Overhead  Command    Shared Object  Symbol                                  \n",
+      "# ........  .........  .............  ........................................\n",
+      "#\n",
+      "    65.50%  poisson2d  poisson2d      [.] 00000038.plt_call.fmax@@GLIBC_2.17\n",
+      "    21.21%  poisson2d  poisson2d      [.] main\n",
+      "     9.18%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "     3.28%  poisson2d  libm-2.17.so   [.] __fmaxf\n",
+      "     0.74%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.01%  poisson2d  libm-2.17.so   [.] __GI___exp\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _wordcopy_fwd_aligned\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
+      "\n",
+      "\n",
+      "#\n",
+      "# (Tip: Limit to show entries above 5% only: perf report --percent-limit 5)\n",
+      "#\n"
+     ]
+    }
+   ],
+   "source": [
+    "# run_perf_recrep displays the top hot routines \n",
+    "!make run_perf_recrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Now add the optimization flag `-Ofast` to the `CFLAGS` for target `poisson2d_Ofast`. Compile the program with the target `poisson2d_Ofast`, then run and analyse it as before with `run`, `run_perf` and `run_perf_recrep`.\n",
+    "\n",
+    "What difference do you see?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24901> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.41user 0.00system 0:02.41elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_Ofast \n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Again, run a `perf`-instrumented version:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24902> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        8258991976      cycles:u                                                    \n",
+      "       12013091172      instructions:u            #    1.45  insn per cycle                                            \n",
+      "\n",
+      "       2.408703909 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Generate the list of top routines in terms of hotness:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 78,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf record -e cycles --output=/gpfs/wolf/trn003/scratch/aherten//cycles.data ./poisson2d\n",
+      "Job <24903> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "[ perf record: Woken up 2 times to write data ]\n",
+      "[ perf record: Captured and wrote 0.382 MB /gpfs/wolf/trn003/scratch/aherten//cycles.data (9728 samples) ]\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf report -i /gpfs/wolf/trn003/scratch/aherten//cycles.data  --stdio\n",
+      "Job <24904> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "# To display the perf.data header info, please use --header/--header-only options.\n",
+      "#\n",
+      "#\n",
+      "# Total Lost Samples: 0\n",
+      "#\n",
+      "# Samples: 9K of event 'cycles:u'\n",
+      "# Event count (approx.): 8268811890\n",
+      "#\n",
+      "# Overhead  Command    Shared Object  Symbol                                  \n",
+      "# ........  .........  .............  ........................................\n",
+      "#\n",
+      "    81.12%  poisson2d  poisson2d      [.] main\n",
+      "    17.97%  poisson2d  libc-2.17.so   [.] __memcpy_power7\n",
+      "     0.79%  poisson2d  libm-2.17.so   [.] __exp_finite\n",
+      "     0.04%  poisson2d  poisson2d      [.] 00000038.plt_call.memcpy@@GLIBC_2.17\n",
+      "     0.02%  poisson2d  ld-2.17.so     [.] do_lookup_x\n",
+      "     0.01%  poisson2d  libc-2.17.so   [.] vfprintf@@GLIBC_2.17\n",
+      "     0.01%  poisson2d  libc-2.17.so   [.] _dl_addr\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_relocate_object\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] check_match.10253\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] _dl_lookup_symbol_x\n",
+      "     0.01%  poisson2d  ld-2.17.so     [.] strcmp\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] open_path\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] init_tls\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _dl_sysdep_start\n",
+      "     0.00%  poisson2d  ld-2.17.so     [.] _start\n",
+      "\n",
+      "\n",
+      "#\n",
+      "# (Tip: For tracepoint events, try: perf report -s trace_fields)\n",
+      "#\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf_recrep"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If `perf` is unavailable to you on other machines, you can also study the disassembly with `objdump`: `objdump -lSd ./poisson2d > poisson2d.dis` (feel free to experiment with this in the Notebook as well, just prefix the command with a `!` to execute it.)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "####  Interpretation\n",
+    "\n",
+    "If the application does not require strict floating-point semantics, you can compile it with `-Ofast`, which enables `-ffast-math`. With this option, the compiler may replace calls to math library functions with inline code, much like ordinary arithmetic expressions, and thereby avoid the overhead of a library call. Comparing the two profiles, you will see that the `-Ofast` binary implements `fmax` natively using instructions available in the hardware: the `plt_call.fmax` entry, which accounted for 65.50% of the cycles in the `-O3` profile, no longer appears. The `-O3` binary makes a library call to compute `fmax` in order to follow the stricter _IEEE_ floating-point requirements."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Profile-directed Feedback\n",
+    "\n",
+    "For the first level of optimization, we saw that `-Ofast` cut the execution time of the `-O3` binary almost in half.\n",
+    "\n",
+    "We can optimize the performance further by using profile-directed feedback optimization.\n",
+    "\n",
+    "To compile using profile-directed feedback with GCC, we need to build the application in three stages:\n",
+    "\n",
+    "1. Instrument binary;\n",
+    "2. Run binary with training, gather profile information;\n",
+    "3. Use profile information to generate optimized binary.\n",
+    "\n",
+    "\n",
+    "Step 1 is achieved by compiling the binary with the correct flag \u2013\u00a0`-fprofile-generate`. In our case, we need to specify an output location, which should be `$(SC19_DIR_SCRATCH)`.\n",
+    "\n",
+    "Step 2 consists of a usual, albeit shorter, run of the instrumented binary. The run can be very short, though its parameters need to be representative of the actual run. After the binary has run, an output file (with file extension `.gcda`) is written to the directory specified during compilation.\n",
+    "\n",
+    "For Step 3, the binary is compiled once again, but this time using the `gcda` profile just generated. The corresponding flag is `-fprofile-use`, which we set to `$(SC19_DIR_SCRATCH)` as well.\n",
+    "\n",
+    "In our `Makefile` at hand, we prepared the steps already for you in the form of two targets.\n",
+    "\n",
+    "* `poisson2d_train`: Will compile the instrumented binary that generates the profile\n",
+    "* `poisson2d_ref`: Will take a generated profile and compile a new, optimized binary\n",
+    "\n",
+    "Through the dependencies between these two targets, a profile run is launched in between.\n",
+    "\n",
+    "**TASK**: Edit the [Makefile](`Makefile`) and add the `-fprofile-*` flags to the `CFLAGS` of `poisson2d_train` and\n",
+    "`poisson2d_ref` as outlined in the file.\n",
+    "\n",
+    "After that, you may launch them with the following cells (`gen_profile` is a meta-target and uses `poisson2d_train` and `poisson2d_ref`). If you need to clean the generated profile, you may use `make clean_profile`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
+      "Job <24905> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249490\n",
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_ref -lm \n",
+      "cp poisson2d_ref poisson2d\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make gen_profile"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 80,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24906> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.28user 0.01system 0:02.30elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Great! It is! In our tests, this shaved off another 5%."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's also measure instructions and cycles:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 81,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,instructions ./poisson2d\n",
+      "Job <24907> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        7925983538      cycles:u                                                    \n",
+      "       12253080719      instructions:u            #    1.55  insn per cycle                                            \n",
+      "\n",
+      "       2.313471365 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make run_perf"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is your speed-up? Feel free to run with larger problem sizes (mesh size, number of iterations)."
+   ]
+  },
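+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One way to answer this is to divide the run time of the baseline by the run time of each optimized binary. The following minimal sketch does so for the elapsed times reported by `perf stat` in our runs in this Notebook. The labels are our own, and your timings will differ.\n",
+    "\n",
+    "```python\n",
+    "# Elapsed times in seconds, copied from the perf stat runs in this Notebook.\n",
+    "elapsed = {\n",
+    "    '-O3': 4.738444892,\n",
+    "    '-Ofast': 2.408703909,\n",
+    "    '-Ofast + profile': 2.313471365,\n",
+    "}\n",
+    "\n",
+    "# Speed-up relative to the -O3 baseline: T_baseline / T_optimized.\n",
+    "baseline = elapsed['-O3']\n",
+    "speedups = {name: baseline / seconds for name, seconds in elapsed.items()}\n",
+    "\n",
+    "for name, factor in speedups.items():\n",
+    "    print(f'{name:18s} {elapsed[name]:6.2f} s   speed-up {factor:4.2f}x')\n",
+    "```\n",
+    "\n",
+    "Replace the values in `elapsed` with your own measurements, for example from runs with a larger mesh."
+   ]
+  },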
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Compiler annotations/Remarks\n",
+    "\n",
+    "Most compilers provide an option to emit annotations or remarks. These remarks summarize in detail which optimizations were done and where in the source they were applied. Some options also report optimizations that were missed and the reason why they could not be done.\n",
+    "\n",
+    "To generate compiler annotations using GCC, one uses `-fopt-info-all`. If you only want to see the missed optimizations, use the option `-fopt-info-missed` instead of `-fopt-info-all`. See also the [documentation of GCC regarding the flag](https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info).\n",
+    "\n",
+    "**TASK**: Have a look at the `CFLAGS` of the `Makefile` target `poisson2d_Ofast_info`. Add the flag `-fopt-info-all` to the list of flags. This prints optimization information to the standard error stream, which the Notebook shows together with the regular output. If you would rather write this information to a file, use \u2013\u00a0for example \u2013\u00a0`-fopt-info-all=$(SC19_DIR_SCRATCH)/filename`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 82,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all poisson2d.c poisson2d_reference.o -o poisson2d_Ofast_info  -lm\n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 207->207 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_528, 0, _531);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_543, _539, _549);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_Ofast_info"
+   ]
+  },
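+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Reports like the one above quickly become long, so it can help to count the remarks per kind before reading them in detail. The following is a minimal sketch, assuming the `file:line:col: kind: message` layout seen in the output above. The helper name `summarize_opt_info` is hypothetical and not part of GCC, and the sample lines are copied from our output.\n",
+    "\n",
+    "```python\n",
+    "from collections import Counter\n",
+    "\n",
+    "def summarize_opt_info(lines):\n",
+    "    # Count remarks per kind ('optimized', 'missed', 'note').\n",
+    "    kinds = Counter()\n",
+    "    for line in lines:\n",
+    "        # Remark lines look like 'file:line:col: kind: message'.\n",
+    "        parts = line.split(': ', 2)\n",
+    "        if len(parts) == 3 and parts[1] in ('optimized', 'missed', 'note'):\n",
+    "            kinds[parts[1]] += 1\n",
+    "    return kinds\n",
+    "\n",
+    "# A few lines copied from the compiler output above.\n",
+    "sample = [\n",
+    "    'poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors',\n",
+    "    'poisson2d.c:112:9: missed: not vectorized: control flow in loop.',\n",
+    "    'poisson2d.c:78:27: missed: not vectorized: complicated access pattern.',\n",
+    "    'poisson2d.c:43:5: note: vectorized 1 loops in function.',\n",
+    "    'Inlined 4 calls, eliminated 0 functions',\n",
+    "]\n",
+    "\n",
+    "summary = summarize_opt_info(sample)\n",
+    "for kind, count in sorted(summary.items()):\n",
+    "    print(f'{kind:10s} {count}')\n",
+    "```\n",
+    "\n",
+    "To summarize a full report, you could pass the lines of a file written with `-fopt-info-all=<file>` instead of `sample`. Lines that do not follow the assumed layout, such as the inlining summary, are skipped."
+   ]
+  },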
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's compare this with the output during compilation when using profile-directed feedback from Task 1 B.\n",
+    "\n",
+    "**TASK**: \n",
+    "Adapt the `CFLAGS` of `poisson2d_ref_info` to include `-fopt-info-all` **and** the profile input of `-fprofile-use=\u2026` here. *(Be advised: Long output!)*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 83,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ -Ofast -fprofile-generate=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c  -o poisson2d_train -lm \n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "Increasing alignment of decl: __gcov0.main\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_D_00100_1_main/48 -> __gcov_exit/55, function body not available\n",
+      "poisson2d.c:164:1: missed:   not inlinable: _GLOBAL__sub_I_00100_0_main/47 -> __gcov_init/54, function body not available\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 295->295 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:122:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:88:5: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:72:5: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_632, 0, _239);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_337, ny_124, nx_286);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_64, _135, _313);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_316, error_118);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_127);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_311);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_122);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_129 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_132 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_140 = malloc (8000000);\n",
+      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 53\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 50\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 47\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:122:9: optimized: loop unrolled 3 times (header execution count 9800)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 33\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 30\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 42\n",
+      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 40\n",
+      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 60\n",
+      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 23\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:88:5: optimized: loop unrolled 3 times (header execution count 100)\n",
+      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 12\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
+      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 16\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_init (&*.LPBX0);\n",
+      "poisson2d.c:164:1: missed: statement clobbers memory: __gcov_exit ();\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d_train 100 100 100\n",
+      "Job <24908> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "libgcov profiling error:/gpfs/wolf/trn003/scratch/aherten//#autofs#nccsopen-svm1_home#aherten#SC19-Tutorial#3-Optimizing_POWER#Handson#Task1#poisson2d.gcda:overwriting an existing profile data with a different timestamp\n",
+      "Jacobi relaxation calculation: max 100 iterations on 100 x 100 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249490\n",
+      "echo `date` > /gpfs/wolf/trn003/scratch/aherten//.profile_generated\n",
+      "gcc -std=c99 -mcpu=power9 -DUSE_DOUBLE -mvsx -maltivec -Ofast -fopt-info-all -fprofile-use=/gpfs/wolf/trn003/scratch/aherten/ poisson2d.c poisson2d_reference.o -o poisson2d_ref_info  -lm\n",
+      "poisson2d.c:62:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:61:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:56:14: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:52:20: optimized:   Inlining atoi/24 into main/33 (always_inline).\n",
+      "poisson2d.c:161:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:159:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:158:5: missed:   not inlinable: main/33 -> free/38, function body not available\n",
+      "poisson2d.c:142:31: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:103:5: missed:   not inlinable: main/33 -> __builtin_puts/37, function body not available\n",
+      "poisson2d.c:96:5: missed:   not inlinable: main/33 -> printf/36, function body not available\n",
+      "poisson2d.c:78:29: missed:   not inlinable: main/33 -> exp/35, function body not available\n",
+      "poisson2d.c:68:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:67:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "poisson2d.c:65:41: missed:   not inlinable: main/33 -> malloc/34, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "/usr/include/stdlib.h:280:16: missed:   not inlinable: main/33 -> strtol/39, function body not available\n",
+      "Unit growth for small function inlining: 207->207 (0%)\n",
+      "\n",
+      "Inlined 4 calls, eliminated 0 functions\n",
+      "\n",
+      "consider run-time aliasing test between *_84 and *_87\n",
+      "consider run-time aliasing test between *_92 and *_97\n",
+      "consider run-time aliasing test between *_104 and *_107\n",
+      "consider run-time aliasing test between *_111 and *_115\n",
+      "poisson2d.c:124:13: optimized: Loop 8 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:90:9: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.\n",
+      "poisson2d.c:108:25: missed: couldn't vectorize loop\n",
+      "poisson2d.c:108:25: missed: not vectorized: loop nest containing two or more consecutive inner loops cannot be vectorized\n",
+      "poisson2d.c:136:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:136:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:131:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:131:9: missed: Loop costings may not be worthwhile.\n",
+      "poisson2d.c:122:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:112:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:112:9: missed: not vectorized: control flow in loop.\n",
+      "poisson2d.c:114:13: optimized: loop vectorized using 16 byte vectors\n",
+      "poisson2d.c:88:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:72:5: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:27: missed: not vectorized: complicated access pattern.\n",
+      "poisson2d.c:74:9: missed: couldn't vectorize loop\n",
+      "poisson2d.c:78:29: missed: not vectorized: relevant stmt not supported: _27 = exp (_21);\n",
+      "poisson2d.c:43:5: note: vectorized 1 loops in function.\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:114:13: optimized: loop turned into non-loop; it never loops\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _187 = strtol (_1, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _189 = strtol (_2, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _193 = strtol (_3, 0B, 10);\n",
+      "/usr/include/stdlib.h:280:16: missed: statement clobbers memory: _191 = strtol (_4, 0B, 10);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_153 = malloc (_7);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_155 = malloc (_7);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_157 = malloc (_7);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memset (_524, 0, _527);\n",
+      "poisson2d.c:96:5: missed: statement clobbers memory: printf (\"Jacobi relaxation calculation: max %d iterations on %d x %d mesh\\n\", iter_max_130, ny_139, nx_195);\n",
+      "poisson2d.c:103:5: missed: statement clobbers memory: __builtin_puts (&\"Calculate current execution.\"[0]);\n",
+      "poisson2d.c:43:5: missed: statement clobbers memory: __builtin_memcpy (_539, _535, _544);\n",
+      "poisson2d.c:142:31: missed: statement clobbers memory: printf (\"%5d, %0.6f\\n\", iter_237, error_219);\n",
+      "poisson2d.c:158:5: missed: statement clobbers memory: free (rhs_202);\n",
+      "poisson2d.c:159:5: missed: statement clobbers memory: free (Anew_124);\n",
+      "poisson2d.c:161:5: missed: statement clobbers memory: free (A_123);\n",
+      "poisson2d.c:65:41: missed: statement clobbers memory: A_144 = malloc (8000000);\n",
+      "poisson2d.c:67:41: missed: statement clobbers memory: Anew_143 = malloc (8000000);\n",
+      "poisson2d.c:68:41: missed: statement clobbers memory: rhs_142 = malloc (8000000);\n",
+      "poisson2d.c:136:9: note: considering unrolling loop 7 at BB 47\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:136:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:131:9: note: considering unrolling loop 6 at BB 44\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:131:9: optimized: loop unrolled 7 times (header execution count 9800)\n",
+      "poisson2d.c:122:9: note: considering unrolling loop 5 at BB 40\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:122:9: optimized: loop unrolled 7 times (header execution count 9701)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 13 at BB 27\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+      "poisson2d.c:118:25: note: considering unrolling loop 9 at BB 24\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:112:9: note: considering unrolling loop 14 at BB 37\n",
+      "poisson2d.c:43:5: note: considering unrolling loop 4 at BB 35\n",
+      "poisson2d.c:108:25: note: considering unrolling loop 3 at BB 51\n",
+      "poisson2d.c:88:5: note: considering unrolling loop 2 at BB 18\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:88:5: optimized: loop unrolled 7 times (header execution count 99)\n",
+      "poisson2d.c:74:9: note: considering unrolling loop 11 at BB 9\n",
+      "considering unrolling loop with constant number of iterations\n",
+      "considering unrolling loop with runtime-computable number of iterations\n",
+      "poisson2d.c:74:9: optimized: loop unrolled 3 times (header execution count 9604)\n",
+      "poisson2d.c:72:5: note: considering unrolling loop 1 at BB 14\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_ref_info"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Comparing the annotations generated at plain `-Ofast` with those generated at `-Ofast` with profile-directed feedback, we observe that many more optimizations are possible due to the profile information.\n",
+    "\n",
+    "For instance, you will see annotations such as\n",
+    "```\n",
+    "poisson2d.c:118:25: optimized: loop unrolled 3 times (header execution count 436550)\n",
+    "```\n",
+    "\n",
+    "The execution count is the dynamic execution count of the node (here, the loop header) at run time. This information tells the compiler which paths are hotter and thereby facilitates additional optimizations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "\n",
+    "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
+    "2. https://perf.wiki.kernel.org/index.php/Tutorial"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 2:<a name=\"task2\"></a> Impact of Prefetching on Performance\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "* Study how program execution time differs between optimization levels, with and without software prefetching.\n",
+    "* Verify the impact by measuring cache counters with and without prefetching.\n",
+    "* Learn how to modify the contents of the DSCR (*Data Stream Control Register*) using the IBM XL compiler and study the impact of different DSCR values.\n",
+    "\n",
+    "But first, let's change to the Task 2 directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task2\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd ../Task2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: Software Prefetching"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Look at the Makefile and work on the TODOs. \n",
+    "\n",
+    "- First generate a `-Ofast`-optimised binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!\n",
+    "- Modify the `Makefile` to add the option for software prefetching (`-fprefetch-loop-arrays`). Compare the performance of `-Ofast` with and without software prefetching."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 97,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rm -f poisson2d poisson2d*.o\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make clean"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 88,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "make: `poisson2d' is up to date.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24911> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "2.39user 0.01system 0:02.40elapsed 100%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24912> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        8271503902      cycles:u                                                    \n",
+      "         481152478      r168a4:u                                                    \n",
+      "\n",
+      "       2.412224884 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 98,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays -fprefetch-loop-arrays poisson2d.c -o poisson2d_pref  -lm\n",
+      "cp poisson2d_pref poisson2d\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24919> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1.92user 0.00system 0:01.93elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24920> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "        6586609284      cycles:u                                                    \n",
+      "         459879452      r168a4:u                                                    \n",
+      "\n",
+      "       1.925399505 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_pref CC=gcc\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: Repeat the experiment with the `-O3` flag. Have a look at the `Makefile` and the outlined TODO. There's a position to easily adapt `-Ofast`\u2192`-O3`!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 100,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   poisson2d.c  -o poisson2d  -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24923> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.73user 0.00system 0:04.73elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+479minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24924> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16445764669      cycles:u                                                    \n",
+      "         645094089      r168a4:u                                                    \n",
+      "\n",
+      "       4.792567763 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 101,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -O3   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24925> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.74user 0.00system 0:04.74elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "0inputs+0outputs (0major+480minor)pagefaults 0swaps\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS perf stat -e cycles,r168a4 ./poisson2d\n",
+      "Job <24926> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "\n",
+      " Performance counter stats for './poisson2d':\n",
+      "\n",
+      "       16239159454      cycles:u                                                    \n",
+      "         631061431      r168a4:u                                                    \n",
+      "\n",
+      "       4.730144897 seconds time elapsed\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_pref CC=gcc -B\n",
+    "!make run\n",
+    "!make l3missstats"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Do you notice how the impact differs between optimization levels? At which optimization level does software prefetching help the most?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Observing the results, we see that software prefetching helps at `-Ofast` (elapsed time drops from 2.40 s to 1.93 s) but not at `-O3` (4.73 s versus 4.74 s). We can use the steps described in the next section to verify that the compiler has not inserted any software prefetch operations at `-O3` at all. This is because in the `-O3` binary the run time is dominated by calls to `__fmax`, which leads the compiler to conclude that any benefit from adding software prefetches would be overshadowed by the cost of `fmax()`.\n",
+    "GCC may add further loop optimizations, such as unrolling, when `-fprefetch-loop-arrays` is specified.\n"
+   ]
+  },
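+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "To put numbers on this comparison, the following minimal sketch computes the cycle speedup and the relative change in the `r168a4` counter from the `perf stat` outputs recorded above. It assumes that `r168a4` counts L3 misses, as the `l3missstats` target name suggests; the names `runs` and `compare` are purely illustrative and not part of the tutorial sources.\n",
+    "\n",
+    "```python\n",
+    "# Counter values copied from the perf stat outputs recorded above.\n",
+    "# Key: (optimization level, built with -fprefetch-loop-arrays?)\n",
+    "# Value: (cycles, r168a4 events; assumed here to be L3 misses)\n",
+    "runs = {\n",
+    "    ('-Ofast', False): (8271503902, 481152478),\n",
+    "    ('-Ofast', True): (6586609284, 459879452),\n",
+    "    ('-O3', False): (16445764669, 645094089),\n",
+    "    ('-O3', True): (16239159454, 631061431),\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def compare(level):\n",
+    "    '''Return (cycle speedup, relative r168a4 reduction) for one level.'''\n",
+    "    base_cycles, base_events = runs[(level, False)]\n",
+    "    pref_cycles, pref_events = runs[(level, True)]\n",
+    "    speedup = base_cycles / pref_cycles\n",
+    "    reduction = 1.0 - pref_events / base_events\n",
+    "    return speedup, reduction\n",
+    "\n",
+    "\n",
+    "for level in ('-Ofast', '-O3'):\n",
+    "    speedup, reduction = compare(level)\n",
+    "    print(f'{level}: cycle speedup {speedup:.2f}x, '\n",
+    "          f'r168a4 events reduced by {reduction:.1%}')\n",
+    "```\n",
+    "\n",
+    "With your own measurements, replace the values in `runs`; a speedup close to 1.0 means the flag made essentially no difference."
+   ]
+  },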
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Analysis of Instructions\n",
+    "\n",
+    "Compiling the `-Ofast` binary with the software prefetching flag causes the compiler to generate `dcb*` instructions, which prefetch memory values to L3."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**TASK**: \n",
+    "Run `$(SC19_SUBMIT_CMD) objdump -lSd` on each binary file (`-O3`, `-Ofast` with prefetch/no prefetch).\n",
+    "Look for instructions beginning with `dcb`.\n",
+    "At which optimization levels does the compiler generate software prefetching instructions?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 114,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -std=c99 -DUSE_DOUBLE -Ofast   -mcpu=power9  -mvsx -maltivec   -fprefetch-loop-arrays poisson2d.c  -o poisson2d_pref  -lm\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=gcc -B poisson2d_pref\n",
+    "!objdump -lSd ./poisson2d_pref > poisson2d.dis"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 116,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "    10000b28:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000b30:\t2c ba 00 7c \tdcbt    0,r23\n",
+      "    10000b38:\t2c b2 00 7c \tdcbt    0,r22\n",
+      "    10000b50:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000b58:\tec b9 00 7c \tdcbtst  0,r23\n",
+      "    10000b80:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000e64:\t2c 92 00 7c \tdcbt    0,r18\n",
+      "    10000e68:\t2c 9a 00 7c \tdcbt    0,r19\n",
+      "    10000e6c:\t2c a2 00 7c \tdcbt    0,r20\n",
+      "    10000e70:\t2c aa 00 7c \tdcbt    0,r21\n",
+      "    10000e7c:\t2c b2 00 7c \tdcbt    0,r22\n",
+      "    10000e80:\t2c d2 00 7c \tdcbt    0,r26\n",
+      "    10000e94:\tec b9 00 7c \tdcbtst  0,r23\n"
+     ]
+    }
+   ],
+   "source": [
+    "!grep dcb poisson2d.dis"
+   ]
+  },
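+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `dcbt` (data cache block touch) and `dcbtst` (data cache block touch for store) lines found by `grep` can also be tallied per mnemonic. The sketch below does this for the lines printed above; the helper name `count_prefetch_ops` is hypothetical, and the parsing assumes the `objdump` layout shown above (address, four instruction bytes, mnemonic, operands).\n",
+    "\n",
+    "```python\n",
+    "from collections import Counter\n",
+    "\n",
+    "# Lines copied from the grep output above (whitespace normalized).\n",
+    "# For a real run, read the text from poisson2d.dis instead.\n",
+    "disassembly = '''\n",
+    "10000b28: 2c d2 00 7c dcbt 0,r26\n",
+    "10000b30: 2c ba 00 7c dcbt 0,r23\n",
+    "10000b38: 2c b2 00 7c dcbt 0,r22\n",
+    "10000b50: 2c d2 00 7c dcbt 0,r26\n",
+    "10000b58: ec b9 00 7c dcbtst 0,r23\n",
+    "10000b80: 2c d2 00 7c dcbt 0,r26\n",
+    "10000e64: 2c 92 00 7c dcbt 0,r18\n",
+    "10000e68: 2c 9a 00 7c dcbt 0,r19\n",
+    "10000e6c: 2c a2 00 7c dcbt 0,r20\n",
+    "10000e70: 2c aa 00 7c dcbt 0,r21\n",
+    "10000e7c: 2c b2 00 7c dcbt 0,r22\n",
+    "10000e80: 2c d2 00 7c dcbt 0,r26\n",
+    "10000e94: ec b9 00 7c dcbtst 0,r23\n",
+    "'''\n",
+    "\n",
+    "\n",
+    "def count_prefetch_ops(text):\n",
+    "    '''Tally mnemonics starting with dcb in objdump-style lines.'''\n",
+    "    counts = Counter()\n",
+    "    for line in text.splitlines():\n",
+    "        tokens = line.split()\n",
+    "        # Assumed layout: address, 4 instruction bytes, mnemonic, operands.\n",
+    "        is_insn = len(tokens) > 5 and tokens[0].endswith(':')\n",
+    "        if is_insn and tokens[5].startswith('dcb'):\n",
+    "            counts[tokens[5]] += 1\n",
+    "    return counts\n",
+    "\n",
+    "\n",
+    "for mnemonic, count in sorted(count_prefetch_ops(disassembly).items()):\n",
+    "    print(mnemonic, count)\n",
+    "```\n",
+    "\n",
+    "Running the same tally on the disassembly of each binary makes it easy to see at a glance which builds contain software prefetch instructions at all."
+   ]
+  },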
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part C: Changing the Value of the DSCR via Compiler Flags\n",
+    "\n",
+    "This task requires the IBM XL compiler. It should already be in your environment.\n",
+    "\n",
+    "\n",
+    "We saw the impact of software prefetching in the previous subsection. \n",
+    "In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance. \n",
+    "In this exercise we shall see some compiler options that can be used to modify the DSCR value, which controls the aggressiveness of prefetching. It can also be used to turn off hardware prefetching. \n",
+    "\n",
+    "The IBM XL compiler has an option `-qprefetch=dscr=<val>` that can be used for this purpose.\n",
+    "Compiling with `-qprefetch=dscr=1` turns off the prefetcher. Other values, such as `-qprefetch=dscr=4` or `-qprefetch=dscr=7`, control the aggressiveness of prefetching.\n",
+    "\n",
+    "For this exercise we use `make CC=xlc_r` to illustrate the performance impact.\n",
+    "    \n",
+    "\n",
+    "**Task** Generate an XL-compiled binary by compiling using the following cells. After you've generated a baseline, start editing the `Makefile`: add `-qprefetch=dscr=1` to the `CFLAGS`, rebuild the application, and note the performance. Which one is faster? \n",
+    "\n",
+    "In general, applications benefit from the default setting of the hardware DSCR (`-qprefetch=dscr=0`). However, certain applications also benefit from having prefetching turned off. \n",
+    "\n",
+    "Note that the best DSCR value is highly application-specific: a value that works well for Application A may not help Application B. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Measure the performance of the application compiled with XL at the default DSCR value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 117,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  poisson2d.c -o poisson2d  -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24927> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "2.26user 0.00system 0:02.27elapsed 99%CPU (0avgtext+0avgdata 24256maxresident)k\n",
+      "256inputs+0outputs (0major+477minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=xlc_r -B poisson2d\n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Measure the performance of the application compiled with XL with the hardware prefetcher turned off (`-qprefetch=dscr=1`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r  -std=c99 -DUSE_DOUBLE -Ofast   -qarch=pwr9 -qtune=pwr9  -DINLINE_LIBS  -qprefetch=dscr=1 poisson2d.c -o poisson2d_dscr  -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS  time ./poisson2d\n",
+      "Job <24929> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "4.58user 0.00system 0:04.59elapsed 99%CPU (0avgtext+0avgdata 24192maxresident)k\n",
+      "0inputs+0outputs (0major+476minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d_dscr CC=xlc_r -B\n",
+    "!make run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Does the hardware prefetcher help this application? How much impact do you see when you turn it off?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "The Data Stream Control Register (DSCR) controls the operation of the hardware prefetcher on POWER9. It can be modified on the command line with `ppc64_cpu --dscr=<value>`, but this requires admin privileges. IBM XL offers a compiler flag to set the value instead: `-qprefetch=dscr=1` turns off the prefetcher. Comparing the results, the run without the hardware prefetcher takes about twice as long (4.59 s vs. 2.27 s elapsed) as the run with default prefetching. We conclude that prefetching helps the Jacobi application."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "\n",
+    "1. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html\n",
+    "2. https://www.gnu.org/software/gcc/projects/prefetch.html\n",
+    "3. https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Task 3: OpenMP\n",
+    "<a name=\"task3\"></a>\n",
+    "\n",
+    "\n",
+    "### Overview\n",
+    "\n",
+    "We add OpenMP shared-memory parallelism to the application. We also study how binding the threads to certain cores affects application performance. We do this for both the GCC and XL compilers in order to learn which options are needed for each.\n",
+    "First, we need to change into the Task3 directory. For Task 3, we modified `poisson2d.c` to also invoke `poisson2d_reference`, an exact copy of the main Jacobi loop. We parallelize only the main loop, not `poisson2d_reference`. The reported speedup is the performance gain of the main loop over the reference loop."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "/autofs/nccsopen-svm1_home/aherten/SC19-Tutorial/3-Optimizing_POWER/Handson/Task3\n"
+     ]
+    }
+   ],
+   "source": [
+    "%cd ../Task3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part A: Implement OpenMP Pragmas; Compilation\n",
+    "\n",
+    "**Task**: Please add the correct OpenMP directives to `poisson2d.c` and the compilation flags to the Makefile to enable OpenMP with the GCC and XL compilers.\n",
+    "\n",
+    "* **Directives**: Look at the TODOs in [`poisson2d.c`](poisson2d.c) to add OpenMP parallelism. The pragma in question is `#pragma omp parallel for` (and, in one place, `#pragma omp parallel for reduction(max:error)` \u2013\u00a0can you guess where?)\n",
+    "* **Compilation**: Please add the compilation flags enabling OpenMP in GCC and XL to the `Makefile`. For GCC, we need to add `-fopenmp`, and the application needs to be linked with `-lgomp`. For XL, we need to add `-qsmp=omp` to the list of compilation flags.\n",
+    "\n",
+    "Afterwards, compile and run the application with the following commands."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "gcc -c -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp   poisson2d_reference.c -o poisson2d_reference.o -lm\n",
+      "gcc -std=c99 -DUSE_DOUBLE -O3 -mcpu=power9  -mvsx -maltivec   -fopenmp -lgomp  poisson2d.c poisson2d_reference.o -o poisson2d  -lm \n"
+     ]
+    }
+   ],
+   "source": [
+    "!make poisson2d CC=gcc"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The command to submit a job to the batch system is prepared in the environment variable `$SC19_SUBMIT_CMD`; use it together with `eval`. The following cell shows how to invoke the application through the batch system."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24951> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 0.248997\n",
+      "  200, 0.248007\n",
+      "  300, 0.247025\n",
+      "  400, 0.246050\n",
+      "  500, 0.245084\n",
+      "  600, 0.244124\n",
+      "  700, 0.243173\n",
+      "  800, 0.242228\n",
+      "  900, 0.241291\n",
+      "1000x1000: Ref:   4.7430 s, This:   3.9363 s, speedup:     1.20\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval $SC19_SUBMIT_CMD ./poisson2d 1000 1000 1000"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In order to run the parallel application, we need to set the number of threads using `OMP_NUM_THREADS`.\n",
+    "What is the best performance you can reach by setting `OMP_NUM_THREADS=N`, with `N` being the number of threads? Feel free to play around with the command in the following cell, which uses 1 thread as an example.  \n",
+    "We added `--bind none` to prevent `jsrun`, the job launcher on Ascent, from applying its own binding options. Also, we use `-c ALL_CPUS` to make all CPUs on the compute node available to you."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   4.7288 s, This:   4.9791 s, speedup:     0.95\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=1 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   4.7125 s, This:   2.4914 s, speedup:     1.89\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=2 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.1065 s, This:   1.3836 s, speedup:     1.52\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3868 s, This:   0.5272 s, speedup:     4.53\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=8 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3912 s, This:   0.4612 s, speedup:     5.18\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=10 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3864 s, This:   0.4037 s, speedup:     5.91\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=20 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3773 s, This:   0.3045 s, speedup:     7.81\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=40 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3819 s, This:   0.3081 s, speedup:     7.73\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=80 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Part B: Bindings\n",
+    "\n",
+    "Different CPU architectures and models come with different configurations of cores. The configuration plays an important role in the run time of the application, so we need to optimize for it!\n",
+    "\n",
+    "There are tools which can be used to determine the configuration of the processor. Among those are:\n",
+    "\n",
+    "* `lscpu`: Can be used to determine the number of sockets, number of cores, and number of threads. It gives a very good overview and is available on most Linux systems.\n",
+    "* `ppc64_cpu --smt`: Specifically for POWER, this tool gives information about the number of simultaneous threads running per core (*SMT*, Simultaneous Multi-Threading).\n",
+    "\n",
+    "Run `ppc64_cpu --smt` to find out about the threading configuration of Ascent!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24465> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "SMT=4\n"
+     ]
+    }
+   ],
+   "source": [
+    "!eval $SC19_SUBMIT_CMD ppc64_cpu --smt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are more sources of information available:\n",
+    "\n",
+    "* `/proc/cpuinfo`: Holds information about virtual cores, including model and clock speed. Available on most Linux systems. Usually read with `cat`.\n",
+    "* `/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`: Lists the hardware threads that share a core with the given CPU (`cpu0` in this case). Use it to find out which thread is mapped to which core."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Job <24949> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "0-3\n",
+      "Job <24950> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "4-7\n"
+     ]
+    }
+   ],
+   "source": [
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list\n",
+    "!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "OpenMP defines environment variables, valid across compilers, to specify the binding of threads to cores. See, for instance, the [`OMP_PLACES` environment variable](https://www.openmp.org/spec-html/5.0/openmpse53.html). In addition, GCC provides its own variable to control affinity: `GOMP_CPU_AFFINITY`. It only applies to binaries built with GCC, but it serves the same purpose as `OMP_PLACES`.\n",
+    "\n",
+    "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
+    "\n",
+    "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n",
+    "\n",
+    "What's your maximum speedup?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
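+    "We run with two different configurations: 1) binding all threads to the same core and 2) binding each thread to a different core. With SMT=4, CPUs 0-3 are the four hardware threads of one core, while CPUs 0, 5, 9, and 13 belong to four different cores. We see a higher speedup when the threads are bound to different cores."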
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Using `OMP_PLACES` for binding, and using some magical Python-Bash interplay:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Affinity: {0},{1},{2},{3}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{1},{2},{3}'\n",
+      "1000x1000: Ref:   4.7315 s, This:   3.9090 s, speedup:     1.21\n",
+      "Affinity: {0},{5},{9},{13}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{5},{9},{13}'\n",
+      "1000x1000: Ref:   4.6485 s, This:   1.2829 s, speedup:     3.62\n"
+     ]
+    }
+   ],
+   "source": [
+    "for affinity in [\"{0},{1},{2},{3}\", \"{0},{5},{9},{13}\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true OMP_PLACES=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000  | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "Next, we carry out the same experiment using `GOMP_CPU_AFFINITY`, which results in the same `OMP_PLACES` setting (see the `OMP_DISPLAY_ENV` output). Again running with two configurations, 1) binding all threads to the same core and 2) binding each thread to a different core, we see a higher speedup when the threads are bound to different cores."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Affinity: 0,1,2,3\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{1},{2},{3}'\n",
+      "1000x1000: Ref:   2.3964 s, This:   2.1361 s, speedup:     1.12\n",
+      "Affinity: 0,5,9,13\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES = '{0},{5},{9},{13}'\n",
+      "1000x1000: Ref:   2.3925 s, This:   0.7030 s, speedup:     3.40\n"
+     ]
+    }
+   ],
+   "source": [
+    "for affinity in [\"0,1,2,3\", \"0,5,9,13\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Great!\n",
+    "\n",
+    "If you still have time: The same experiments can be repeated with the IBM XL compiler. \n",
+    "The compiler flag that enables OpenMP parallelism for XL is `-qsmp=omp`.\n",
+    "\n",
+    "**Task**: Add the OpenMP flag to the Makefile, generate XL binaries with OpenMP, run the application with various numbers of threads, and note the speedup."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "xlc_r -c -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp    poisson2d_reference.c -o poisson2d_reference.o -lm \n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "xlc_r -std=c99 -DUSE_DOUBLE -O3 -qhot -qtune=pwr9  -DINLINE_LIBS -qsmp=omp   poisson2d.c poisson2d_reference.o -o poisson2d -lm\n",
+      "    1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program.  Please refer to documentation on the STRICT/NOSTRICT option for more information.\n",
+      "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS   time ./poisson2d\n",
+      "Job <24956> is submitted to default queue <batch>.\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "Jacobi relaxation calculation: max 1000 iterations on 1000 x 1000 mesh\n",
+      "Calculate reference solution and time with serial CPU execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "Calculate current execution.\n",
+      "    0, 0.249995\n",
+      "  100, 50.149062\n",
+      "  200, 99.849327\n",
+      "  300, 149.352369\n",
+      "  400, 198.659746\n",
+      "  500, 247.773000\n",
+      "  600, 296.693652\n",
+      "  700, 345.423208\n",
+      "  800, 393.963155\n",
+      "  900, 442.314962\n",
+      "1000x1000: Ref:   5.6783 s, This:   2.6528 s, speedup:     2.14\n",
+      "21.56user 6.18system 0:08.37elapsed 331%CPU (0avgtext+0avgdata 23040maxresident)k\n",
+      "3200inputs+0outputs (2major+1098minor)pagefaults 0swaps\n"
+     ]
+    }
+   ],
+   "source": [
+    "!make CC=xlc_r -B run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the parallel application with a varying number of threads (`OMP_NUM_THREADS`) and note the performance improvement."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "exercise": "solution"
+   },
+   "source": [
+    "As with the GCC binary, the speedup grows with the number of threads up to a certain point (40 threads in the runs below), beyond which it no longer improves."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2561 s, This:   2.6432 s, speedup:     0.85\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=1 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3071 s, This:   1.5343 s, speedup:     1.50\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=2 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2617 s, This:   0.6936 s, speedup:     3.26\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2728 s, This:   0.3402 s, speedup:     6.68\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=8 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.1678 s, This:   0.2869 s, speedup:     7.56\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=10 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2813 s, This:   0.1452 s, speedup:    15.71\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=20 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.3284 s, This:   0.0981 s, speedup:    23.75\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=40 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "1000x1000: Ref:   2.2918 s, This:   0.1439 s, speedup:    15.92\n"
+     ]
+    }
+   ],
+   "source": [
+    "!OMP_NUM_THREADS=80 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep speedup "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we repeat the thread-binding exercise for the XL binary. `OMP_PLACES` is a standard OpenMP variable, so it applies to the XL binary as well. `GOMP_CPU_AFFINITY` is specific to GCC binaries and cannot be used to set the binding here.\n",
+    "\n",
+    "**Task**: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.\n",
+    "\n",
+    "Adapt the following command with your configuration \u2013 or follow along accordingly in the non-interactive version of the Notebook.\n",
+    "\n",
+    "We are mixing Python with Bash (`!`) here, so don't get confused: because of this, we need to write `$$` instead of `$` when we want to use Bash environment variables.\n",
+    "\n",
+    "What's your maximum speedup?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "metadata": {
+    "exercise": "solution"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Affinity: {0},{1},{2},{3}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES='{0},{1},{2},{3}' custom\n",
+      "1000x1000: Ref:   5.9792 s, This:   2.4122 s, speedup:     2.48\n",
+      "Affinity: {0},{5},{9},{13}\n",
+      "<<Waiting for dispatch ...>>\n",
+      "<<Starting on login1>>\n",
+      "  OMP_PLACES='{0},{5},{9},{13}' custom\n",
+      "1000x1000: Ref:   2.3101 s, This:   0.6884 s, speedup:     3.36\n"
+     ]
+    }
+   ],
+   "source": [
+    "for affinity in [\"{0},{1},{2},{3}\", \"{0},{5},{9},{13}\"]:\n",
+    "    print(\"Affinity: {}\".format(affinity))\n",
+    "    !eval OMP_DISPLAY_ENV=true OMP_PLACES=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000  | grep \"OMP_PLACES\\|speedup\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Likewise, we see a higher speedup when we bind the threads to different cores rather than to a single core. This hands-on illustrates that, apart from compiler-level tuning, system-level tuning is equally important for obtaining performance improvements.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### References\n",
+    "1. https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html\n",
+    "2. https://www.openmp.org/spec-html/5.0/openmpse53.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Back to Top](#top)\n",
+    "\n",
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Survey<a name=\"survey\"></a>\n",
+    "\n",
+    "Please remember to take some time and fill out the [survey](http://bit.ly/sc19-eval)."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
\ No newline at end of file
diff --git a/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.pdf b/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.pdf
index 4357a6fc0c527eb59248f5962db5e7b0e70630f3..283819d6780fad07fd929bfced380e2ad08bb0c1 100644
Binary files a/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.pdf and b/3-Optimizing_POWER/Handson/Solution-Notebook/HandsOnPerformanceOptimization.pdf differ