diff --git a/4-GPU/HandsOn/.master/HandsOnGPUProgramming_Solution_executed.ipynb b/4-GPU/HandsOn/.master/HandsOnGPUProgramming_Solution_executed.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..9ea3fd00f467e8ba22083aa3ec6d3e86d5c599fd --- /dev/null +++ b/4-GPU/HandsOn/.master/HandsOnGPUProgramming_Solution_executed.ipynb @@ -0,0 +1,2689 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Hands-On GPU Programming\n", + "_Supercomputing 2019 Tutorial \"Application Porting and Optimization on GPU-Accelerated POWER Architectures\", November 18th 2019_\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solutions \n", + "\n", + "**This notebook contains the executed output for the solutions.**\n", + "\n", + "Each solution is described in its own section below. Please navigate to the corresponding directory to find the solution profiles and sources.\n", + "\n", + "\n", + "### GPU Programming\n", + "\n", + "* [Solution 0](#solution0) Accelerate a CPU Jacobi solver with OpenACC, relying on Unified Memory for data movement using `-ta=tesla:managed` \n", + " \n", + "\n", + "* [Solution 1](#solution1) Fix the memory access pattern of the OpenACC-accelerated Jacobi solver \n", + " \n", + "\n", + "### Multi-GPU with MPI\n", + "\n", + "* [Solution 2](#solution2) Use MPI to make the OpenACC-accelerated Jacobi solver scale to multiple GPUs \n", + " \n", + "\n", + "* [Solution 3](#solution3) Hide MPI communication time by overlapping communication and \n", + "\tcomputation in an MPI+OpenACC multi-GPU Jacobi solver \n", + " \n", + " \n", + " \n", + "### Multi-GPU with NVSHMEM *(Advanced -- C only)*\n", + " \n", + "* [Solution 4](#solution4) Use NVSHMEM instead of MPI \n", + " \n", + "\n", + "* [Solution 5](#solution5) Put NVSHMEM calls on a stream to hide API calls and GPU/CPU synchronization \n", + " \n", + "\n", + "### Survey\n", + " \n", + " * Please remember to take the [survey](#survey)!\n", + "\n", + "---\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "Please __select your language choice (C or FORTRAN) below__ by making sure your choice is uncommented and commenting out the other language. Then execute the cell by hitting __`Shift+Enter`__!"
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "You selected C for the exercises.\n" + ] + } + ], + "source": [ + "# select language here\n", + "LANGUAGE='C'\n", + "#LANGUAGE='FORTRAN'\n", + "\n", + "## You should not touch the remaining code in the cell\n", + "import os.path\n", + "import pandas\n", + "\n", + "try: rootdir\n", + "except NameError: rootdir = None\n", + "if not rootdir:\n", + " rootdir=%pwd\n", + "basedir=os.path.join(rootdir,LANGUAGE)\n", + "basedirC=os.path.join(rootdir,'C')\n", + "\n", + "print(\"You selected {} for the exercises.\".format(LANGUAGE))\n", + "\n", + "def checkdir(dir):\n", + " d=%pwd\n", + " assert(d.endswith(dir) or d.endswith(dir+'p') or d.endswith(dir+'m')), \"Please make sure to cd to the right directory first.\"\n", + "\n", + "def cleanall():\n", + " # clean up everything -- use with care\n", + " for t in range(4):\n", + " d='%s/task%i'%(basedir,t)\n", + " %cd $d\n", + " !make clean\n", + " \n", + "#cleanall()" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task0\n" + ] + } + ], + "source": [ + "%cd $basedir/task0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "# Solutions<a name=\"solutions\"></a>\n", + "\n", + "Below are the suggested solutions. Each section gives only a short description of the solution; the `poisson2d.solution.(c|F03)` files linked below contain the full source code. If you want to run / profile the solutions, feel free to duplicate the cells for the tasks and change the [make target](#make) to the `*.solution` ones.\n", + "\n", + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solution 0:<a name=\"solution0\"></a>\n", + "\n", + "```C\n", + "#pragma acc parallel loop\n", + "for (int ix = ix_start; ix < ix_end; ix++)\n", + "{\n", + " #pragma acc loop\n", + " for( int iy = iy_start; iy < iy_end; iy++ )\n", + " {\n", + " Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]\n", + " + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));\n", + " error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));\n", + " }\n", + "}\n", + "```\n", + "\n", + "#### Code\n", + "\n", + "* [C Version](/edit/C/task0/poisson2d.solution.c)\n", + "* [Fortran Version](/edit/FORTRAN/task0/poisson2d.solution.F03)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Compiling, Running and Profiling\n", + "\n", + "You can compile, run and profile the solution with the next cells. __After__ the profiling has finished, the output file `poisson2d.solution.pgprof` can be downloaded from here: [C Version](/tree/C/task0/poisson2d.solution.pgprof?download=1) / [Fortran Version](/tree/FORTRAN/task0/poisson2d.solution.pgprof?download=1).
" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task0\n" + ] + } + ], + "source": [ + "%cd $basedir/task0" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d_serial.c -o poisson2d_serial.o\n", + "pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution\n", + "poisson2d.solution.c:\n", + "main:\n", + " 66, Generating Tesla code\n", + " 67, #pragma acc loop gang /* blockIdx.x */\n", + " 68, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 66, Generating implicit copyout(A[:])\n", + " 68, Loop is parallelizable\n", + " 88, Generating Tesla code\n", + " 89, #pragma acc loop gang /* blockIdx.x */\n", + " 90, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 94, Generating implicit reduction(max:error)\n", + " 88, Generating implicit copyin(A[:],rhs[:])\n", + " Generating implicit copyout(Anew[:])\n", + " 90, Loop is parallelizable\n", + " 98, Generating Tesla code\n", + " 99, #pragma acc loop gang /* blockIdx.x */\n", + " 100, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 98, Generating implicit copyin(Anew[:])\n", + " Generating implicit copyout(A[:])\n", + " 100, Loop is parallelizable\n", + " 106, Generating Tesla code\n", + " 107, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 106, Generating implicit copyin(A[:])\n", + " Generating implicit copyout(A[nx*(ny-1)+1:2046])\n", + " 111, Generating Tesla code\n", + " 112, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 111, Generating implicit copy(A[:])\n" + ] + } + ], + "source": [ + "checkdir('task0')\n", + "!make poisson2d.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d.solution\n", + "Job <25658> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", + "Calculate reference solution and time serial CPU execution.\n", + " 0, 0.249999\n", + " 100, 0.249760\n", + " 200, 0.249522\n", + " 300, 0.249285\n", + " 400, 0.249048\n", + "GPU execution.\n", + " 0, 0.249999\n", + " 100, 0.249760\n", + " 200, 0.249522\n", + " 300, 0.249285\n", + " 400, 0.249048\n", + "2048x2048: 1 CPU: 5.4111 s, 1 GPU: 0.1905 s, speedup: 28.40\n" + ] + } + ], + "source": [ + "checkdir('task0')\n", + "!make run.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof ./poisson2d.solution 10\n", + "Job <25659> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==77763== PGPROF is profiling process 77763, 
command: ./poisson2d.solution 10\n", + "==77763== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof\n", + "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", + "Calculate reference solution and time serial CPU execution.\n", + " 0, 0.249999\n", + "GPU execution.\n", + " 0, 0.249999\n", + "2048x2048: 1 CPU: 0.1194 s, 1 GPU: 0.0179 s, speedup: 6.67\n", + "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof .\n" + ] + } + ], + "source": [ + "checkdir('task0')\n", + "!make profile.solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solution 1:<a name=\"solution1\"></a>\n", + "\n", + "Swap the `ix` and `iy` loops to make sure that `ix` is the fastest-running index. With row-major storage (`A[iy*nx+ix]`), consecutive `ix` values are adjacent in memory, so consecutive threads issue coalesced loads and stores; iterating `iy` fastest instead strides by `nx*sizeof(real)` (16 KiB for `nx = 2048` doubles), which breaks coalescing.\n", + "\n", + "```C\n", + "#pragma acc parallel loop\n", + "for (int iy = iy_start; iy < iy_end; iy++)\n", + "{\n", + " for( int ix = ix_start; ix < ix_end; ix++ )\n", + " {\n", + " Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]\n", + " + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));\n", + " error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));\n", + " }\n", + "}\n", + "```\n", + "\n", + "#### Code\n", + "\n", + "* [C Version](/edit/C/task1/poisson2d.solution.c)\n", + "* [Fortran Version](/edit/FORTRAN/task1/poisson2d.solution.F03)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Compiling, Running and Profiling\n", + "\n", + "You can compile, run and profile the solution with the next cells. __After__ the profiling has finished, the tarball `pgprof.poisson2d.Task1.solution.tar.gz` containing the profiles can be downloaded from here: [C Version](/tree/C/task1/pgprof.poisson2d.Task1.solution.tar.gz?download=1) / [Fortran Version](/tree/FORTRAN/task1/pgprof.poisson2d.Task1.solution.tar.gz?download=1).
" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task1\n" + ] + } + ], + "source": [ + "%cd $basedir/task1" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed,lineinfo poisson2d_serial.c -o poisson2d_serial.o\n", + "pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed,lineinfo poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution\n", + "poisson2d.solution.c:\n", + "main:\n", + " 66, Generating Tesla code\n", + " 67, #pragma acc loop gang /* blockIdx.x */\n", + " 68, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 66, Generating implicit copyout(A[:])\n", + " 68, Loop is parallelizable\n", + " 88, Generating Tesla code\n", + " 89, #pragma acc loop gang /* blockIdx.x */\n", + " 90, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 94, Generating implicit reduction(max:error)\n", + " 88, Generating implicit copyin(A[:],rhs[:])\n", + " Generating implicit copyout(Anew[:])\n", + " 90, Loop is parallelizable\n", + " 98, Generating Tesla code\n", + " 99, #pragma acc loop gang /* blockIdx.x */\n", + " 100, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 98, Generating implicit copyin(Anew[:])\n", + " Generating implicit copyout(A[:])\n", + " 100, Loop is parallelizable\n", + " 106, Generating Tesla code\n", + " 107, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 106, Generating implicit copyin(A[:])\n", + " Generating implicit copyout(A[nx*(ny-1)+1:2046])\n", + " 111, Generating Tesla code\n", + " 112, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 111, Generating implicit copy(A[:])\n" + ] + } + ], + "source": [ + "checkdir('task1')\n", + "!make poisson2d.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d.solution\n", + "Job <25660> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", + "Calculate reference solution and time serial CPU execution.\n", + " 0, 0.249999\n", + " 100, 0.249760\n", + " 200, 0.249522\n", + " 300, 0.249285\n", + " 400, 0.249048\n", + "GPU execution.\n", + " 0, 0.249999\n", + " 100, 0.249760\n", + " 200, 0.249522\n", + " 300, 0.249285\n", + " 400, 0.249048\n", + "2048x2048: 1 CPU: 5.3929 s, 1 GPU: 0.1903 s, speedup: 28.33\n" + ] + } + ], + "source": [ + "checkdir('task1')\n", + "!make run.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof ./poisson2d.solution 3\n", + "Job <25661> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==77997== PGPROF is 
profiling process 77997, command: ./poisson2d.solution 3\n", + "==77997== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof\n", + "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", + "Calculate reference solution and time serial CPU execution.\n", + " 0, 0.249999\n", + "GPU execution.\n", + " 0, 0.249999\n", + "2048x2048: 1 CPU: 0.0437 s, 1 GPU: 0.0164 s, speedup: 2.66\n", + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off --analysis-metrics -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof ./poisson2d.solution 3\n", + "Job <25662> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==79400== PGPROF is profiling process 79400, command: ./poisson2d.solution 3\n", + "==79400== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.\n", + "==79400== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof\n", + "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", + "Calculate reference solution and time serial CPU execution.\n", + " 0, 0.249999\n", + "GPU execution.\n", + " 0, 0.249999\n", + "2048x2048: 1 CPU: 0.0475 s, 1 GPU: 12.3314 s, speedup: 0.00\n", + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off --metrics gld_efficiency,gst_efficiency -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof ./poisson2d.solution 3\n", + "Job <25663> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==78235== PGPROF is profiling process 78235, command: ./poisson2d.solution 3\n", + "==78235== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.\n", + "==78235== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof\n", + "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", + "Calculate reference solution and time serial CPU execution.\n", + " 0, 0.249999\n", + "GPU execution.\n", + " 0, 0.249999\n", + "2048x2048: 1 CPU: 0.0483 s, 1 GPU: 0.6638 s, speedup: 0.07\n", + "pgprof --csv -i /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof 2>&1 | grep -v \"======\" > poisson2d.solution.efficiency.csv\n", + "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.*.pgprof .\n", + "tar -cvzf pgprof.poisson2d.Task1.solution.tar.gz poisson2d.solution.*.pgprof\n", + "poisson2d.solution.efficiency.pgprof\n", + "poisson2d.solution.metrics.pgprof\n", + "poisson2d.solution.timeline.pgprof\n" + ] + } + ], + "source": [ + "checkdir('task1')\n", + "!make profile.solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "For the _Global Memory Load/Store Efficiency_ metrics, the `make profile` command also generated a CSV file that you can import and view with the cell below. \n", + "If you work purely in a terminal, you can view the same output by running `pgprof -i poisson2d.solution.efficiency.pgprof`."
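, + "\n", + "Reading the table: the interior sweep kernels (e.g. `main_88_gpu`) now reach roughly 89-92% load/store efficiency. The boundary kernel `main_111_gpu` stays at 25%, presumably because it walks a column of `A`, so each 32-byte memory sector delivers only a single useful 8-byte double (8/32 = 25%)."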
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>Device</th>\n", + " <th>Kernel</th>\n", + " <th>Invocations</th>\n", + " <th>Metric Name</th>\n", + " <th>Metric Description</th>\n", + " <th>Min</th>\n", + " <th>Max</th>\n", + " <th>Avg</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_98_gpu</td>\n", + " <td>3</td>\n", + " <td>gld_efficiency</td>\n", + " <td>Global Memory Load Efficiency</td>\n", + " <td>90.866222%</td>\n", + " <td>91.051373%</td>\n", + " <td>90.962535%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_98_gpu</td>\n", + " <td>3</td>\n", + " <td>gst_efficiency</td>\n", + " <td>Global Memory Store Efficiency</td>\n", + " <td>88.956522%</td>\n", + " <td>88.956522%</td>\n", + " <td>88.956522%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_106_gpu</td>\n", + " <td>3</td>\n", + " <td>gld_efficiency</td>\n", + " <td>Global Memory Load Efficiency</td>\n", + " <td>94.722222%</td>\n", + " <td>94.722222%</td>\n", + " <td>94.722222%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_106_gpu</td>\n", + " <td>3</td>\n", + " <td>gst_efficiency</td>\n", + " <td>Global Memory Store Efficiency</td>\n", + " <td>88.956522%</td>\n", + " <td>88.956522%</td>\n", + " <td>88.956522%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_94_gpu__red</td>\n", + " <td>3</td>\n", + " <td>gld_efficiency</td>\n", + " <td>Global Memory Load Efficiency</td>\n", + " <td>99.756335%</td>\n", + " <td>99.756335%</td>\n", + " <td>99.756335%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_94_gpu__red</td>\n", + " <td>3</td>\n", + " <td>gst_efficiency</td>\n", + " <td>Global Memory Store Efficiency</td>\n", + " <td>25.000000%</td>\n", + " <td>25.000000%</td>\n", + " <td>25.000000%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_66_gpu</td>\n", + " <td>1</td>\n", + " <td>gld_efficiency</td>\n", + " <td>Global Memory Load Efficiency</td>\n", + " <td>0.000000%</td>\n", + " <td>0.000000%</td>\n", + " <td>0.000000%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_66_gpu</td>\n", + " <td>1</td>\n", + " <td>gst_efficiency</td>\n", + " <td>Global Memory Store Efficiency</td>\n", + " <td>100.000000%</td>\n", + " <td>100.000000%</td>\n", + " <td>100.000000%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>8</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_88_gpu</td>\n", + " <td>3</td>\n", + " <td>gld_efficiency</td>\n", + " <td>Global Memory Load Efficiency</td>\n", + " <td>91.850475%</td>\n", + " <td>91.857005%</td>\n", + " <td>91.854824%</td>\n", + " </tr>\n", + " <tr>\n", + " 
<th>9</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_88_gpu</td>\n", + " <td>3</td>\n", + " <td>gst_efficiency</td>\n", + " <td>Global Memory Store Efficiency</td>\n", + " <td>88.845486%</td>\n", + " <td>88.845486%</td>\n", + " <td>88.845486%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>10</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_111_gpu</td>\n", + " <td>3</td>\n", + " <td>gld_efficiency</td>\n", + " <td>Global Memory Load Efficiency</td>\n", + " <td>25.000000%</td>\n", + " <td>25.000000%</td>\n", + " <td>25.000000%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>11</th>\n", + " <td>Tesla V100-SXM2-16GB (0)</td>\n", + " <td>main_111_gpu</td>\n", + " <td>3</td>\n", + " <td>gst_efficiency</td>\n", + " <td>Global Memory Store Efficiency</td>\n", + " <td>25.000000%</td>\n", + " <td>25.000000%</td>\n", + " <td>25.000000%</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " Device Kernel Invocations Metric Name \\\n", + "0 Tesla V100-SXM2-16GB (0) main_98_gpu 3 gld_efficiency \n", + "1 Tesla V100-SXM2-16GB (0) main_98_gpu 3 gst_efficiency \n", + "2 Tesla V100-SXM2-16GB (0) main_106_gpu 3 gld_efficiency \n", + "3 Tesla V100-SXM2-16GB (0) main_106_gpu 3 gst_efficiency \n", + "4 Tesla V100-SXM2-16GB (0) main_94_gpu__red 3 gld_efficiency \n", + "5 Tesla V100-SXM2-16GB (0) main_94_gpu__red 3 gst_efficiency \n", + "6 Tesla V100-SXM2-16GB (0) main_66_gpu 1 gld_efficiency \n", + "7 Tesla V100-SXM2-16GB (0) main_66_gpu 1 gst_efficiency \n", + "8 Tesla V100-SXM2-16GB (0) main_88_gpu 3 gld_efficiency \n", + "9 Tesla V100-SXM2-16GB (0) main_88_gpu 3 gst_efficiency \n", + "10 Tesla V100-SXM2-16GB (0) main_111_gpu 3 gld_efficiency \n", + "11 Tesla V100-SXM2-16GB (0) main_111_gpu 3 gst_efficiency \n", + "\n", + " Metric Description Min Max Avg \n", + "0 Global Memory Load Efficiency 90.866222% 91.051373% 90.962535% \n", + "1 Global Memory Store Efficiency 88.956522% 88.956522% 88.956522% \n", + "2 Global Memory Load Efficiency 94.722222% 94.722222% 94.722222% \n", + "3 Global Memory Store Efficiency 88.956522% 88.956522% 88.956522% \n", + "4 Global Memory Load Efficiency 99.756335% 99.756335% 99.756335% \n", + "5 Global Memory Store Efficiency 25.000000% 25.000000% 25.000000% \n", + "6 Global Memory Load Efficiency 0.000000% 0.000000% 0.000000% \n", + "7 Global Memory Store Efficiency 100.000000% 100.000000% 100.000000% \n", + "8 Global Memory Load Efficiency 91.850475% 91.857005% 91.854824% \n", + "9 Global Memory Store Efficiency 88.845486% 88.845486% 88.845486% \n", + "10 Global Memory Load Efficiency 25.000000% 25.000000% 25.000000% \n", + "11 Global Memory Store Efficiency 25.000000% 25.000000% 25.000000% " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_frame_solution = pandas.read_csv('poisson2d.solution.efficiency.csv', sep=',')\n", + "data_frame_solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solution 2:<a name=\"solution2\"></a>\n", + "\n", + "Set the GPU used by the rank using `#pragma acc set device_num`\n", + "```C\n", + "//Initialize MPI and determine rank and size\n", + "MPI_Init(&argc, &argv);\n", + "MPI_Comm_rank(MPI_COMM_WORLD, &rank);\n", + "MPI_Comm_size(MPI_COMM_WORLD, &size);\n", + "\n", + "#pragma acc set device_num( rank )\n", + "\n", + 
"real* restrict const A = (real*) malloc(nx*ny*sizeof(real));\n", + "real* restrict const Aref = (real*) malloc(nx*ny*sizeof(real));\n", + "real* restrict const Anew = (real*) malloc(nx*ny*sizeof(real));\n", + "real* restrict const rhs = (real*) malloc(nx*ny*sizeof(real));\n", + "```\n", + "\n", + "\n", + "Apply domain decomposition\n", + "```C\n", + "// Ensure correctness if ny%size != 0\n", + "int chunk_size = ceil( (1.0*ny)/size );\n", + "\n", + "int iy_start = rank * chunk_size;\n", + "int iy_end = iy_start + chunk_size;\n", + "\n", + "// Do not process boundaries\n", + "iy_start = max( iy_start, 1 );\n", + "iy_end = min( iy_end, ny - 1 );\n", + "```\n", + "\n", + "Exchange data\n", + "```C\n", + "//Periodic boundary conditions\n", + "int top = (rank == 0) ? (size-1) : rank-1;\n", + "int bottom = (rank == (size-1)) ? 0 : rank+1;\n", + "#pragma acc host_data use_device( A )\n", + "{\n", + " double start_mpi = MPI_Wtime();\n", + " //1. Sent row iy_start (first modified row) to top receive lower boundary (iy_end) from bottom\n", + " MPI_Sendrecv( A+iy_start*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, top , 0,\n", + " A+iy_end*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,\n", + " MPI_COMM_WORLD, MPI_STATUS_IGNORE );\n", + "\n", + " //2. Sent row (iy_end-1) (last modified row) to bottom receive upper boundary (iy_start-1) from top\n", + " MPI_Sendrecv( A+(iy_end-1)*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,\n", + " A+(iy_start-1)*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, top , 0,\n", + " MPI_COMM_WORLD, MPI_STATUS_IGNORE );\n", + " mpi_time += MPI_Wtime() - start_mpi;\n", + "}\n", + "```\n", + "\n", + "#### Code\n", + "\n", + "* [C Version](/edit/C/task2/poisson2d.solution.c)\n", + "* [Fortran Version](/edit/FORTRAN/task2/poisson2d.solution.F03)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Compiling, Running and Profiling\n", + "\n", + "You can compile, run and profile the solution with the next cells. You can profile the code by executing the next cell. __After__ the profiling completed download the tarball containing the profiles (`pgprof.Task2.solution.poisson2d.tar.gz`) with the File Browser. \n", + "Then you can import them into pgprof / nvvp using the _Import_ option in the _File_ menu. Remember to use the _Multiple processes_ option in the assistant. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task2\n" + ] + } + ], + "source": [ + "%cd $basedir/task2" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mpicc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned poisson2d_serial.c -o poisson2d_serial.o\n", + "poisson2d_serial:\n", + " 36, Generating present(Anew[:],rhs[:],Aref[:])\n", + " 39, Generating update device(rhs[:ny*nx],Aref[:ny*nx])\n", + " 42, Generating Tesla code\n", + " 43, #pragma acc loop gang /* blockIdx.x */\n", + " 44, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 49, Generating implicit reduction(max:error)\n", + " 44, Loop is parallelizable\n", + " 53, Generating Tesla code\n", + " 54, #pragma acc loop gang /* blockIdx.x */\n", + " 55, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 55, Loop is parallelizable\n", + " 61, Generating Tesla code\n", + " 62, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 66, Generating Tesla code\n", + " 67, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 78, Generating update self(Aref[:ny*nx])\n", + "mpicc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution\n", + "poisson2d.solution.c:\n", + "main:\n", + " 71, Generating enter data create(Aref[:ny*nx],rhs[:ny*nx],A[:ny*nx],Anew[:ny*nx])\n", + " 87, Generating present(Aref[:],A[:])\n", + " Generating Tesla code\n", + " 88, #pragma acc loop gang /* blockIdx.x */\n", + " 89, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 89, Loop is parallelizable\n", + " 140, Generating update device(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)],rhs[nx*iy_start:nx*(iy_end-iy_start)])\n", + " 143, Generating present(A[:],rhs[:],Anew[:])\n", + " Generating Tesla code\n", + " 144, #pragma acc loop gang /* blockIdx.x */\n", + " 145, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 149, Generating implicit reduction(max:error)\n", + " 145, Loop is parallelizable\n", + " 157, Generating present(Anew[:],A[:])\n", + " Generating Tesla code\n", + " 158, #pragma acc loop gang /* blockIdx.x */\n", + " 159, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 159, Loop is parallelizable\n", + " 184, Generating present(A[:])\n", + " Generating Tesla code\n", + " 185, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 195, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 213, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" + ] + } + ], + "source": [ + "checkdir('task2')\n", + "!make poisson2d.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", + "Job <25664> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + " 
100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Parallel execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 1.3190 s, 2 GPUs: 0.7096 s, speedup: 1.86, efficiency: 92.94%\n", + "MPI time: 0.0424 s, inter GPU BW: 2.88 GiB/s\n" + ] + } + ], + "source": [ + "checkdir('task2')\n", + "!NP=2 make run.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", + "Job <25665> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==78468== PGPROF is profiling process 78468, command: ./poisson2d.solution 10\n", + "==78469== PGPROF is profiling process 78469, command: ./poisson2d.solution 10\n", + "==78469== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.1.pgprof\n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + "Parallel execution.\n", + " 0, 0.250000\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 0.0226 s, 2 GPUs: 0.0129 s, speedup: 1.75, efficiency: 87.45%\n", + "MPI time: 0.0007 s, inter GPU BW: 1.70 GiB/s\n", + "==78468== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.0.pgprof\n", + "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.?.pgprof .\n", + "tar -cvzf pgprof.poisson2d.Task2.solution.tar.gz poisson2d.solution.Task2.NP2.?.pgprof\n", + "poisson2d.solution.Task2.NP2.0.pgprof\n", + "poisson2d.solution.Task2.NP2.1.pgprof\n" + ] + } + ], + "source": [ + "checkdir('task2')\n", + "!NP=2 make profile.solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Scaling\n", + "\n", + "You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell." 
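, + "\n", + "As a sanity check, the printed numbers follow from speedup = T(1 GPU) / T(N GPUs) and efficiency = speedup / N: for the 2-GPU run above, 1.3190 s / 0.7096 s ≈ 1.86 and 1.86 / 2 ≈ 0.93, i.e. the reported 92.94% efficiency."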
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n" + ] + }, + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>GPUs</th>\n", + " <th>time [s]</th>\n", + " <th>speedup</th>\n", + " <th>efficiency</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>1</td>\n", + " <td>1.4053</td>\n", + " <td>0.93,</td>\n", + " <td>93.06%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>2</td>\n", + " <td>0.7154</td>\n", + " <td>1.83,</td>\n", + " <td>91.56%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>4</td>\n", + " <td>0.4211</td>\n", + " <td>3.13,</td>\n", + " <td>78.21%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>6</td>\n", + " <td>0.3121</td>\n", + " <td>4.20,</td>\n", + " <td>70.05%</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " GPUs time [s] speedup efficiency\n", + "0 1 1.4053 0.93, 93.06%\n", + "1 2 0.7154 1.83, 91.56%\n", + "2 4 0.4211 3.13, 78.21%\n", + "3 6 0.3121 4.20, 70.05%" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "checkdir('task2')\n", + "!NP=1 make run.solution | grep speedup > scale.out\n", + "!NP=2 make run.solution | grep speedup >> scale.out\n", + "!NP=4 make run.solution | grep speedup >> scale.out\n", + "!NP=6 make run.solution | grep speedup >> scale.out\n", + "data_frameS2 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)\n", + "\n", + "!rm scale.out\n", + "\n", + "data_frameS2b=data_frameS2.iloc[:,[5,7,10,12]].copy()\n", + "data_frameS2b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solution 3:<a name=\"solution3\"></a>\n", + "\n", + "\n", + "Update the boundaries first.\n", + "```C\n", + "#pragma acc parallel loop present(A,Anew)\n", + "for( int ix = ix_start; ix < ix_end; ix++ )\n", + "{\n", + " A[(iy_start)*nx+ix] = Anew[(iy_start)*nx+ix];\n", + " A[(iy_end-1)*nx+ix] = Anew[(iy_end-1)*nx+ix];\n", + "}\n", + "```\n", + "\n", + "Start the interior loop asynchronously so it can overlap with the MPI communication and wait at the end for the completion.\n", + "```C\n", + "#pragma acc parallel loop present(A,Anew) async\n", + "for (int iy = iy_start+1; iy < iy_end-1; iy++)\n", + "{\n", + " for( int ix = ix_start; ix < ix_end; ix++ )\n", + " {\n", + " A[iy*nx+ix] = Anew[iy*nx+ix];\n", + " }\n", + "}\n", + "\n", + "//Periodic boundary conditions\n", + "int top = (rank == 0) ? 
(size-1) : rank-1;\n", + "int bottom = (rank == (size-1)) ? 0 : rank+1;\n", + "#pragma acc host_data use_device( A )\n", + "{\n", + " double start_mpi = MPI_Wtime();\n", + " //1. Send row iy_start (first modified row) to top, receive lower boundary (iy_end) from bottom\n", + " MPI_Sendrecv( A+iy_start*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, top , 0,\n", + " A+iy_end*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,\n", + " MPI_COMM_WORLD, MPI_STATUS_IGNORE );\n", + "\n", + " //2. Send row (iy_end-1) (last modified row) to bottom, receive upper boundary (iy_start-1) from top\n", + " MPI_Sendrecv( A+(iy_end-1)*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,\n", + " A+(iy_start-1)*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, top , 0,\n", + " MPI_COMM_WORLD, MPI_STATUS_IGNORE );\n", + " mpi_time += MPI_Wtime() - start_mpi;\n", + "}\n", + "#pragma acc wait\n", + "```\n", + "\n", + "\n", + "\n", + "#### Code\n", + "\n", + "* [C Version](/edit/C/task3/poisson2d.solution.c)\n", + "* [Fortran Version](/edit/FORTRAN/task3/poisson2d.solution.F03)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Compiling, Running and Profiling\n", + "\n", + "You can compile, run and profile the solution with the next cells. __After__ the profiling has completed, download the tarball containing the profiles (`pgprof.poisson2d.Task3.solution.tar.gz`) with the File Browser. \n", + "Then you can import them into pgprof / nvvp using the _Import_ option in the _File_ menu. Remember to use the _Multiple processes_ option in the assistant. " + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task3\n" + ] + } + ], + "source": [ + "%cd $basedir/task3" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mpicc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned poisson2d_serial.c -o poisson2d_serial.o\n", + "poisson2d_serial:\n", + " 36, Generating present(Anew[:],rhs[:],Aref[:])\n", + " 39, Generating update device(rhs[:ny*nx],Aref[:ny*nx])\n", + " 42, Generating Tesla code\n", + " 43, #pragma acc loop gang /* blockIdx.x */\n", + " 44, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 49, Generating implicit reduction(max:error)\n", + " 44, Loop is parallelizable\n", + " 53, Generating Tesla code\n", + " 54, #pragma acc loop gang /* blockIdx.x */\n", + " 55, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 55, Loop is parallelizable\n", + " 61, Generating Tesla code\n", + " 62, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 66, Generating Tesla code\n", + " 67, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 78, Generating update self(Aref[:ny*nx])\n", + "mpicc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution\n", + "poisson2d.solution.c:\n", + "main:\n", + " 71, Generating enter data create(rhs[:ny*nx],Aref[:ny*nx],A[:ny*nx],Anew[:ny*nx])\n", + " 87, Generating present(Aref[:],A[:])\n", + " Generating Tesla code\n", + " 88, #pragma acc loop gang /* blockIdx.x */\n", + " 89, #pragma acc loop vector(128) /* 
threadIdx.x */\n", + " 89, Loop is parallelizable\n", + " 140, Generating update device(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)],rhs[nx*iy_start:nx*(iy_end-iy_start)])\n", + " 143, Generating present(A[:],rhs[:],Anew[:])\n", + " Generating Tesla code\n", + " 144, #pragma acc loop gang /* blockIdx.x */\n", + " 145, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 149, Generating implicit reduction(max:error)\n", + " 145, Loop is parallelizable\n", + " 157, Generating present(Anew[:],A[:])\n", + " Generating Tesla code\n", + " 158, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 163, Generating present(Anew[:],A[:])\n", + " Generating Tesla code\n", + " 164, #pragma acc loop gang /* blockIdx.x */\n", + " 165, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 165, Loop is parallelizable\n", + " 191, Generating present(A[:])\n", + " Generating Tesla code\n", + " 192, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 202, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 220, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" + ] + } + ], + "source": [ + "checkdir('task3')\n", + "!make poisson2d.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", + "Job <25670> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Parallel execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 1.3172 s, 2 GPUs: 0.6964 s, speedup: 1.89, efficiency: 94.57%\n", + "MPI time: 0.0561 s, inter GPU BW: 2.17 GiB/s\n" + ] + } + ], + "source": [ + "checkdir('task3')\n", + "!NP=2 make run.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", + "Job <25671> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==79190== PGPROF is profiling process 79190, command: ./poisson2d.solution 10\n", + "==79192== PGPROF is profiling process 79192, command: ./poisson2d.solution 10\n", + "==79192== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.1.pgprof\n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", 
+ "Parallel execution.\n", + " 0, 0.250000\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 0.0301 s, 2 GPUs: 0.0126 s, speedup: 2.39, efficiency: 119.53%\n", + "MPI time: 0.0009 s, inter GPU BW: 1.34 GiB/s\n", + "==79190== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.0.pgprof\n", + "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.?.pgprof .\n", + "tar -cvzf pgprof.poisson2d.Task3.solution.tar.gz poisson2d.solution.Task3.NP2.?.pgprof\n", + "poisson2d.solution.Task3.NP2.0.pgprof\n", + "poisson2d.solution.Task3.NP2.1.pgprof\n" + ] + } + ], + "source": [ + "checkdir('task3')\n", + "!NP=2 make profile.solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Scaling\n", + "\n", + "You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n" + ] + }, + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>GPUs</th>\n", + " <th>time [s]</th>\n", + " <th>speedup</th>\n", + " <th>efficiency</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>1</td>\n", + " <td>1.3815</td>\n", + " <td>0.95,</td>\n", + " <td>94.79%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>2</td>\n", + " <td>0.6968</td>\n", + " <td>1.90,</td>\n", + " <td>94.91%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>4</td>\n", + " <td>0.3990</td>\n", + " <td>3.30,</td>\n", + " <td>82.56%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>6</td>\n", + " <td>0.2720</td>\n", + " <td>4.81,</td>\n", + " <td>80.18%</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " GPUs time [s] speedup efficiency\n", + "0 1 1.3815 0.95, 94.79%\n", + "1 2 0.6968 1.90, 94.91%\n", + "2 4 0.3990 3.30, 82.56%\n", + "3 6 0.2720 4.81, 80.18%" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "checkdir('task3')\n", + "!NP=1 make run.solution | grep speedup > scale.out\n", + "!NP=2 make run.solution | grep speedup >> scale.out\n", + "!NP=4 make run.solution | grep speedup >> scale.out\n", + "!NP=6 make run.solution | grep speedup >> scale.out\n", + "data_frameS3 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)\n", + "\n", + "!rm scale.out\n", + "\n", + "data_frameS3b=data_frameS3.iloc[:,[5,7,10,12]].copy()\n", + "data_frameS3b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "The overlap of compute and communication can be seen in the profiler, e.g. 
as shown below.\n", + "\n", + "\n", + "\n", + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solution 4:<a name=\"solution4\"></a>\n", + "\n", + "\n", + "First, include the NVSHMEM headers\n", + "\n", + "```C\n", + "#include <nvshmem.h>\n", + "#include <nvshmemx.h>\n", + "```\n", + "\n", + "and initialize NVSHMEM with MPI\n", + "```C\n", + "MPI_Comm mpi_comm = MPI_COMM_WORLD;\n", + "nvshmemx_init_attr_t attr;\n", + "attr.mpi_comm = &mpi_comm;\n", + "nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);\n", + "```\n", + " \n", + "Allocate device memory and map it to the host allocation for OpenACC (a sketch of the `map()` helper follows the profiling notes below)\n", + "```C\n", + "real *d_A = (real *)nvshmem_malloc(nx * ny * sizeof(real));\n", + "map(A, d_A, nx * ny * sizeof(real));\n", + "```\n", + "\n", + "Calculate the right locations on the remote GPUs and communicate the data\n", + "```C\n", + "// Periodic boundary conditions\n", + "int top = (rank == 0) ? (size - 1) : rank - 1;\n", + "int bottom = (rank == (size - 1)) ? 0 : rank + 1;\n", + "int iy_start_top = top * chunk_size;\n", + "int iy_end_top = iy_start_top + chunk_size;\n", + "\n", + "// Do not process boundaries\n", + "iy_start_top = max(iy_start_top, 1);\n", + "iy_end_top = min(iy_end_top, ny - 1);\n", + "\n", + "int iy_start_bottom = bottom * chunk_size;\n", + "int iy_end_bottom = iy_start_bottom + chunk_size;\n", + "\n", + "// Do not process boundaries\n", + "iy_start_bottom = max(iy_start_bottom, 1);\n", + "iy_end_bottom = min(iy_end_bottom, ny - 1);\n", + "\n", + "// Halo exchange\n", + "#pragma acc host_data use_device(A)\n", + "{\n", + " double start_mpi = MPI_Wtime();\n", + " nvshmem_double_put((double *)(A + iy_end_top * nx + ix_start),\n", + " (double *)(A + iy_start * nx + ix_start), (ix_end - ix_start), top);\n", + " nvshmem_double_put((double *)(A + (iy_start_bottom - 1) * nx + ix_start),\n", + " (double *)(A + (iy_end - 1) * nx + ix_start), (ix_end - ix_start),\n", + " bottom);\n", + " nvshmem_barrier_all();\n", + " mpi_time += MPI_Wtime() - start_mpi;\n", + "}\n", + "```\n", + "\n", + "Finally, remember to deallocate:\n", + "```C\n", + "nvshmem_free(d_A);\n", + "```\n", + "\n", + "#### Code\n", + "\n", + "* [C Version](/edit/C/task4/poisson2d.solution.c)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Compiling, Running and Profiling\n", + "\n", + "You can compile, run and profile the solution with the next cells. __After__ the profiling has completed, download the tarball containing the profiles (`pgprof.poisson2d.Task4.solution.tar.gz`) with the File Browser. \n", + "Then you can import them into pgprof / nvvp using the _Import_ option in the _File_ menu. Remember to use the _Multiple processes_ option in the assistant. 
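\n", + "\n", + "The `map()` helper is defined in the solution source; a minimal sketch, assuming it simply registers the NVSHMEM device buffer with the OpenACC runtime via `acc_map_data` so that `present(A)` lookups resolve to `d_A`, could look like this:\n", + "```C\n", + "#include <openacc.h>\n", + "#include <stddef.h>\n", + "\n", + "/* Associate the device allocation with the host pointer. After this,\n", + " OpenACC treats [harr, harr + bytes) as already present on the device. */\n", + "void map(void *harr, void *darr, size_t bytes) { acc_map_data(harr, darr, bytes); }\n", + "\n", + "/* Before nvshmem_free(d_A), the mapping would be released again with\n", + " acc_unmap_data(A); */\n", + "```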
" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task4\n" + ] + } + ], + "source": [ + "%cd $basedir/task4" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mpicxx -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned poisson2d_serial.c -o poisson2d_serial.o\n", + "poisson2d_serial(int, int, double, double *, double *, int, int, const double *):\n", + " 37, Generating present(Anew[:],rhs[:],Aref[:])\n", + " 39, Generating update device(rhs[:ny*nx],Aref[:ny*nx])\n", + " 40, Generating Tesla code\n", + " 43, #pragma acc loop gang /* blockIdx.x */\n", + " 44, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 49, Generating implicit reduction(max:error)\n", + " 44, Loop is parallelizable\n", + " 51, Generating Tesla code\n", + " 54, #pragma acc loop gang /* blockIdx.x */\n", + " 55, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 55, Loop is parallelizable\n", + " 58, Generating Tesla code\n", + " 62, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 65, Generating Tesla code\n", + " 67, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 77, Generating update self(Aref[:ny*nx])\n", + "mpicxx -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned -I/gpfs/wolf/trn003/world-shared/software/nvshmem/include poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution -L/gpfs/wolf/trn003/world-shared/software/nvshmem/lib -lnvshmem -Mcuda -lcuda -lrt \n", + "poisson2d.solution.c:\n", + "main:\n", + " 90, Generating enter data create(Aref[:ny*nx],rhs[:ny*nx],A[:ny*nx],Anew[:ny*nx])\n", + " 101, Generating present(Aref[:],A[:])\n", + " Generating Tesla code\n", + " 105, #pragma acc loop gang /* blockIdx.x */\n", + " 106, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 106, Loop is parallelizable\n", + " 162, Generating update device(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)],rhs[nx*iy_start:nx*(iy_end-iy_start)])\n", + " 163, Generating present(A[:],rhs[:],Anew[:])\n", + " Generating Tesla code\n", + " 166, #pragma acc loop gang /* blockIdx.x */\n", + " 167, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 171, Generating implicit reduction(max:error)\n", + " 167, Loop is parallelizable\n", + " 177, Generating present(Anew[:],A[:])\n", + " Generating Tesla code\n", + " 180, #pragma acc loop gang /* blockIdx.x */\n", + " 181, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 181, Loop is parallelizable\n", + " 214, Generating present(A[:])\n", + " Generating Tesla code\n", + " 217, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 227, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 246, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" + ] + } + ], + "source": [ + "checkdir('task4')\n", + "!make poisson2d.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", + "Job <25676> is submitted to default queue 
<batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Parallel execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 1.3188 s, 2 GPUs: 0.7398 s, speedup: 1.78, efficiency: 89.13%\n", + "MPI time: 0.0644 s, inter GPU BW: 1.90 GiB/s\n" + ] + } + ], + "source": [ + "checkdir('task4')\n", + "!NP=2 make run.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", + "Job <25677> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==79915== PGPROF is profiling process 79915, command: ./poisson2d.solution 10\n", + "==79914== PGPROF is profiling process 79914, command: ./poisson2d.solution 10\n", + "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + "Parallel execution.\n", + " 0, 0.250000\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 0.0226 s, 2 GPUs: 0.0131 s, speedup: 1.72, efficiency: 86.13%\n", + "MPI time: 0.0010 s, inter GPU BW: 1.27 GiB/s\n", + "==79915== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.0.pgprof\n", + "==79914== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.1.pgprof\n", + "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.?.pgprof .\n", + "tar -cvzf pgprof.poisson2d.Task4.solution.tar.gz poisson2d.solution.Task4.NP2.?.pgprof\n", + "poisson2d.solution.Task4.NP2.0.pgprof\n", + "poisson2d.solution.Task4.NP2.1.pgprof\n" + ] + } + ], + "source": [ + "checkdir('task4')\n", + "!NP=2 make profile.solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Scaling\n", + "\n", + "You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n" + ] + }, + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>GPUs</th>\n", + " <th>time [s]</th>\n", + " <th>speedup</th>\n", + " <th>efficiency</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>1</td>\n", + " <td>1.3714</td>\n", + " <td>0.96,</td>\n", + " <td>95.91%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>2</td>\n", + " <td>0.7460</td>\n", + " <td>1.76,</td>\n", + " <td>88.19%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>4</td>\n", + " <td>0.4706</td>\n", + " <td>2.80,</td>\n", + " <td>70.05%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>6</td>\n", + " <td>0.3308</td>\n", + " <td>3.91,</td>\n", + " <td>65.18%</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " GPUs time [s] speedup efficiency\n", + "0 1 1.3714 0.96, 95.91%\n", + "1 2 0.7460 1.76, 88.19%\n", + "2 4 0.4706 2.80, 70.05%\n", + "3 6 0.3308 3.91, 65.18%" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "checkdir('task4')\n", + "!NP=1 make run.solution | grep speedup > scale.out\n", + "!NP=2 make run.solution | grep speedup >> scale.out\n", + "!NP=4 make run.solution | grep speedup >> scale.out\n", + "!NP=6 make run.solution | grep speedup >> scale.out\n", + "data_frameS4 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)\n", + "\n", + "!rm scale.out\n", + "\n", + "data_frameS4b=data_frameS4.iloc[:,[5,7,10,12]].copy()\n", + "data_frameS4b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "The communication using NVSHMEM and the barrier executed as a kernel on the device can be seen in the profiler, e.g. as shown below.\n", + "\n", + "\n", + "\n", + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solution 5:<a name=\"solution5\"></a>\n", + "\n", + "Basically all kernels in the `while` loop can use the async keyword. Please take a look in the solution source code. 
They will all use the OpenACC default async queue.\n", + "\n", + "To also place the halo exchange in the queue use:\n", + "```C\n", + "#pragma acc host_data use_device(A)\n", + "{\n", + " nvshmemx_double_put_on_stream(\n", + " (double *)(A + iy_end_top * nx + ix_start),\n", + " (double *)(A + iy_start * nx + ix_start), (ix_end - ix_start), top,\n", + " (cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));\n", + " nvshmemx_double_put_on_stream(\n", + " (double *)(A + (iy_start_bottom - 1) * nx + ix_start),\n", + " (double *)(A + (iy_end - 1) * nx + ix_start), (ix_end - ix_start), bottom,\n", + " (cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));\n", + "}\n", + "nvshmemx_barrier_all_on_stream((cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));\n", + "```\n", + "\n", + "Finally when copying out data make sure to wait on all device computation first:\n", + "```C\n", + "#pragma acc update self(A [(iy_start - 1) * nx:((iy_end - iy_start) + 2) * nx]) wait\n", + "```\n", + "\n", + "#### Code\n", + "\n", + "* [C Version](/edit/C/task5/poisson2d.solution.c)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Compiling, Running and Profiling\n", + "\n", + "You can compile, run and profile the solution with the next cells. You can profile the code by executing the next cell. __After__ the profiling completed download the tarball containing the profiles (`pgprof.Task5.solution.poisson2d.tar.gz`) with the File Browser. \n", + "Then you can import them into pgprof / nvvp using the _Import_ option in the _File_ menu. Remember to use the _Multiple processes_ option in the assistant. " + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task5\n" + ] + } + ], + "source": [ + "%cd $basedir/task5" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mpicxx -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned poisson2d_serial.c -o poisson2d_serial.o\n", + "poisson2d_serial(int, int, double, double *, double *, int, int, const double *):\n", + " 37, Generating present(Anew[:],rhs[:],Aref[:])\n", + " 39, Generating update device(rhs[:ny*nx],Aref[:ny*nx])\n", + " 40, Generating Tesla code\n", + " 43, #pragma acc loop gang /* blockIdx.x */\n", + " 44, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 49, Generating implicit reduction(max:error)\n", + " 44, Loop is parallelizable\n", + " 51, Generating Tesla code\n", + " 54, #pragma acc loop gang /* blockIdx.x */\n", + " 55, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 55, Loop is parallelizable\n", + " 58, Generating Tesla code\n", + " 62, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 65, Generating Tesla code\n", + " 67, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 77, Generating update self(Aref[:ny*nx])\n", + "mpicxx -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned -I/ccsopen/home/mathiasw/nvshmem-master/build/include poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution -L/ccsopen/home/mathiasw/nvshmem-master/build/lib -lnvshmem -Mcuda -lcuda -lrt \n", + "poisson2d.solution.c:\n", + "main:\n", + " 90, Generating enter data 
create(Aref[:ny*nx],rhs[:ny*nx],A[:ny*nx],Anew[:ny*nx])\n", + " 101, Generating present(Aref[:],A[:])\n", + " Generating Tesla code\n", + " 105, #pragma acc loop gang /* blockIdx.x */\n", + " 106, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 106, Loop is parallelizable\n", + " 137, Generating update device(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)],rhs[nx*iy_start:nx*(iy_end-iy_start)])\n", + " 138, Generating present(A[:],rhs[:],Anew[:])\n", + " Generating Tesla code\n", + " 141, #pragma acc loop gang /* blockIdx.x */\n", + " 142, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 146, Generating implicit reduction(max:error)\n", + " 142, Loop is parallelizable\n", + " 154, Generating present(Anew[:],A[:])\n", + " Generating Tesla code\n", + " 157, #pragma acc loop gang /* blockIdx.x */\n", + " 158, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 158, Loop is parallelizable\n", + " 192, Generating present(A[:])\n", + " Generating Tesla code\n", + " 195, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 205, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 223, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" + ] + } + ], + "source": [ + "checkdir('task5')\n", + "!make poisson2d.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", + "Job <25682> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Parallel execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 1.3210 s, 2 GPUs: 0.6750 s, speedup: 1.96, efficiency: 97.86%\n" + ] + } + ], + "source": [ + "checkdir('task5')\n", + "!NP=2 make run.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", + "Job <25683> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==80646== PGPROF is profiling process 80646, command: ./poisson2d.solution 10\n", + "==80644== PGPROF is profiling process 80644, command: ./poisson2d.solution 10\n", + "==80646== Generated result file: 
/gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.0.pgprof\n", + "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + "Parallel execution.\n", + " 0, 0.250000\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 0.0227 s, 2 GPUs: 0.0120 s, speedup: 1.89, efficiency: 94.65%\n", + "==80644== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.1.pgprof\n", + "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.?.pgprof .\n", + "tar -cvzf pgprof.poisson2d.Task5.solution.tar.gz poisson2d.solution.Task5.NP2.?.pgprof\n", + "poisson2d.solution.Task5.NP2.0.pgprof\n", + "poisson2d.solution.Task5.NP2.1.pgprof\n" + ] + } + ], + "source": [ + "checkdir('task5')\n", + "!NP=2 make profile.solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Scaling\n", + "\n", + "You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n" + ] + }, + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>GPUs</th>\n", + " <th>time [s]</th>\n", + " <th>speedup</th>\n", + " <th>efficiency</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>1</td>\n", + " <td>1.3004</td>\n", + " <td>1.01,</td>\n", + " <td>101.04%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>2</td>\n", + " <td>0.6705</td>\n", + " <td>1.95,</td>\n", + " <td>97.67%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>4</td>\n", + " <td>0.3879</td>\n", + " <td>3.41,</td>\n", + " <td>85.14%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>6</td>\n", + " <td>0.2745</td>\n", + " <td>4.81,</td>\n", + " <td>80.25%</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " GPUs time [s] speedup efficiency\n", + "0 1 1.3004 1.01, 101.04%\n", + "1 2 0.6705 1.95, 97.67%\n", + "2 4 0.3879 3.41, 85.14%\n", + "3 6 0.2745 4.81, 80.25%" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "checkdir('task5')\n", + "!NP=1 make run.solution | grep speedup > scale.out\n", + "!NP=2 make run.solution | grep speedup >> scale.out\n", + "!NP=4 make run.solution | grep speedup >> scale.out\n", + "!NP=6 make run.solution | grep speedup >> scale.out\n", + "data_frameS5 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)\n", + "\n", + "!rm scale.out\n", + "\n", + 
"data_frameS5b=data_frameS5.iloc[:,[5,7,10,12]].copy()\n", + "data_frameS5b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "The asynchronous execution and execution in the same stream can be seen in the profiler, e.g. as shown below.\n", + "\n", + "\n", + "\n", + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "## Solution 6:<a name=\"solution6\"></a>\n", + "\n", + "\n", + "The most important part here is to get an `nvshmem_ptr` pointing to the symmetric `d_A` allocation of your top and bottom neighbor.\n", + "```C\n", + "real * restrict d_Atop = (real *)nvshmem_ptr(d_A, top);\n", + "real * restrict d_Abottom = (real *)nvshmem_ptr(d_A, bottom);\n", + "```\n", + "\n", + "When updating `A` from Anew make sure to also update `A` on your top and bottom neighbor if you are at the boundary:\n", + "```C\n", + "#pragma acc parallel loop present(A, Anew) deviceptr(d_Atop, d_Abottom) async\n", + "for (int iy = iy_start; iy < iy_end; iy++) {\n", + " for (int ix = ix_start; ix < ix_end; ix++) {\n", + " A[iy * nx + ix] = Anew[iy * nx + ix];\n", + " if(iy == iy_start){// this also needs to go to the lower halo region of my upper neighbor\n", + " d_Atop[iy_end_top * nx + ix] = Anew[iy * nx + ix];\n", + " }\n", + " if(iy == iy_end -1){// this also needs to go to the upper halo region of my bottom neighbor\n", + " d_Abottom[(iy_start_bottom - 1) * nx + ix] = Anew[iy * nx + ix];\n", + " }\n", + " }\n", + "}\n", + "```\n", + "\n", + "We can then remove the explicit `nvhsmem_put` calls on completely. But remember to still keep the barrier.\n", + "```C\n", + "nvshmemx_barrier_all_on_stream((cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));\n", + "````\n", + "\n", + "\n", + "\n", + "#### Code\n", + "\n", + "* [C Version](./C/task6/poisson2d.solution.c)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Compiling, Running and Profiling\n", + "\n", + "You can compile, run and profile the solution with the next cells. You can profile the code by executing the next cell. __After__ the profiling completed download the tarball containing the profiles (`pgprof.Task6.solution.poisson2d.tar.gz`) with the File Browser. \n", + "Then you can import them into pgprof / nvvp using the _Import_ option in the _File_ menu. Remember to use the _Multiple processes_ option in the assistant. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/autofs/nccsopen-svm1_home/mathiasw/sc19-tutorial-openpower/4-GPU/HandsOn/Solution/C/task6\n" + ] + } + ], + "source": [ + "%cd $basedir/task6" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "mpicxx -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned poisson2d_serial.c -o poisson2d_serial.o\n", + "poisson2d_serial(int, int, double, double *, double *, int, int, const double *):\n", + " 37, Generating present(Anew[:],rhs[:],Aref[:])\n", + " 39, Generating update device(rhs[:ny*nx],Aref[:ny*nx])\n", + " 40, Generating Tesla code\n", + " 43, #pragma acc loop gang /* blockIdx.x */\n", + " 44, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 49, Generating implicit reduction(max:error)\n", + " 44, Loop is parallelizable\n", + " 51, Generating Tesla code\n", + " 54, #pragma acc loop gang /* blockIdx.x */\n", + " 55, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 55, Loop is parallelizable\n", + " 58, Generating Tesla code\n", + " 62, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 65, Generating Tesla code\n", + " 67, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 77, Generating update self(Aref[:ny*nx])\n", + "mpicxx -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned -I/ccsopen/home/mathiasw/nvshmem-master/build/include poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution -L/ccsopen/home/mathiasw/nvshmem-master/build/lib -lnvshmem -Mcuda -lcuda -lrt \n", + "poisson2d.solution.c:\n", + "main:\n", + " 95, Generating enter data create(Aref[:ny*nx],rhs[:ny*nx],A[:ny*nx],Anew[:ny*nx])\n", + " 106, Generating present(Aref[:],A[:])\n", + " Generating Tesla code\n", + " 110, #pragma acc loop gang /* blockIdx.x */\n", + " 111, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 111, Loop is parallelizable\n", + " 158, Generating update device(rhs[nx*iy_start:nx*(iy_end-iy_start)],A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 159, Generating present(A[:],rhs[:],Anew[:])\n", + " Generating Tesla code\n", + " 162, #pragma acc loop gang /* blockIdx.x */\n", + " 163, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 167, Generating implicit reduction(max:error)\n", + " 163, Loop is parallelizable\n", + " 174, Generating present(Anew[:],A[:])\n", + " Generating Tesla code\n", + " 177, #pragma acc loop gang /* blockIdx.x */\n", + " 179, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 179, Loop is parallelizable\n", + " 190, Generating present(A[:])\n", + " Generating Tesla code\n", + " 193, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 203, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 219, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" + ] + } + ], + "source": [ + "checkdir('task6')\n", + "!make poisson2d.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", + "Job <25688> is submitted to default queue <batch>.\n", + 
"<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Parallel execution.\n", + " 0, 0.250000\n", + " 100, 0.249940\n", + " 200, 0.249880\n", + " 300, 0.249821\n", + " 400, 0.249761\n", + " 500, 0.249702\n", + " 600, 0.249642\n", + " 700, 0.249583\n", + " 800, 0.249524\n", + " 900, 0.249464\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 1.3196 s, 2 GPUs: 0.6641 s, speedup: 1.99, efficiency: 99.34%\n" + ] + } + ], + "source": [ + "checkdir('task6')\n", + "!NP=2 make run.solution" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", + "Job <25689> is submitted to default queue <batch>.\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "==81382== PGPROF is profiling process 81382, command: ./poisson2d.solution 10\n", + "==81383== PGPROF is profiling process 81383, command: ./poisson2d.solution 10\n", + "==81382== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.1.pgprof\n", + "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", + "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", + "Calculate reference solution and time serial execution.\n", + " 0, 0.250000\n", + "Parallel execution.\n", + " 0, 0.250000\n", + "Num GPUs: 2.\n", + "4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0118 s, speedup: 1.91, efficiency: 95.50%\n", + "==81383== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.0.pgprof\n", + "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.?.pgprof .\n", + "tar -cvzf pgprof.poisson2d.Task6.solution.tar.gz poisson2d.solution.Task6.NP2.?.pgprof\n", + "poisson2d.solution.Task6.NP2.0.pgprof\n", + "poisson2d.solution.Task6.NP2.1.pgprof\n" + ] + } + ], + "source": [ + "checkdir('task6')\n", + "!NP=2 make profile.solution" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "#### Scaling\n", + "\n", + "You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "exercise": "solution" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n", + "<<Waiting for dispatch ...>>\n", + "<<Starting on login1>>\n" + ] + }, + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>GPUs</th>\n", + " <th>time [s]</th>\n", + " <th>speedup</th>\n", + " <th>efficiency</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>1</td>\n", + " <td>1.2964</td>\n", + " <td>1.01,</td>\n", + " <td>101.26%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>2</td>\n", + " <td>0.6714</td>\n", + " <td>1.94,</td>\n", + " <td>96.87%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>4</td>\n", + " <td>0.3810</td>\n", + " <td>3.46,</td>\n", + " <td>86.47%</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>6</td>\n", + " <td>0.2641</td>\n", + " <td>4.87,</td>\n", + " <td>81.16%</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " GPUs time [s] speedup efficiency\n", + "0 1 1.2964 1.01, 101.26%\n", + "1 2 0.6714 1.94, 96.87%\n", + "2 4 0.3810 3.46, 86.47%\n", + "3 6 0.2641 4.87, 81.16%" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "checkdir('task6')\n", + "!NP=1 make run.solution | grep speedup > scale.out\n", + "!NP=2 make run.solution | grep speedup >> scale.out\n", + "!NP=4 make run.solution | grep speedup >> scale.out\n", + "!NP=6 make run.solution | grep speedup >> scale.out\n", + "data_frameS5 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)\n", + "\n", + "!rm scale.out\n", + "\n", + "data_frameS5b=data_frameS5.iloc[:,[5,7,10,12]].copy()\n", + "data_frameS5b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "exercise": "solution" + }, + "source": [ + "The missing of device copies can be seen in the profiler, e.g. as shown below. 
There are only kernels running mostly back-to-back, only interrupted by the global reduction.\n", + "\n", + "\n", + "\n", + "[Back to Top](#top)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "# Survey<a name=\"survey\"></a>\n", + "\n", + "Please remember to take some time and fill out the surveyhttp://bit.ly/sc19-eval.\n", + "\n", + "" + ] + } + ], + "metadata": { + "celltoolbar": "Edit Metadata", + "kernelspec": { + "display_name": "SC19 S4 (Python 3 + GPU)", + "language": "python", + "name": "session-4" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.0" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/4-GPU/HandsOn/HandsOnGPUProgramming.html b/4-GPU/HandsOn/HandsOnGPUProgramming.html index 50711ae98ecb3e5a96d001a8e5c3c70d609a8d3f..3d827d9559249ae1aaa3fbfeb97da80d27264b71 100644 --- a/4-GPU/HandsOn/HandsOnGPUProgramming.html +++ b/4-GPU/HandsOn/HandsOnGPUProgramming.html @@ -13087,19 +13087,19 @@ div#notebook { </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h3 id="Read-me-first">Read me first<a class="anchor-link" href="#Read-me-first">¶</a></h3><p>This tutorial is primarily designed to be executed as a <em>jupyter</em> notebook. However, everything can also be done using an <em>ssh</em> connection to <em>ascent.olcf.ornl.gov</em> in your terminal.</p> -<h4 id="Jupyter-Lab-execution">Jupyter Lab execution<a class="anchor-link" href="#Jupyter-Lab-execution">¶</a></h4><p>When using jupyter this notebook will guide you through the step. Note that if you execute a cell multiple times while optimizing the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. Check the <em>edit</em> menu above.</p> -<p>You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.</p> +<h4 id="Jupyter-Lab-execution">Jupyter Lab execution<a class="anchor-link" href="#Jupyter-Lab-execution">¶</a></h4><p>When using jupyter this notebook will guide you through the tasks. Note that if you execute a cell multiple times while optimizing the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. 
Check the <em>edit</em> menu above.</p> +<p>You can always use the file browser to locate the the source files you will need to edit as well as the profiling output you need to open locally.</p> <p>If you want you also can get a terminal in your browser by following the <em>File -> New -> Terminal</em> in the Jupyter Lab menu bar.</p> <h4 id="Terminal-fallback">Terminal fallback<a class="anchor-link" href="#Terminal-fallback">¶</a></h4><p>The tasks are placed in directories named <code>[C/FORTRAN]/task[0-6]</code>.<br> <em>Note: The tasks using NVHSMEM (4-6) are only available in C.</em></p> -<p>The files you will need to edit are always the <code>poisson2d.(C|F03)</code> files.</p> -<p>The makefile targets execute everything to compile, run and profile the code. Please take a look at the cells containing the make calls as a guide.</p> -<p>The outputs of profiling runs be placed in the working directory of the current task and are named like <code>*.pgprof</code> or <code>pgprof.*.tar.gz</code> in case of multiple files. You can use <em>scp/sftp</em> to transfer files to your machine and for viewing them in pgprof/nvprof.</p> +<p>The files you will need to edit are always the <code>poisson2d.(c|F03)</code> files.</p> +<p>The makefile targets execute everything to compile, run and profile the code. Please take a look at the cells containing the make calls as guidane.</p> +<p>The outputs of profiling runs will be placed in the working directory of the current task and are named like <code>*.pgprof</code> or <code>pgprof.*.tar.gz</code> in case of multiple files. You can use <em>scp/sftp</em> to transfer files to your machine and for viewing them in pgprof/nvprof.</p> <h4 id="Viewing-profiles-in-the-NVIDIA-Visual-Profiler-/-PGI-Profiler">Viewing profiles in the NVIDIA Visual Profiler / PGI Profiler<a class="anchor-link" href="#Viewing-profiles-in-the-NVIDIA-Visual-Profiler-/-PGI-Profiler">¶</a></h4><p>The profiles generated <em>pgprof / nvprof</em> should be viewed on your local machine. You can install the PGI Community Edition (pgprof) or the NVIDIA CUDA Toolkit on your notebook (Windows, Mac, Linux). You don't need an NVIDIA GPU in your machine to use the profiler GUI.</p> <p>There are USB Sticks in the room that contain the installers for various platforms, but for reference you can also download it from:</p> <ul> -<li><a href="https://developer.nvidia.com/cuda-downloads">NVIDIA CUDA Toolkit</a></li> -<li><a href="https://www.pgroup.com/products/community.htm">PGI Community Edition</a></li> +<li><a href="https://developer.nvidia.com/cuda-downloads">NVIDIA CUDA Toolkit</a> </li> +<li><a href="https://www.pgroup.com/products/community.htm">PGI Community Edition</a> <em>For Windows and Linux only, there is no GPU support for Mac</em></li> </ul> <p>After downloading the profiler output (more infos below) follow the steps outlined in:</p> <ul> @@ -13162,8 +13162,10 @@ div#notebook { <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> -<h1 id="Tasks">Tasks<a name="top" /><a class="anchor-link" href="#Tasks">¶</a></h1><p>This session comes with multiple tasks. All tasks are available in C or FORTRAN and can be found in the <code>[C|Fortan]/task[0-3]</code> subdirectories. There you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.</p> -<p>Please choose from the task below. 
<em>If you want to go for the advanced NVSHMEM tasks you should complete Task 2 but can skip Task 3 (or postpone it until the end).</em></p> +<h1 id="Tasks">Tasks<a name="top" /><a class="anchor-link" href="#Tasks">¶</a></h1><p>This session includes multiple tasks. The first tasks are available in C or FORTRAN and can be found in the <code>[C|Fortan]/task[0-3]</code> subdirectories. The <em>advanced / optional</em> NVSHMEM tasks are available only in C and located in the <code>C/task[4-6]</code> directories.</p> +<p><em>If you want to go for the advanced NVSHMEM tasks you should complete Task 2 but can skip Task 3 (or postpone it until the end).</em></p> +<p>In any case you will also Makefiles that are set up so that you can compile and submit all necessary tasks.</p> +<p>Please choose from the task below.</p> <h3 id="GPU-Programming">GPU Programming<a class="anchor-link" href="#GPU-Programming">¶</a></h3><ul> <li><p><a href="#task0">Task 0</a> Accelerate a CPU Jacobi solver with OpenACC relying on Unified Memory for data movement using <code>–ta=tesla:managed</code></p> </li> @@ -13182,9 +13184,11 @@ div#notebook { </li> <li><p><a href="#task5">Task 5</a> Put NVSHMEM calls on stream to hide API calls and GPU/CPU synchronization</p> </li> +<li><p><a href="#task6">Task 6</a> Use NVSHMEM to fine-grained also update the halo region of your neighbors</p> +</li> </ul> <h3 id="Survey">Survey<a class="anchor-link" href="#Survey">¶</a></h3><ul> -<li><a href="#survey">Suvery</a> Please remember to take the survey !</li> +<li>Please remember to take the <a href="#survey">suvery</a> !</li> </ul> </div> @@ -13228,7 +13232,7 @@ same as above for the solution (e.g. <code>make poisson2d.solution</code> or <co <p><em>Look for</em> <strong>TODOs</strong> in the code.</p> <p>Look at the output generated by the PGI compiler (enabled by the <code>-Minfo=accel</code> option) to see how the compiler parallelizes the code.</p> <h4 id="Code">Code<a class="anchor-link" href="#Code">¶</a></h4><p>You can open the source code either in a terminal in an editor. Navigate to <code>(C|Fortran)/task0/</code> and open <code>poisson2d.c</code> in a editor of your choice.</p> -<p>If your are using the jupyter approach by following the link (for the language of your choice), This will open the source code in an editor in a new browser tab/window.</p> +<p>If your are using the jupyter approach by following the link (for the language of your choice). This will open the source code in an editor in a new browser tab/window.</p> <ul> <li><a href="./C/task0/poisson2d.c">C Version</a></li> <li><a href=".FORTAN/task0/poisson2d.F03">Fortran Version</a></li> @@ -13751,14 +13755,16 @@ Then you can import them into pgprof / nvvp using the <em>Import</em> option in </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <hr> -<h1 id="Tasks-using-NVSHMEM">Tasks using NVSHMEM<a class="anchor-link" href="#Tasks-using-NVSHMEM">¶</a></h1><p><strong>The following tasks are using NVSHMEM instead of MPI. NVSHMEM is currently available as early access software. Please read the following carefully before starting these tasks.</strong></p> +<h1 id="Tasks-using-NVSHMEM">Tasks using NVSHMEM<a class="anchor-link" href="#Tasks-using-NVSHMEM">¶</a></h1><p><strong>The following tasks are using NVSHMEM instead of MPI</strong>. 
+<strong>NVSHMEM is currently available as early access software.</strong> +<strong>Please read the following carefully before starting these tasks.</strong></p> <ul> <li><em>NVSHMEM early access 0.3.2</em> is installed on Ascent. It is provided under the license in <a href="./LICENSE_NVSHMEM.md">LICENSE_NVSHMEM.md</a>.</li> <li>If you want to continue using the NVHSMEM early access version beyond this tutorial you need to apply for early access at <a href="https://developer.nvidia.com/nvshmem">https://developer.nvidia.com/nvshmem</a></li> </ul> <hr> -<p>NVSHMEM enables efficient communication among GPUs.It supports an API for direct communication among GPUs, either initiated by the CPU or by GPUs inside of compute kernels. Inside compute kernels, NVSHMEM also supports direct load/store accesses to remote memory over PCIe or NVLink. The ability to initiate communication from inside kernels eliminates GPU-host-synchronization and associated overheads. It can also benefit from latency tolerance mechanisms available within GPUs. The tasks illustrate that progressing from an MPI-only app to an app that uses NVSHMEM can be straightforward.</p> -<p><strong>NOTE</strong>: Covering all feature of NVSHMEM, incuding communication calls in kernels, is not easily accessible through OpenACC and also exceed the scope of this tutorial. However, the OpenACC examples should give you a basic introduction to NVSHMEM.</p> +<p>NVSHMEM enables efficient communication among GPUs. It supports an API for direct communication among GPUs, either initiated by the CPU or by GPUs inside of compute kernels. Inside compute kernels, NVSHMEM also supports direct load/store accesses to remote memory over PCIe or NVLink. The ability to initiate communication from inside kernels eliminates GPU-host-synchronization and associated overheads. It can also benefit from latency tolerance mechanisms available within GPUs. The tasks illustrate that progressing from an MPI-only app to an app that uses NVSHMEM can be straightforward.</p> +<p><strong>NOTE</strong>: Covering all feature of NVSHMEM, including communication calls in kernels, is not easily accessible through OpenACC and also exceed the scope of this tutorial. However, the OpenACC examples should give you a basic introduction to NVSHMEM.</p> <p>You can check the developer guide and the other presentations</p> <h4 id="References">References<a class="anchor-link" href="#References">¶</a></h4><ol> <li><a href="http://www.openacc.org">http://www.openacc.org</a></li> @@ -13783,8 +13789,8 @@ Then you can import them into pgprof / nvvp using the <em>Import</em> option in <li>Replace <code>MPI_Sendrecv</code> calls with SHMEM calls (<code>nvshmem_double_put</code>)</li> <li>Insert NVSHMEM barriers to ensure correct execution (<code>nvshmem_barrier_all</code>)</li> </ul> -<p><strong>For interoperability with OpenSHMEM NVSHMEM can also be set up to prefix all calls to NVHSMEM with <code>nv</code>. Please make sure to use these version, e.g. use <code>nvshmem_barrier</code> instead of <code>shmem_barrier</code>. The developer guide mostly uses the unprefixed versions.</strong></p> -<p><em>Look for</em> <strong>TODOs</strong>.</p> +<p><strong>For interoperability with OpenSHMEM NVSHMEM can also be set up to prefix all calls to NVHSMEM with <code>nv</code>. Please make sure to use these version, e.g. use <code>nvshmem_barrier</code> instead of <code>shmem_barrier</code>. 
The developer guide mostly uses the not prefixed versions.</strong></p> +<p><em>Look for</em> <strong>TODOs</strong> in the code.</p> <h4 id="Code">Code<a class="anchor-link" href="#Code">¶</a></h4><ul> <li><a href="./C/task4/poisson2d.c">C Version</a></li> </ul> @@ -13921,10 +13927,10 @@ Then you can import them into pgprof / nvvp using the <em>Import</em> option in <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> -<h2 id="Task-5:-Make-communication-asynchronous">Task 5: <a name="task5" />Make communication asynchronous<a class="anchor-link" href="#Task-5:-Make-communication-asynchronous">¶</a></h2><p>NVSHMEM allows you to put communications in <em>CUDA streams / OpenACC async queues</em>. This allows the CPU already set up communication and kernel launches while the GPU is still communicationg, effectively hiding the time spend in API calls.</p> +<h2 id="Task-5:-Make-communication-asynchronous">Task 5: <a name="task5" />Make communication asynchronous<a class="anchor-link" href="#Task-5:-Make-communication-asynchronous">¶</a></h2><p>NVSHMEM allows you to put communications in <em>CUDA streams / OpenACC async queues</em>. This allows the CPU already set up communication and kernel launches while the GPU is still communicating, effectively hiding the time spend in API calls.</p> <p>To do this you need to:</p> <ul> -<li>use the <code>async</code> and <code>wait</code> keywords in the OpenACC pragmas to excute the kernels asynchronously in the OpenACC default queu</li> +<li>use the <code>async</code> and <code>wait</code> keywords in the OpenACC pragmas to excute the kernels asynchronously in the OpenACC default queue</li> <li>replace <code>nvshmem_double_put</code> calls with the <code>nvhsmemx_double_put_on_stream</code> version.<br> use <code>use acc_get_cuda_stream</code> and <code>acc_get_default_async</code> to get the <code>cudaStream_t cudaStream</code> corresponding to the OpenACC default async queue.</li> <li>make sure to synchronize before copying the data back to the CPU</li> @@ -14068,7 +14074,7 @@ Then you can import them into pgprof / nvvp using the <em>Import</em> option in <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> -<h2 id="Task-6:-Use-direct-load/store-to-remote-memory">Task 6: <a name="task5" />Use direct load/store to remote memory<a class="anchor-link" href="#Task-6:-Use-direct-load/store-to-remote-memory">¶</a></h2><p>NVSHMEM allows you to put communications in the GPU kernels. Howerver, the <code>nvhsmem_put / nvshmem_get</code> calls are not easily avilable in OpenACC kernels. However, for <em>intranode</em> communication when all GPUs can use P2P (as in the nodes in Ascent and Summit) you can get a pointer to a remote GPUs memory using <code>nvshmem_ptr</code>.</p> +<h2 id="Task-6:-Use-direct-load/store-to-remote-memory">Task 6: <a name="task6" />Use direct load/store to remote memory<a class="anchor-link" href="#Task-6:-Use-direct-load/store-to-remote-memory">¶</a></h2><p>NVSHMEM allows you to put communications in the GPU kernels. However, the <code>nvhsmem_put / nvshmem_get</code> calls are not easily available in OpenACC kernels. 
However, for <em>intranode</em> communication when all GPUs can use P2P (as in the nodes in Ascent and Summit) you can get a pointer to a remote GPUs memory using <code>nvshmem_ptr</code>.</p> <p>To do this you need to:</p> <ul> <li>use the <code>nvshmem_ptr</code> to get pointers to your neighboring (top/bottom) <code>d_A</code> allocation</li> diff --git a/4-GPU/HandsOn/HandsOnGPUProgramming.ipynb b/4-GPU/HandsOn/HandsOnGPUProgramming.ipynb index 04cd39fca256256d1c1641def972b041b1a33a21..365384bf87435559998779b3bee68daf674f1138 100644 --- a/4-GPU/HandsOn/HandsOnGPUProgramming.ipynb +++ b/4-GPU/HandsOn/HandsOnGPUProgramming.ipynb @@ -22,9 +22,9 @@ "\n", "#### Jupyter Lab execution\n", "\n", - "When using jupyter this notebook will guide you through the step. Note that if you execute a cell multiple times while optimizing the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n", + "When using jupyter this notebook will guide you through the tasks. Note that if you execute a cell multiple times while optimizing the code the output will be replaced. You can however duplicate the cell you want to execute and keep its output. Check the _edit_ menu above.\n", "\n", - "You will always find links to a file browser of the corresponding task subdirectory as well as direct links to the source files you will need to edit as well as the profiling output you need to open locally.\n", + "You can always use the file browser to locate the the source files you will need to edit as well as the profiling output you need to open locally.\n", "\n", "If you want you also can get a terminal in your browser by following the *File -> New -> Terminal* in the Jupyter Lab menu bar.\n", "\n", @@ -32,19 +32,19 @@ "The tasks are placed in directories named `[C/FORTRAN]/task[0-6]`.<br>\n", "*Note: The tasks using NVHSMEM (4-6) are only available in C.* \n", "\n", - "The files you will need to edit are always the `poisson2d.(C|F03)` files.\n", + "The files you will need to edit are always the `poisson2d.(c|F03)` files.\n", "\n", - "The makefile targets execute everything to compile, run and profile the code. Please take a look at the cells containing the make calls as a guide.\n", + "The makefile targets execute everything to compile, run and profile the code. Please take a look at the cells containing the make calls as guidane.\n", "\n", - "The outputs of profiling runs be placed in the working directory of the current task and are named like `*.pgprof` or `pgprof.*.tar.gz` in case of multiple files. You can use _scp/sftp_ to transfer files to your machine and for viewing them in pgprof/nvprof.\n", + "The outputs of profiling runs will be placed in the working directory of the current task and are named like `*.pgprof` or `pgprof.*.tar.gz` in case of multiple files. You can use _scp/sftp_ to transfer files to your machine and for viewing them in pgprof/nvprof.\n", "\n", "#### Viewing profiles in the NVIDIA Visual Profiler / PGI Profiler\n", "\n", "The profiles generated _pgprof / nvprof_ should be viewed on your local machine. You can install the PGI Community Edition (pgprof) or the NVIDIA CUDA Toolkit on your notebook (Windows, Mac, Linux). 
You don't need an NVIDIA GPU in your machine to use the profiler GUI.\n", "\n", "There are USB Sticks in the room that contain the installers for various platforms, but for reference you can also download it from:\n", - "* [NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)\n", - "* [PGI Community Edition](https://www.pgroup.com/products/community.htm)\n", + "* [NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-downloads) \n", + "* [PGI Community Edition](https://www.pgroup.com/products/community.htm) _For Windows and Linux only, there is no GPU support for Mac_\n", "\n", "After downloading the profiler output (more infos below) follow the steps outlined in:\n", "* [Import Session](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-session)\n", @@ -106,9 +106,14 @@ "source": [ "# Tasks<a name=\"top\"></a>\n", "\n", - "This session comes with multiple tasks. All tasks are available in C or FORTRAN and can be found in the `[C|Fortan]/task[0-3]` subdirectories. There you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n", + "This session includes multiple tasks. The first tasks are available in C or FORTRAN and can be found in the `[C|FORTRAN]/task[0-3]` subdirectories. The *advanced / optional* NVSHMEM tasks are available only in C and located in the `C/task[4-6]` directories. \n", + "\n", + "*If you want to go for the advanced NVSHMEM tasks you should complete Task 2 but can skip Task 3 (or postpone it until the end).*\n", + "\n", + "In any case you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.\n", + "\n", + "Please choose from the tasks below. \n", "\n", - "Please choose from the task below. *If you want to go for the advanced NVSHMEM tasks you should complete Task 2 but can skip Task 3 (or postpone it until the end).*\n", "\n", "\n", "### GPU Programming\n", @@ -132,10 +137,12 @@ "\n", "* [Task 5](#task5) Put NVSHMEM calls on stream to hide API calls and GPU/CPU synchronization \n", "\n", + "* [Task 6](#task6) Use NVSHMEM fine-grained direct stores to also update the halo regions of your neighbors \n", + "\n", "\n", "### Survey\n", " \n", - " * [Suvery](#survey) Please remember to take the survey !" + " * Please remember to take the [survey](#survey)!" ] }, { @@ -195,7 +202,7 @@ "\n", "You can open the source code either in a terminal in an editor. Navigate to `(C|Fortran)/task0/` and open `poisson2d.c` in a editor of your choice.\n", "\n", - "If your are using the jupyter approach by following the link (for the language of your choice), This will open the source code in an editor in a new browser tab/window.\n", + "If you are using the jupyter approach, follow the link (for the language of your choice). This will open the source code in an editor in a new browser tab/window.\n", "\n", "* [C Version](./C/task0/poisson2d.c)\n", "* [Fortran Version](.FORTAN/task0/poisson2d.F03)\n", @@ -758,16 +765,20 @@ "---\n", "# Tasks using NVSHMEM \n", "\n", - "**The following tasks are using NVSHMEM instead of MPI. NVSHMEM is currently available as early access software. Please read the following carefully before starting these tasks.**\n", + "\n", + "**The following tasks are using NVSHMEM instead of MPI**. \n", + "**NVSHMEM is currently available as early access software.** \n", + "**Please read the following carefully before starting these tasks.**\n", + "\n", "\n", "* *NVSHMEM early access 0.3.2* is installed on Ascent. 
It is provided under the license in [LICENSE_NVSHMEM.md](./LICENSE_NVSHMEM.md).\n", "* If you want to continue using the NVHSMEM early access version beyond this tutorial you need to apply for early access at https://developer.nvidia.com/nvshmem\n", "\n", "---\n", "\n", - "NVSHMEM enables efficient communication among GPUs.It supports an API for direct communication among GPUs, either initiated by the CPU or by GPUs inside of compute kernels. Inside compute kernels, NVSHMEM also supports direct load/store accesses to remote memory over PCIe or NVLink. The ability to initiate communication from inside kernels eliminates GPU-host-synchronization and associated overheads. It can also benefit from latency tolerance mechanisms available within GPUs. The tasks illustrate that progressing from an MPI-only app to an app that uses NVSHMEM can be straightforward.\n", + "NVSHMEM enables efficient communication among GPUs. It supports an API for direct communication among GPUs, either initiated by the CPU or by GPUs inside of compute kernels. Inside compute kernels, NVSHMEM also supports direct load/store accesses to remote memory over PCIe or NVLink. The ability to initiate communication from inside kernels eliminates GPU-host-synchronization and associated overheads. It can also benefit from latency tolerance mechanisms available within GPUs. The tasks illustrate that progressing from an MPI-only app to an app that uses NVSHMEM can be straightforward.\n", "\n", - "**NOTE**: Covering all feature of NVSHMEM, incuding communication calls in kernels, is not easily accessible through OpenACC and also exceed the scope of this tutorial. However, the OpenACC examples should give you a basic introduction to NVSHMEM.\n", + "**NOTE**: Covering all feature of NVSHMEM, including communication calls in kernels, is not easily accessible through OpenACC and also exceed the scope of this tutorial. However, the OpenACC examples should give you a basic introduction to NVSHMEM.\n", "\n", "You can check the developer guide and the other presentations \n", "\n", @@ -799,9 +810,9 @@ "\n", "\n", "\n", - "**For interoperability with OpenSHMEM NVSHMEM can also be set up to prefix all calls to NVHSMEM with `nv`. Please make sure to use these version, e.g. use `nvshmem_barrier` instead of `shmem_barrier`. The developer guide mostly uses the unprefixed versions.**\n", + "**For interoperability with OpenSHMEM NVSHMEM can also be set up to prefix all calls to NVHSMEM with `nv`. Please make sure to use these version, e.g. use `nvshmem_barrier` instead of `shmem_barrier`. The developer guide mostly uses the not prefixed versions.**\n", "\n", - "_Look for_ __TODOs__.\n", + "_Look for_ __TODOs__ in the code.\n", "\n", "\n", "\n", @@ -949,10 +960,10 @@ "source": [ "## Task 5: <a name=\"task5\"></a>Make communication asynchronous\n", "\n", - "NVSHMEM allows you to put communications in *CUDA streams / OpenACC async queues*. This allows the CPU already set up communication and kernel launches while the GPU is still communicationg, effectively hiding the time spend in API calls.\n", + "NVSHMEM allows you to put communications in *CUDA streams / OpenACC async queues*. 
This allows the CPU already set up communication and kernel launches while the GPU is still communicating, effectively hiding the time spend in API calls.\n", "\n", "To do this you need to:\n", - "* use the `async` and `wait` keywords in the OpenACC pragmas to excute the kernels asynchronously in the OpenACC default queu\n", + "* use the `async` and `wait` keywords in the OpenACC pragmas to excute the kernels asynchronously in the OpenACC default queue\n", "* replace `nvshmem_double_put` calls with the `nvhsmemx_double_put_on_stream` version.<br>\n", " use `use acc_get_cuda_stream` and `acc_get_default_async` to get the `cudaStream_t cudaStream` corresponding to the OpenACC default async queue.\n", "* make sure to synchronize before copying the data back to the CPU\n", @@ -1106,9 +1117,9 @@ "exercise": "task" }, "source": [ - "## Task 6: <a name=\"task5\"></a>Use direct load/store to remote memory\n", + "## Task 6: <a name=\"task6\"></a>Use direct load/store to remote memory\n", "\n", - "NVSHMEM allows you to put communications in the GPU kernels. Howerver, the `nvhsmem_put / nvshmem_get` calls are not easily avilable in OpenACC kernels. However, for *intranode* communication when all GPUs can use P2P (as in the nodes in Ascent and Summit) you can get a pointer to a remote GPUs memory using `nvshmem_ptr`.\n", + "NVSHMEM allows you to put communications in the GPU kernels. However, the `nvhsmem_put / nvshmem_get` calls are not easily available in OpenACC kernels. However, for *intranode* communication when all GPUs can use P2P (as in the nodes in Ascent and Summit) you can get a pointer to a remote GPUs memory using `nvshmem_ptr`.\n", "\n", "To do this you need to:\n", "* use the `nvshmem_ptr` to get pointers to your neighboring (top/bottom) `d_A` allocation\n", @@ -1298,7 +1309,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.5" + "version": "3.7.0" }, "toc": { "base_numbering": 1, diff --git a/4-GPU/HandsOn/Solution/C/task6/poisson2d.solution.c b/4-GPU/HandsOn/Solution/C/task6/poisson2d.solution.c index bf27e40780ccaa9824557e832b1209b3b6e9fbaa..c34eff96b7d833cf3d8e5cadc71ece659c18705e 100644 --- a/4-GPU/HandsOn/Solution/C/task6/poisson2d.solution.c +++ b/4-GPU/HandsOn/Solution/C/task6/poisson2d.solution.c @@ -210,9 +210,6 @@ int main(int argc, char **argv) { printf("%dx%d: 1 GPU: %8.4f s, %d GPUs: %8.4f s, speedup: %8.2f, efficiency: %8.2f%\n", ny, nx, runtime_serial, size, runtime, runtime_serial / runtime, runtime_serial / (size * runtime) * 100); - printf( - "MPI time: %8.4f s, inter GPU BW: %8.2f GiB/s\n", mpi_time, - (iter * 4 * (ix_end - ix_start) * sizeof(real)) / (1024 * 1024 * 1024 * mpi_time)); } } else { errors = -1; diff --git a/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.html b/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.html index 151345c8a58f2f77b345ab5cf987bd7d4b75d779..565d6e790cda78ac147a5d86e993577fa61ec486 100644 --- a/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.html +++ b/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.html @@ -13087,7 +13087,7 @@ div#notebook { </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <h2 id="Solutions">Solutions<a class="anchor-link" href="#Solutions">¶</a></h2><p><strong>This contains the output for the solutions.</strong></p> -<p>The solutions are described in the solution section. The directory links to the solution source files should work though. 
For the <em>html</em> and <em>pdf</em> versions please navigate to the corresponding directory to find the solution profiles and sources.</p> +<p>The solutions are described in the solution section. Please navigate to the corresponding directory to find the solution profiles and sources.</p> <h3 id="GPU-Programming">GPU Programming<a class="anchor-link" href="#GPU-Programming">¶</a></h3><ul> <li><a href="#solution0">Solution 0</a> Accelerate a CPU Jacobi solver with OpenACC relying on Unified Memory for data movement using <code>–ta=tesla:managed</code> </li> </ul> @@ -13108,7 +13108,7 @@ div#notebook { <li><a href="#solution5">Solution 5</a> Put NVSHMEM calls on stream to hide API calls and GPU/CPU synchronization </li> </ul> <h3 id="Survey">Survey<a class="anchor-link" href="#Survey">¶</a></h3><ul> -<li><a href="#survey">Suvery</a> Please remember to take the survey !</li> +<li>Please remember to take the <a href="#survey">suvery</a> !</li> </ul> <hr> <hr> @@ -13379,7 +13379,7 @@ main: <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d.solution -Job <25189> is submitted to default queue <batch>. +Job <25658> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> Jacobi relaxation Calculation: 2048 x 2048 mesh @@ -13395,7 +13395,7 @@ GPU execution. 200, 0.249522 300, 0.249285 400, 0.249048 -2048x2048: 1 CPU: 5.4684 s, 1 GPU: 0.1884 s, speedup: 29.02 +2048x2048: 1 CPU: 5.4111 s, 1 GPU: 0.1905 s, speedup: 28.40 </pre> </div> </div> @@ -13428,17 +13428,17 @@ GPU execution. <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof ./poisson2d.solution 10 -Job <25190> is submitted to default queue <batch>. +Job <25659> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==91820== PGPROF is profiling process 91820, command: ./poisson2d.solution 10 -==91820== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof +==77763== PGPROF is profiling process 77763, command: ./poisson2d.solution 10 +==77763== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof Jacobi relaxation Calculation: 2048 x 2048 mesh Calculate reference solution and time serial CPU execution. 0, 0.249999 GPU execution. 0, 0.249999 -2048x2048: 1 CPU: 0.1230 s, 1 GPU: 0.0189 s, speedup: 6.51 +2048x2048: 1 CPU: 0.1194 s, 1 GPU: 0.0179 s, speedup: 6.67 mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof . </pre> </div> @@ -13603,7 +13603,7 @@ main: <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d.solution -Job <25191> is submitted to default queue <batch>. +Job <25660> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> Jacobi relaxation Calculation: 2048 x 2048 mesh @@ -13619,7 +13619,7 @@ GPU execution. 200, 0.249522 300, 0.249285 400, 0.249048 -2048x2048: 1 CPU: 5.4691 s, 1 GPU: 0.1866 s, speedup: 29.31 +2048x2048: 1 CPU: 5.3929 s, 1 GPU: 0.1903 s, speedup: 28.33 </pre> </div> </div> @@ -13652,43 +13652,43 @@ GPU execution. 
<div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof ./poisson2d.solution 3 -Job <25192> is submitted to default queue <batch>. +Job <25661> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==92054== PGPROF is profiling process 92054, command: ./poisson2d.solution 3 -==92054== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof +==77997== PGPROF is profiling process 77997, command: ./poisson2d.solution 3 +==77997== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof Jacobi relaxation Calculation: 2048 x 2048 mesh Calculate reference solution and time serial CPU execution. 0, 0.249999 GPU execution. 0, 0.249999 -2048x2048: 1 CPU: 0.0465 s, 1 GPU: 0.0154 s, speedup: 3.01 +2048x2048: 1 CPU: 0.0437 s, 1 GPU: 0.0164 s, speedup: 2.66 bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off --analysis-metrics -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof ./poisson2d.solution 3 -Job <25193> is submitted to default queue <batch>. +Job <25662> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==71647== PGPROF is profiling process 71647, command: ./poisson2d.solution 3 -==71647== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics. -==71647== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof +==79400== PGPROF is profiling process 79400, command: ./poisson2d.solution 3 +==79400== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics. +==79400== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof Jacobi relaxation Calculation: 2048 x 2048 mesh Calculate reference solution and time serial CPU execution. 0, 0.249999 GPU execution. 0, 0.249999 -2048x2048: 1 CPU: 0.0476 s, 1 GPU: 12.4561 s, speedup: 0.00 +2048x2048: 1 CPU: 0.0475 s, 1 GPU: 12.3314 s, speedup: 0.00 bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off --metrics gld_efficiency,gst_efficiency -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof ./poisson2d.solution 3 -Job <25194> is submitted to default queue <batch>. +Job <25663> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==92292== PGPROF is profiling process 92292, command: ./poisson2d.solution 3 -==92292== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics. -==92292== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof +==78235== PGPROF is profiling process 78235, command: ./poisson2d.solution 3 +==78235== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics. +==78235== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof Jacobi relaxation Calculation: 2048 x 2048 mesh Calculate reference solution and time serial CPU execution. 0, 0.249999 GPU execution. 
0, 0.249999 -2048x2048: 1 CPU: 0.0487 s, 1 GPU: 0.6897 s, speedup: 0.07 +2048x2048: 1 CPU: 0.0483 s, 1 GPU: 0.6638 s, speedup: 0.07 pgprof --csv -i /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof 2>&1 | grep -v "======" > poisson2d.solution.efficiency.csv mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.*.pgprof . tar -cvzf pgprof.poisson2d.Task1.solution.tar.gz poisson2d.solution.*.pgprof @@ -13772,9 +13772,9 @@ If you purely work in a terminal you can view the same output by running <code>p <td>3</td> <td>gld_efficiency</td> <td>Global Memory Load Efficiency</td> - <td>90.868353%</td> - <td>90.896134%</td> - <td>90.881874%</td> + <td>90.866222%</td> + <td>91.051373%</td> + <td>90.962535%</td> </tr> <tr> <th>1</th> @@ -13860,9 +13860,9 @@ If you purely work in a terminal you can view the same output by running <code>p <td>3</td> <td>gld_efficiency</td> <td>Global Memory Load Efficiency</td> - <td>91.834032%</td> - <td>91.855433%</td> - <td>91.843628%</td> + <td>91.850475%</td> + <td>91.857005%</td> + <td>91.854824%</td> </tr> <tr> <th>9</th> @@ -14110,7 +14110,7 @@ main: <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" ./poisson2d.solution -Job <25195> is submitted to default queue <batch>. +Job <25664> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> Jacobi relaxation Calculation: 4096 x 4096 mesh @@ -14137,8 +14137,8 @@ Parallel execution. 800, 0.249524 900, 0.249464 Num GPUs: 2. -4096x4096: 1 GPU: 1.3165 s, 2 GPUs: 0.7221 s, speedup: 1.82, efficiency: 91.17% -MPI time: 0.0422 s, inter GPU BW: 2.89 GiB/s +4096x4096: 1 GPU: 1.3190 s, 2 GPUs: 0.7096 s, speedup: 1.86, efficiency: 92.94% +MPI time: 0.0424 s, inter GPU BW: 2.88 GiB/s </pre> </div> </div> @@ -14171,21 +14171,21 @@ MPI time: 0.0422 s, inter GPU BW: 2.89 GiB/s <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10 -Job <25196> is submitted to default queue <batch>. +Job <25665> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==92521== PGPROF is profiling process 92521, command: ./poisson2d.solution 10 -==92520== PGPROF is profiling process 92520, command: ./poisson2d.solution 10 -==92520== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.1.pgprof +==78468== PGPROF is profiling process 78468, command: ./poisson2d.solution 10 +==78469== PGPROF is profiling process 78469, command: ./poisson2d.solution 10 +==78469== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.1.pgprof Jacobi relaxation Calculation: 4096 x 4096 mesh Calculate reference solution and time serial execution. 0, 0.250000 Parallel execution. 0, 0.250000 Num GPUs: 2. 
-4096x4096: 1 GPU: 0.0224 s, 2 GPUs: 0.0130 s, speedup: 1.73, efficiency: 86.37% -MPI time: 0.0007 s, inter GPU BW: 1.75 GiB/s -==92521== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.0.pgprof +4096x4096: 1 GPU: 0.0226 s, 2 GPUs: 0.0129 s, speedup: 1.75, efficiency: 87.45% +MPI time: 0.0007 s, inter GPU BW: 1.70 GiB/s +==78468== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.0.pgprof mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.?.pgprof . tar -cvzf pgprof.poisson2d.Task2.solution.tar.gz poisson2d.solution.Task2.NP2.?.pgprof poisson2d.solution.Task2.NP2.0.pgprof @@ -14285,30 +14285,30 @@ poisson2d.solution.Task2.NP2.1.pgprof <tr> <th>0</th> <td>1</td> - <td>1.4201</td> + <td>1.4053</td> <td>0.93,</td> - <td>92.67%</td> + <td>93.06%</td> </tr> <tr> <th>1</th> <td>2</td> - <td>0.7157</td> + <td>0.7154</td> <td>1.83,</td> - <td>91.44%</td> + <td>91.56%</td> </tr> <tr> <th>2</th> <td>4</td> - <td>0.4301</td> - <td>3.08,</td> - <td>76.91%</td> + <td>0.4211</td> + <td>3.13,</td> + <td>78.21%</td> </tr> <tr> <th>3</th> <td>6</td> - <td>0.3037</td> - <td>4.32,</td> - <td>71.94%</td> + <td>0.3121</td> + <td>4.20,</td> + <td>70.05%</td> </tr> </tbody> </table> @@ -14520,7 +14520,7 @@ main: <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" ./poisson2d.solution -Job <25201> is submitted to default queue <batch>. +Job <25670> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> Jacobi relaxation Calculation: 4096 x 4096 mesh @@ -14547,8 +14547,8 @@ Parallel execution. 800, 0.249524 900, 0.249464 Num GPUs: 2. -4096x4096: 1 GPU: 1.3175 s, 2 GPUs: 0.6962 s, speedup: 1.89, efficiency: 94.62% -MPI time: 0.0583 s, inter GPU BW: 2.09 GiB/s +4096x4096: 1 GPU: 1.3172 s, 2 GPUs: 0.6964 s, speedup: 1.89, efficiency: 94.57% +MPI time: 0.0561 s, inter GPU BW: 2.17 GiB/s </pre> </div> </div> @@ -14581,21 +14581,21 @@ MPI time: 0.0583 s, inter GPU BW: 2.09 GiB/s <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10 -Job <25202> is submitted to default queue <batch>. +Job <25671> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==93249== PGPROF is profiling process 93249, command: ./poisson2d.solution 10 -==93248== PGPROF is profiling process 93248, command: ./poisson2d.solution 10 -==93249== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.1.pgprof +==79190== PGPROF is profiling process 79190, command: ./poisson2d.solution 10 +==79192== PGPROF is profiling process 79192, command: ./poisson2d.solution 10 +==79192== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.1.pgprof Jacobi relaxation Calculation: 4096 x 4096 mesh Calculate reference solution and time serial execution. 0, 0.250000 Parallel execution. 0, 0.250000 Num GPUs: 2. 
-4096x4096: 1 GPU: 0.0262 s, 2 GPUs: 0.0127 s, speedup: 2.06, efficiency: 103.02% -MPI time: 0.0009 s, inter GPU BW: 1.39 GiB/s -==93248== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.0.pgprof +4096x4096: 1 GPU: 0.0301 s, 2 GPUs: 0.0126 s, speedup: 2.39, efficiency: 119.53% +MPI time: 0.0009 s, inter GPU BW: 1.34 GiB/s +==79190== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.0.pgprof mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.?.pgprof . tar -cvzf pgprof.poisson2d.Task3.solution.tar.gz poisson2d.solution.Task3.NP2.?.pgprof poisson2d.solution.Task3.NP2.0.pgprof @@ -14695,30 +14695,30 @@ poisson2d.solution.Task3.NP2.1.pgprof <tr> <th>0</th> <td>1</td> - <td>1.3935</td> - <td>0.94,</td> - <td>93.86%</td> + <td>1.3815</td> + <td>0.95,</td> + <td>94.79%</td> </tr> <tr> <th>1</th> <td>2</td> - <td>0.6910</td> - <td>1.89,</td> - <td>94.52%</td> + <td>0.6968</td> + <td>1.90,</td> + <td>94.91%</td> </tr> <tr> <th>2</th> <td>4</td> - <td>0.3920</td> - <td>3.37,</td> - <td>84.13%</td> + <td>0.3990</td> + <td>3.30,</td> + <td>82.56%</td> </tr> <tr> <th>3</th> <td>6</td> - <td>0.2841</td> - <td>4.58,</td> - <td>76.29%</td> + <td>0.2720</td> + <td>4.81,</td> + <td>80.18%</td> </tr> </tbody> </table> @@ -14745,11 +14745,11 @@ poisson2d.solution.Task3.NP2.1.pgprof <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> -<h2 id="Solution-4:">Solution 4:<a name="solution4" /><a class="anchor-link" href="#Solution-4:">¶</a></h2><p>Include NVSHMEM headers</p> +<h2 id="Solution-4:">Solution 4:<a name="solution4" /><a class="anchor-link" href="#Solution-4:">¶</a></h2><p>First, include NVSHMEM headers</p> <div class="highlight"><pre><span></span><span class="cp">#include</span> <span class="cpf"><nvshmem.h></span><span class="cp"></span> <span class="cp">#include</span> <span class="cpf"><nvshmemx.h></span><span class="cp"></span> </pre></div> -<p>and initalize NVSHMEM with MPI</p> +<p>and initialize NVSHMEM with MPI</p> <div class="highlight"><pre><span></span><span class="n">MPI_Comm</span> <span class="n">mpi_comm</span> <span class="o">=</span> <span class="n">MPI_COMM_WORLD</span><span class="p">;</span> <span class="n">nvshmemx_init_attr_t</span> <span class="n">attr</span><span class="p">;</span> <span class="n">attr</span><span class="p">.</span><span class="n">mpi_comm</span> <span class="o">=</span> <span class="o">&</span><span class="n">mpi_comm</span><span class="p">;</span> @@ -14881,7 +14881,7 @@ poisson2d_serial(int, int, double, double *, double *, int, int, const double *) 65, Generating Tesla code 67, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 77, Generating update self(Aref[:ny*nx]) -mpicxx -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned -I/gpfs/wolf/trn003/world-shared/software/nvshmem//include poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution -L/gpfs/wolf/trn003/world-shared/software/nvshmem//lib -lnvshmem -Mcuda -lcuda -lrt +mpicxx -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned -I/gpfs/wolf/trn003/world-shared/software/nvshmem/include poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution -L/gpfs/wolf/trn003/world-shared/software/nvshmem/lib -lnvshmem -Mcuda -lcuda -lrt poisson2d.solution.c: main: 90, Generating enter data create(Aref[:ny*nx],rhs[:ny*nx],A[:ny*nx],Anew[:ny*nx]) @@ -14939,7 
+14939,7 @@ main: <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" ./poisson2d.solution -Job <25207> is submitted to default queue <batch>. +Job <25676> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation @@ -14967,8 +14967,8 @@ Parallel execution. 800, 0.249524 900, 0.249464 Num GPUs: 2. -4096x4096: 1 GPU: 1.3171 s, 2 GPUs: 0.7377 s, speedup: 1.79, efficiency: 89.27% -MPI time: 0.0686 s, inter GPU BW: 1.78 GiB/s +4096x4096: 1 GPU: 1.3188 s, 2 GPUs: 0.7398 s, speedup: 1.78, efficiency: 89.13% +MPI time: 0.0644 s, inter GPU BW: 1.90 GiB/s </pre> </div> </div> @@ -15001,12 +15001,11 @@ MPI time: 0.0686 s, inter GPU BW: 1.78 GiB/s <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10 -Job <25208> is submitted to default queue <batch>. +Job <25677> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==93971== PGPROF is profiling process 93971, command: ./poisson2d.solution 10 -==93970== PGPROF is profiling process 93970, command: ./poisson2d.solution 10 -==93971== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.0.pgprof +==79915== PGPROF is profiling process 79915, command: ./poisson2d.solution 10 +==79914== PGPROF is profiling process 79914, command: ./poisson2d.solution 10 WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation Jacobi relaxation Calculation: 4096 x 4096 mesh Calculate reference solution and time serial execution. @@ -15014,9 +15013,10 @@ Calculate reference solution and time serial execution. Parallel execution. 0, 0.250000 Num GPUs: 2. -4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0132 s, speedup: 1.71, efficiency: 85.34% -MPI time: 0.0010 s, inter GPU BW: 1.24 GiB/s -==93970== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.1.pgprof +4096x4096: 1 GPU: 0.0226 s, 2 GPUs: 0.0131 s, speedup: 1.72, efficiency: 86.13% +MPI time: 0.0010 s, inter GPU BW: 1.27 GiB/s +==79915== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.0.pgprof +==79914== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.1.pgprof mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.?.pgprof . 
tar -cvzf pgprof.poisson2d.Task4.solution.tar.gz poisson2d.solution.Task4.NP2.?.pgprof poisson2d.solution.Task4.NP2.0.pgprof @@ -15116,30 +15116,30 @@ poisson2d.solution.Task4.NP2.1.pgprof <tr> <th>0</th> <td>1</td> - <td>1.3685</td> + <td>1.3714</td> <td>0.96,</td> - <td>96.08%</td> + <td>95.91%</td> </tr> <tr> <th>1</th> <td>2</td> - <td>0.7472</td> - <td>1.78,</td> - <td>88.90%</td> + <td>0.7460</td> + <td>1.76,</td> + <td>88.19%</td> </tr> <tr> <th>2</th> <td>4</td> - <td>0.4605</td> - <td>2.85,</td> - <td>71.27%</td> + <td>0.4706</td> + <td>2.80,</td> + <td>70.05%</td> </tr> <tr> <th>3</th> <td>6</td> - <td>0.3612</td> - <td>3.60,</td> - <td>60.05%</td> + <td>0.3308</td> + <td>3.91,</td> + <td>65.18%</td> </tr> </tbody> </table> @@ -15288,16 +15288,16 @@ main: 142, #pragma acc loop vector(128) /* threadIdx.x */ 146, Generating implicit reduction(max:error) 142, Loop is parallelizable - 152, Generating present(Anew[:],A[:]) + 154, Generating present(Anew[:],A[:]) Generating Tesla code - 155, #pragma acc loop gang /* blockIdx.x */ - 156, #pragma acc loop vector(128) /* threadIdx.x */ - 156, Loop is parallelizable - 190, Generating present(A[:]) + 157, #pragma acc loop gang /* blockIdx.x */ + 158, #pragma acc loop vector(128) /* threadIdx.x */ + 158, Loop is parallelizable + 192, Generating present(A[:]) Generating Tesla code - 193, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ - 203, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)]) - 221, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1]) + 195, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ + 205, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)]) + 223, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1]) </pre> </div> </div> @@ -15330,7 +15330,7 @@ main: <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" ./poisson2d.solution -Job <25213> is submitted to default queue <batch>. +Job <25682> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation @@ -15358,7 +15358,7 @@ Parallel execution. 800, 0.249524 900, 0.249464 Num GPUs: 2. -4096x4096: 1 GPU: 1.3176 s, 2 GPUs: 0.6777 s, speedup: 1.94, efficiency: 97.22% +4096x4096: 1 GPU: 1.3210 s, 2 GPUs: 0.6750 s, speedup: 1.96, efficiency: 97.86% </pre> </div> </div> @@ -15391,12 +15391,12 @@ Num GPUs: 2. <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10 -Job <25214> is submitted to default queue <batch>. +Job <25683> is submitted to default queue <batch>. 
<<Waiting for dispatch ...>> <<Starting on login1>> -==94705== PGPROF is profiling process 94705, command: ./poisson2d.solution 10 -==94707== PGPROF is profiling process 94707, command: ./poisson2d.solution 10 -==94707== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.1.pgprof +==80646== PGPROF is profiling process 80646, command: ./poisson2d.solution 10 +==80644== PGPROF is profiling process 80644, command: ./poisson2d.solution 10 +==80646== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.0.pgprof WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation Jacobi relaxation Calculation: 4096 x 4096 mesh Calculate reference solution and time serial execution. @@ -15404,8 +15404,8 @@ Calculate reference solution and time serial execution. Parallel execution. 0, 0.250000 Num GPUs: 2. -4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0117 s, speedup: 1.92, efficiency: 96.05% -==94705== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.0.pgprof +4096x4096: 1 GPU: 0.0227 s, 2 GPUs: 0.0120 s, speedup: 1.89, efficiency: 94.65% +==80644== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.1.pgprof mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.?.pgprof . tar -cvzf pgprof.poisson2d.Task5.solution.tar.gz poisson2d.solution.Task5.NP2.?.pgprof poisson2d.solution.Task5.NP2.0.pgprof @@ -15505,30 +15505,30 @@ poisson2d.solution.Task5.NP2.1.pgprof <tr> <th>0</th> <td>1</td> - <td>1.2915</td> - <td>1.02,</td> - <td>101.63%</td> + <td>1.3004</td> + <td>1.01,</td> + <td>101.04%</td> </tr> <tr> <th>1</th> <td>2</td> - <td>0.6742</td> - <td>1.96,</td> - <td>98.08%</td> + <td>0.6705</td> + <td>1.95,</td> + <td>97.67%</td> </tr> <tr> <th>2</th> <td>4</td> - <td>0.3801</td> - <td>3.47,</td> - <td>86.66%</td> + <td>0.3879</td> + <td>3.41,</td> + <td>85.14%</td> </tr> <tr> <th>3</th> <td>6</td> - <td>0.2733</td> - <td>4.80,</td> - <td>80.04%</td> + <td>0.2745</td> + <td>4.81,</td> + <td>80.25%</td> </tr> </tbody> </table> @@ -15555,7 +15555,29 @@ poisson2d.solution.Task5.NP2.1.pgprof <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> -<h2 id="Solution-6:-TODO">Solution 6:<a name="solution6" /> TODO<a class="anchor-link" href="#Solution-6:-TODO">¶</a></h2><h4 id="Code">Code<a class="anchor-link" href="#Code">¶</a></h4><ul> +<h2 id="Solution-6:">Solution 6:<a name="solution6" /><a class="anchor-link" href="#Solution-6:">¶</a></h2><p>The most important part here is to get an <code>nvshmem_ptr</code> pointing to the symmetric <code>d_A</code> allocation of your top and bottom neighbor.</p> +<div class="highlight"><pre><span></span><span class="n">real</span> <span class="o">*</span> <span class="kr">restrict</span> <span class="n">d_Atop</span> <span class="o">=</span> <span class="p">(</span><span class="n">real</span> <span class="o">*</span><span class="p">)</span><span class="n">nvshmem_ptr</span><span class="p">(</span><span class="n">d_A</span><span class="p">,</span> <span class="n">top</span><span class="p">);</span> +<span class="n">real</span> <span class="o">*</span> <span class="kr">restrict</span> <span class="n">d_Abottom</span> <span class="o">=</span> <span class="p">(</span><span class="n">real</span> <span class="o">*</span><span 
class="p">)</span><span class="n">nvshmem_ptr</span><span class="p">(</span><span class="n">d_A</span><span class="p">,</span> <span class="n">bottom</span><span class="p">);</span> +</pre></div> +<p>When updating <code>A</code> from Anew make sure to also update <code>A</code> on your top and bottom neighbor if you are at the boundary:</p> +<div class="highlight"><pre><span></span><span class="cp">#pragma acc parallel loop present(A, Anew) deviceptr(d_Atop, d_Abottom) async</span> +<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">iy</span> <span class="o">=</span> <span class="n">iy_start</span><span class="p">;</span> <span class="n">iy</span> <span class="o"><</span> <span class="n">iy_end</span><span class="p">;</span> <span class="n">iy</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> + <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">ix</span> <span class="o">=</span> <span class="n">ix_start</span><span class="p">;</span> <span class="n">ix</span> <span class="o"><</span> <span class="n">ix_end</span><span class="p">;</span> <span class="n">ix</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> + <span class="n">A</span><span class="p">[</span><span class="n">iy</span> <span class="o">*</span> <span class="n">nx</span> <span class="o">+</span> <span class="n">ix</span><span class="p">]</span> <span class="o">=</span> <span class="n">Anew</span><span class="p">[</span><span class="n">iy</span> <span class="o">*</span> <span class="n">nx</span> <span class="o">+</span> <span class="n">ix</span><span class="p">];</span> + <span class="k">if</span><span class="p">(</span><span class="n">iy</span> <span class="o">==</span> <span class="n">iy_start</span><span class="p">){</span><span class="c1">// this also needs to go to the lower halo region of my upper neighbor</span> + <span class="n">d_Atop</span><span class="p">[</span><span class="n">iy_end_top</span> <span class="o">*</span> <span class="n">nx</span> <span class="o">+</span> <span class="n">ix</span><span class="p">]</span> <span class="o">=</span> <span class="n">Anew</span><span class="p">[</span><span class="n">iy</span> <span class="o">*</span> <span class="n">nx</span> <span class="o">+</span> <span class="n">ix</span><span class="p">];</span> + <span class="p">}</span> + <span class="k">if</span><span class="p">(</span><span class="n">iy</span> <span class="o">==</span> <span class="n">iy_end</span> <span class="o">-</span><span class="mi">1</span><span class="p">){</span><span class="c1">// this also needs to go to the upper halo region of my bottom neighbor</span> + <span class="n">d_Abottom</span><span class="p">[(</span><span class="n">iy_start_bottom</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">nx</span> <span class="o">+</span> <span class="n">ix</span><span class="p">]</span> <span class="o">=</span> <span class="n">Anew</span><span class="p">[</span><span class="n">iy</span> <span class="o">*</span> <span class="n">nx</span> <span class="o">+</span> <span class="n">ix</span><span class="p">];</span> + <span class="p">}</span> + <span class="p">}</span> +<span class="p">}</span> +</pre></div> +<p>We can then remove the explicit <code>nvhsmem_put</code> calls on completely. 
But remember to still keep the barrier.</p>
+<div class="highlight"><pre><span></span><span class="n">nvshmemx_barrier_all_on_stream</span><span class="p">((</span><span class="n">cudaStream_t</span><span class="p">)</span><span class="n">acc_get_cuda_stream</span><span class="p">(</span><span class="n">acc_get_default_async</span><span class="p">()));</span>
+</pre></div>
+<h4 id="Code">Code<a class="anchor-link" href="#Code">¶</a></h4><ul>
 <li><a href="./C/task6/poisson2d.solution.c">C Version</a></li>
 </ul>
@@ -15652,23 +15674,23 @@ main:
 110, #pragma acc loop gang /* blockIdx.x */
 111, #pragma acc loop vector(128) /* threadIdx.x */
 111, Loop is parallelizable
- 159, Generating update device(rhs[nx*iy_start:nx*(iy_end-iy_start)],A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
- 160, Generating present(A[:],rhs[:],Anew[:])
+ 158, Generating update device(rhs[nx*iy_start:nx*(iy_end-iy_start)],A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
+ 159, Generating present(A[:],rhs[:],Anew[:])
 Generating Tesla code
- 165, #pragma acc loop gang /* blockIdx.x */
- 166, #pragma acc loop vector(128) /* threadIdx.x */
- 170, Generating implicit reduction(max:error)
- 166, Loop is parallelizable
- 176, Generating present(Anew[:],A[:])
+ 162, #pragma acc loop gang /* blockIdx.x */
+ 163, #pragma acc loop vector(128) /* threadIdx.x */
+ 167, Generating implicit reduction(max:error)
+ 163, Loop is parallelizable
+ 174, Generating present(Anew[:],A[:])
 Generating Tesla code
- 179, #pragma acc loop gang /* blockIdx.x */
- 181, #pragma acc loop vector(128) /* threadIdx.x */
- 181, Loop is parallelizable
- 192, Generating present(A[:])
+ 177, #pragma acc loop gang /* blockIdx.x */
+ 179, #pragma acc loop vector(128) /* threadIdx.x */
+ 179, Loop is parallelizable
+ 190, Generating present(A[:])
 Generating Tesla code
- 195, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
- 205, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
- 224, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])
+ 193, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
+ 203, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])
+ 219, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])
 </pre>
 </div>
 </div>
@@ -15701,7 +15723,7 @@ main:
 <div class="output_subarea output_stream output_stdout output_text">
 <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" ./poisson2d.solution
-Job <25219> is submitted to default queue <batch>.
+Job <25688> is submitted to default queue <batch>.
<<Waiting for dispatch ...>>
<<Starting on login1>>
WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation
@@ -15729,8 +15751,7 @@ Parallel execution.
 800, 0.249524
 900, 0.249464
Num GPUs: 2.
-4096x4096: 1 GPU: 1.3157 s, 2 GPUs: 0.6533 s, speedup: 2.01, efficiency: 100.70% -MPI time: 0.0000 s, inter GPU BW: inf GiB/s +4096x4096: 1 GPU: 1.3196 s, 2 GPUs: 0.6641 s, speedup: 1.99, efficiency: 99.34% </pre> </div> </div> @@ -15763,12 +15784,12 @@ MPI time: 0.0000 s, inter GPU BW: inf GiB/s <div class="output_subarea output_stream output_stdout output_text"> <pre>bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs "-gpu" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10 -Job <25220> is submitted to default queue <batch>. +Job <25689> is submitted to default queue <batch>. <<Waiting for dispatch ...>> <<Starting on login1>> -==95445== PGPROF is profiling process 95445, command: ./poisson2d.solution 10 -==95446== PGPROF is profiling process 95446, command: ./poisson2d.solution 10 -==95445== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.1.pgprof +==81382== PGPROF is profiling process 81382, command: ./poisson2d.solution 10 +==81383== PGPROF is profiling process 81383, command: ./poisson2d.solution 10 +==81382== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.1.pgprof WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation Jacobi relaxation Calculation: 4096 x 4096 mesh Calculate reference solution and time serial execution. @@ -15776,9 +15797,8 @@ Calculate reference solution and time serial execution. Parallel execution. 0, 0.250000 Num GPUs: 2. -4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0116 s, speedup: 1.94, efficiency: 96.85% -MPI time: 0.0000 s, inter GPU BW: inf GiB/s -==95446== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.0.pgprof +4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0118 s, speedup: 1.91, efficiency: 95.50% +==81383== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.0.pgprof mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.?.pgprof . tar -cvzf pgprof.poisson2d.Task6.solution.tar.gz poisson2d.solution.Task6.NP2.?.pgprof poisson2d.solution.Task6.NP2.0.pgprof @@ -15878,30 +15898,30 @@ poisson2d.solution.Task6.NP2.1.pgprof <tr> <th>0</th> <td>1</td> - <td>1.2869</td> - <td>1.02,</td> - <td>102.05%</td> + <td>1.2964</td> + <td>1.01,</td> + <td>101.26%</td> </tr> <tr> <th>1</th> <td>2</td> - <td>0.6574</td> - <td>1.99,</td> - <td>99.26%</td> + <td>0.6714</td> + <td>1.94,</td> + <td>96.87%</td> </tr> <tr> <th>2</th> <td>4</td> - <td>0.3670</td> - <td>3.59,</td> - <td>89.71%</td> + <td>0.3810</td> + <td>3.46,</td> + <td>86.47%</td> </tr> <tr> <th>3</th> <td>6</td> - <td>0.2450</td> - <td>5.37,</td> - <td>89.42%</td> + <td>0.2641</td> + <td>4.87,</td> + <td>81.16%</td> </tr> </tbody> </table> @@ -15917,7 +15937,7 @@ poisson2d.solution.Task6.NP2.1.pgprof <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div><div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> -<p>The missing of device copies can be seen in the profiler, e.g. as shown below.</p> +<p>The missing of device copies can be seen in the profiler, e.g. as shown below. 
There are only kernels running mostly back-to-back, interrupted only by the global reduction.</p>
<p><img src="./resources/Solution6.png" alt="Solution6.png"></p>
<p><a href="#top">Back to Top</a></p>
<hr>
diff --git a/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.ipynb b/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.ipynb
index fccd24cd271883801910fafe3c03c6b0c95732fb..9ea3fd00f467e8ba22083aa3ec6d3e86d5c599fd 100644
--- a/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.ipynb
+++ b/4-GPU/HandsOn/Solution/HandsOnGPUProgramming_Solution.ipynb
@@ -20,7 +20,7 @@
 "\n",
 "**This contains the output for the solutions.**\n",
 "\n",
- "The solutions are described in the solution section. The directory links to the solution source files should work though. For the _html_ and _pdf_ versions please navigate to the corresponding directory to find the solution profiles and sources.\n",
+ "The solutions are described in the solution section. Please navigate to the corresponding directory to find the solution profiles and sources.\n",
 "\n",
 "\n",
 "### GPU Programming\n",
@@ -51,7 +51,7 @@
 "\n",
 "### Survey\n",
 " \n",
- " * [Suvery](#survey) Please remember to take the survey !\n",
+ " * Please remember to take the [survey](#survey)!\n",
 "\n",
 "---\n",
 "---"
@@ -269,7 +269,7 @@
 "output_type": "stream",
 "text": [
 "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d.solution\n",
- "Job <25189> is submitted to default queue <batch>.\n",
+ "Job <25658> is submitted to default queue <batch>.\n",
 "<<Waiting for dispatch ...>>\n",
 "<<Starting on login1>>\n",
 "Jacobi relaxation Calculation: 2048 x 2048 mesh\n",
@@ -285,7 +285,7 @@
 " 200, 0.249522\n",
 " 300, 0.249285\n",
 " 400, 0.249048\n",
- "2048x2048: 1 CPU: 5.4684 s, 1 GPU: 0.1884 s, speedup: 29.02\n"
+ "2048x2048: 1 CPU: 5.4111 s, 1 GPU: 0.1905 s, speedup: 28.40\n"
 ]
 }
 ],
@@ -306,17 +306,17 @@
 "output_type": "stream",
 "text": [
 "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof ./poisson2d.solution 10\n",
- "Job <25190> is submitted to default queue <batch>.\n",
+ "Job <25659> is submitted to default queue <batch>.\n",
 "<<Waiting for dispatch ...>>\n",
 "<<Starting on login1>>\n",
- "==91820== PGPROF is profiling process 91820, command: ./poisson2d.solution 10\n",
- "==91820== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof\n",
+ "==77763== PGPROF is profiling process 77763, command: ./poisson2d.solution 10\n",
+ "==77763== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof\n",
 "Jacobi relaxation Calculation: 2048 x 2048 mesh\n",
 "Calculate reference solution and time serial CPU execution.\n",
 " 0, 0.249999\n",
 "GPU execution.\n",
 " 0, 0.249999\n",
- "2048x2048: 1 CPU: 0.1230 s, 1 GPU: 0.0189 s, speedup: 6.51\n",
+ "2048x2048: 1 CPU: 0.1194 s, 1 GPU: 0.0179 s, speedup: 6.67\n",
 "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.pgprof .\n"
 ]
 }
@@ -456,7 +456,7 @@
 "output_type": "stream",
 "text": [
 "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS ./poisson2d.solution\n",
- "Job <25191> is submitted to default queue <batch>.\n",
+ "Job <25660> is submitted to default queue <batch>.\n",
 "<<Waiting for dispatch ...>>\n",
 "<<Starting on login1>>\n",
 "Jacobi relaxation Calculation: 2048 x 2048 mesh\n",
@@ -472,7 +472,7 @@
 " 200, 0.249522\n",
 " 300, 0.249285\n",
 " 400, 0.249048\n",
- "2048x2048: 1 CPU: 
5.4691 s, 1 GPU: 0.1866 s, speedup: 29.31\n" + "2048x2048: 1 CPU: 5.3929 s, 1 GPU: 0.1903 s, speedup: 28.33\n" ] } ], @@ -493,43 +493,43 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof ./poisson2d.solution 3\n", - "Job <25192> is submitted to default queue <batch>.\n", + "Job <25661> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==92054== PGPROF is profiling process 92054, command: ./poisson2d.solution 3\n", - "==92054== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof\n", + "==77997== PGPROF is profiling process 77997, command: ./poisson2d.solution 3\n", + "==77997== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.timeline.pgprof\n", "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", "Calculate reference solution and time serial CPU execution.\n", " 0, 0.249999\n", "GPU execution.\n", " 0, 0.249999\n", - "2048x2048: 1 CPU: 0.0465 s, 1 GPU: 0.0154 s, speedup: 3.01\n", + "2048x2048: 1 CPU: 0.0437 s, 1 GPU: 0.0164 s, speedup: 2.66\n", "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off --analysis-metrics -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof ./poisson2d.solution 3\n", - "Job <25193> is submitted to default queue <batch>.\n", + "Job <25662> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==71647== PGPROF is profiling process 71647, command: ./poisson2d.solution 3\n", - "==71647== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.\n", - "==71647== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof\n", + "==79400== PGPROF is profiling process 79400, command: ./poisson2d.solution 3\n", + "==79400== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.\n", + "==79400== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.metrics.pgprof\n", "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", "Calculate reference solution and time serial CPU execution.\n", " 0, 0.249999\n", "GPU execution.\n", " 0, 0.249999\n", - "2048x2048: 1 CPU: 0.0476 s, 1 GPU: 12.4561 s, speedup: 0.00\n", + "2048x2048: 1 CPU: 0.0475 s, 1 GPU: 12.3314 s, speedup: 0.00\n", "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS pgprof -f --cpu-profiling off --openmp-profiling off --metrics gld_efficiency,gst_efficiency -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof ./poisson2d.solution 3\n", - "Job <25194> is submitted to default queue <batch>.\n", + "Job <25663> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==92292== PGPROF is profiling process 92292, command: ./poisson2d.solution 3\n", - "==92292== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.\n", - "==92292== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof\n", + "==78235== PGPROF is profiling process 78235, command: ./poisson2d.solution 3\n", + "==78235== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.\n", + "==78235== Generated result file: 
/gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof\n", "Jacobi relaxation Calculation: 2048 x 2048 mesh\n", "Calculate reference solution and time serial CPU execution.\n", " 0, 0.249999\n", "GPU execution.\n", " 0, 0.249999\n", - "2048x2048: 1 CPU: 0.0487 s, 1 GPU: 0.6897 s, speedup: 0.07\n", + "2048x2048: 1 CPU: 0.0483 s, 1 GPU: 0.6638 s, speedup: 0.07\n", "pgprof --csv -i /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.efficiency.pgprof 2>&1 | grep -v \"======\" > poisson2d.solution.efficiency.csv\n", "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.*.pgprof .\n", "tar -cvzf pgprof.poisson2d.Task1.solution.tar.gz poisson2d.solution.*.pgprof\n", @@ -600,9 +600,9 @@ " <td>3</td>\n", " <td>gld_efficiency</td>\n", " <td>Global Memory Load Efficiency</td>\n", - " <td>90.868353%</td>\n", - " <td>90.896134%</td>\n", - " <td>90.881874%</td>\n", + " <td>90.866222%</td>\n", + " <td>91.051373%</td>\n", + " <td>90.962535%</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", @@ -688,9 +688,9 @@ " <td>3</td>\n", " <td>gld_efficiency</td>\n", " <td>Global Memory Load Efficiency</td>\n", - " <td>91.834032%</td>\n", - " <td>91.855433%</td>\n", - " <td>91.843628%</td>\n", + " <td>91.850475%</td>\n", + " <td>91.857005%</td>\n", + " <td>91.854824%</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", @@ -745,7 +745,7 @@ "11 Tesla V100-SXM2-16GB (0) main_111_gpu 3 gst_efficiency \n", "\n", " Metric Description Min Max Avg \n", - "0 Global Memory Load Efficiency 90.868353% 90.896134% 90.881874% \n", + "0 Global Memory Load Efficiency 90.866222% 91.051373% 90.962535% \n", "1 Global Memory Store Efficiency 88.956522% 88.956522% 88.956522% \n", "2 Global Memory Load Efficiency 94.722222% 94.722222% 94.722222% \n", "3 Global Memory Store Efficiency 88.956522% 88.956522% 88.956522% \n", @@ -753,7 +753,7 @@ "5 Global Memory Store Efficiency 25.000000% 25.000000% 25.000000% \n", "6 Global Memory Load Efficiency 0.000000% 0.000000% 0.000000% \n", "7 Global Memory Store Efficiency 100.000000% 100.000000% 100.000000% \n", - "8 Global Memory Load Efficiency 91.834032% 91.855433% 91.843628% \n", + "8 Global Memory Load Efficiency 91.850475% 91.857005% 91.854824% \n", "9 Global Memory Store Efficiency 88.845486% 88.845486% 88.845486% \n", "10 Global Memory Load Efficiency 25.000000% 25.000000% 25.000000% \n", "11 Global Memory Store Efficiency 25.000000% 25.000000% 25.000000% " @@ -950,7 +950,7 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", - "Job <25195> is submitted to default queue <batch>.\n", + "Job <25664> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", @@ -977,8 +977,8 @@ " 800, 0.249524\n", " 900, 0.249464\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 1.3165 s, 2 GPUs: 0.7221 s, speedup: 1.82, efficiency: 91.17%\n", - "MPI time: 0.0422 s, inter GPU BW: 2.89 GiB/s\n" + "4096x4096: 1 GPU: 1.3190 s, 2 GPUs: 0.7096 s, speedup: 1.86, efficiency: 92.94%\n", + "MPI time: 0.0424 s, inter GPU BW: 2.88 GiB/s\n" ] } ], @@ -999,21 +999,21 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o 
/gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", - "Job <25196> is submitted to default queue <batch>.\n", + "Job <25665> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==92521== PGPROF is profiling process 92521, command: ./poisson2d.solution 10\n", - "==92520== PGPROF is profiling process 92520, command: ./poisson2d.solution 10\n", - "==92520== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.1.pgprof\n", + "==78468== PGPROF is profiling process 78468, command: ./poisson2d.solution 10\n", + "==78469== PGPROF is profiling process 78469, command: ./poisson2d.solution 10\n", + "==78469== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.1.pgprof\n", "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", "Calculate reference solution and time serial execution.\n", " 0, 0.250000\n", "Parallel execution.\n", " 0, 0.250000\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 0.0224 s, 2 GPUs: 0.0130 s, speedup: 1.73, efficiency: 86.37%\n", - "MPI time: 0.0007 s, inter GPU BW: 1.75 GiB/s\n", - "==92521== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.0.pgprof\n", + "4096x4096: 1 GPU: 0.0226 s, 2 GPUs: 0.0129 s, speedup: 1.75, efficiency: 87.45%\n", + "MPI time: 0.0007 s, inter GPU BW: 1.70 GiB/s\n", + "==78468== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.0.pgprof\n", "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task2.NP2.?.pgprof .\n", "tar -cvzf pgprof.poisson2d.Task2.solution.tar.gz poisson2d.solution.Task2.NP2.?.pgprof\n", "poisson2d.solution.Task2.NP2.0.pgprof\n", @@ -1089,30 +1089,30 @@ " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", - " <td>1.4201</td>\n", + " <td>1.4053</td>\n", " <td>0.93,</td>\n", - " <td>92.67%</td>\n", + " <td>93.06%</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", - " <td>0.7157</td>\n", + " <td>0.7154</td>\n", " <td>1.83,</td>\n", - " <td>91.44%</td>\n", + " <td>91.56%</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>4</td>\n", - " <td>0.4301</td>\n", - " <td>3.08,</td>\n", - " <td>76.91%</td>\n", + " <td>0.4211</td>\n", + " <td>3.13,</td>\n", + " <td>78.21%</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>6</td>\n", - " <td>0.3037</td>\n", - " <td>4.32,</td>\n", - " <td>71.94%</td>\n", + " <td>0.3121</td>\n", + " <td>4.20,</td>\n", + " <td>70.05%</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", @@ -1120,10 +1120,10 @@ ], "text/plain": [ " GPUs time [s] speedup efficiency\n", - "0 1 1.4201 0.93, 92.67%\n", - "1 2 0.7157 1.83, 91.44%\n", - "2 4 0.4301 3.08, 76.91%\n", - "3 6 0.3037 4.32, 71.94%" + "0 1 1.4053 0.93, 93.06%\n", + "1 2 0.7154 1.83, 91.56%\n", + "2 4 0.4211 3.13, 78.21%\n", + "3 6 0.3121 4.20, 70.05%" ] }, "execution_count": 16, @@ -1323,7 +1323,7 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", - "Job <25201> is submitted to default queue <batch>.\n", + "Job <25670> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", @@ -1350,8 +1350,8 @@ " 800, 0.249524\n", " 900, 0.249464\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 1.3175 s, 2 GPUs: 0.6962 s, speedup: 1.89, efficiency: 94.62%\n", - "MPI 
time: 0.0583 s, inter GPU BW: 2.09 GiB/s\n" + "4096x4096: 1 GPU: 1.3172 s, 2 GPUs: 0.6964 s, speedup: 1.89, efficiency: 94.57%\n", + "MPI time: 0.0561 s, inter GPU BW: 2.17 GiB/s\n" ] } ], @@ -1372,21 +1372,21 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", - "Job <25202> is submitted to default queue <batch>.\n", + "Job <25671> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==93249== PGPROF is profiling process 93249, command: ./poisson2d.solution 10\n", - "==93248== PGPROF is profiling process 93248, command: ./poisson2d.solution 10\n", - "==93249== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.1.pgprof\n", + "==79190== PGPROF is profiling process 79190, command: ./poisson2d.solution 10\n", + "==79192== PGPROF is profiling process 79192, command: ./poisson2d.solution 10\n", + "==79192== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.1.pgprof\n", "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", "Calculate reference solution and time serial execution.\n", " 0, 0.250000\n", "Parallel execution.\n", " 0, 0.250000\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 0.0262 s, 2 GPUs: 0.0127 s, speedup: 2.06, efficiency: 103.02%\n", - "MPI time: 0.0009 s, inter GPU BW: 1.39 GiB/s\n", - "==93248== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.0.pgprof\n", + "4096x4096: 1 GPU: 0.0301 s, 2 GPUs: 0.0126 s, speedup: 2.39, efficiency: 119.53%\n", + "MPI time: 0.0009 s, inter GPU BW: 1.34 GiB/s\n", + "==79190== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.0.pgprof\n", "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task3.NP2.?.pgprof .\n", "tar -cvzf pgprof.poisson2d.Task3.solution.tar.gz poisson2d.solution.Task3.NP2.?.pgprof\n", "poisson2d.solution.Task3.NP2.0.pgprof\n", @@ -1462,30 +1462,30 @@ " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", - " <td>1.3935</td>\n", - " <td>0.94,</td>\n", - " <td>93.86%</td>\n", + " <td>1.3815</td>\n", + " <td>0.95,</td>\n", + " <td>94.79%</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", - " <td>0.6910</td>\n", - " <td>1.89,</td>\n", - " <td>94.52%</td>\n", + " <td>0.6968</td>\n", + " <td>1.90,</td>\n", + " <td>94.91%</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>4</td>\n", - " <td>0.3920</td>\n", - " <td>3.37,</td>\n", - " <td>84.13%</td>\n", + " <td>0.3990</td>\n", + " <td>3.30,</td>\n", + " <td>82.56%</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>6</td>\n", - " <td>0.2841</td>\n", - " <td>4.58,</td>\n", - " <td>76.29%</td>\n", + " <td>0.2720</td>\n", + " <td>4.81,</td>\n", + " <td>80.18%</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", @@ -1493,10 +1493,10 @@ ], "text/plain": [ " GPUs time [s] speedup efficiency\n", - "0 1 1.3935 0.94, 93.86%\n", - "1 2 0.6910 1.89, 94.52%\n", - "2 4 0.3920 3.37, 84.13%\n", - "3 6 0.2841 4.58, 76.29%" + "0 1 1.3815 0.95, 94.79%\n", + "1 2 0.6968 1.90, 94.91%\n", + "2 4 0.3990 3.30, 82.56%\n", + "3 6 0.2720 4.81, 80.18%" ] }, "execution_count": 21, @@ -1543,13 +1543,14 @@ "## Solution 4:<a name=\"solution4\"></a>\n", "\n", "\n", - "Include NVSHMEM headers\n", 
+ "First, include NVSHMEM headers\n", "\n", "```C\n", "#include <nvshmem.h>\n", "#include <nvshmemx.h>\n", "```\n", - "and initalize NVSHMEM with MPI\n", + "\n", + "and initialize NVSHMEM with MPI\n", "```C\n", "MPI_Comm mpi_comm = MPI_COMM_WORLD;\n", "nvshmemx_init_attr_t attr;\n", @@ -1666,7 +1667,7 @@ " 65, Generating Tesla code\n", " 67, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", " 77, Generating update self(Aref[:ny*nx])\n", - "mpicxx -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned -I/gpfs/wolf/trn003/world-shared/software/nvshmem//include poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution -L/gpfs/wolf/trn003/world-shared/software/nvshmem//lib -lnvshmem -Mcuda -lcuda -lrt \n", + "mpicxx -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,pinned -I/gpfs/wolf/trn003/world-shared/software/nvshmem/include poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution -L/gpfs/wolf/trn003/world-shared/software/nvshmem/lib -lnvshmem -Mcuda -lcuda -lrt \n", "poisson2d.solution.c:\n", "main:\n", " 90, Generating enter data create(Aref[:ny*nx],rhs[:ny*nx],A[:ny*nx],Anew[:ny*nx])\n", @@ -1712,7 +1713,7 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", - "Job <25207> is submitted to default queue <batch>.\n", + "Job <25676> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", @@ -1740,8 +1741,8 @@ " 800, 0.249524\n", " 900, 0.249464\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 1.3171 s, 2 GPUs: 0.7377 s, speedup: 1.79, efficiency: 89.27%\n", - "MPI time: 0.0686 s, inter GPU BW: 1.78 GiB/s\n" + "4096x4096: 1 GPU: 1.3188 s, 2 GPUs: 0.7398 s, speedup: 1.78, efficiency: 89.13%\n", + "MPI time: 0.0644 s, inter GPU BW: 1.90 GiB/s\n" ] } ], @@ -1762,12 +1763,11 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", - "Job <25208> is submitted to default queue <batch>.\n", + "Job <25677> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==93971== PGPROF is profiling process 93971, command: ./poisson2d.solution 10\n", - "==93970== PGPROF is profiling process 93970, command: ./poisson2d.solution 10\n", - "==93971== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.0.pgprof\n", + "==79915== PGPROF is profiling process 79915, command: ./poisson2d.solution 10\n", + "==79914== PGPROF is profiling process 79914, command: ./poisson2d.solution 10\n", "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", "Calculate reference solution and time serial execution.\n", @@ -1775,9 +1775,10 @@ "Parallel execution.\n", " 0, 0.250000\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0132 s, speedup: 1.71, efficiency: 85.34%\n", - "MPI time: 0.0010 s, inter GPU BW: 1.24 GiB/s\n", - "==93970== Generated result file: 
/gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.1.pgprof\n", + "4096x4096: 1 GPU: 0.0226 s, 2 GPUs: 0.0131 s, speedup: 1.72, efficiency: 86.13%\n", + "MPI time: 0.0010 s, inter GPU BW: 1.27 GiB/s\n", + "==79915== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.0.pgprof\n", + "==79914== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.1.pgprof\n", "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task4.NP2.?.pgprof .\n", "tar -cvzf pgprof.poisson2d.Task4.solution.tar.gz poisson2d.solution.Task4.NP2.?.pgprof\n", "poisson2d.solution.Task4.NP2.0.pgprof\n", @@ -1853,30 +1854,30 @@ " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", - " <td>1.3685</td>\n", + " <td>1.3714</td>\n", " <td>0.96,</td>\n", - " <td>96.08%</td>\n", + " <td>95.91%</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", - " <td>0.7472</td>\n", - " <td>1.78,</td>\n", - " <td>88.90%</td>\n", + " <td>0.7460</td>\n", + " <td>1.76,</td>\n", + " <td>88.19%</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>4</td>\n", - " <td>0.4605</td>\n", - " <td>2.85,</td>\n", - " <td>71.27%</td>\n", + " <td>0.4706</td>\n", + " <td>2.80,</td>\n", + " <td>70.05%</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>6</td>\n", - " <td>0.3612</td>\n", - " <td>3.60,</td>\n", - " <td>60.05%</td>\n", + " <td>0.3308</td>\n", + " <td>3.91,</td>\n", + " <td>65.18%</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", @@ -1884,10 +1885,10 @@ ], "text/plain": [ " GPUs time [s] speedup efficiency\n", - "0 1 1.3685 0.96, 96.08%\n", - "1 2 0.7472 1.78, 88.90%\n", - "2 4 0.4605 2.85, 71.27%\n", - "3 6 0.3612 3.60, 60.05%" + "0 1 1.3714 0.96, 95.91%\n", + "1 2 0.7460 1.76, 88.19%\n", + "2 4 0.4706 2.80, 70.05%\n", + "3 6 0.3308 3.91, 65.18%" ] }, "execution_count": 26, @@ -2036,16 +2037,16 @@ " 142, #pragma acc loop vector(128) /* threadIdx.x */\n", " 146, Generating implicit reduction(max:error)\n", " 142, Loop is parallelizable\n", - " 152, Generating present(Anew[:],A[:])\n", + " 154, Generating present(Anew[:],A[:])\n", " Generating Tesla code\n", - " 155, #pragma acc loop gang /* blockIdx.x */\n", - " 156, #pragma acc loop vector(128) /* threadIdx.x */\n", - " 156, Loop is parallelizable\n", - " 190, Generating present(A[:])\n", + " 157, #pragma acc loop gang /* blockIdx.x */\n", + " 158, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 158, Loop is parallelizable\n", + " 192, Generating present(A[:])\n", " Generating Tesla code\n", - " 193, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", - " 203, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", - " 221, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" + " 195, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 205, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 223, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" ] } ], @@ -2066,7 +2067,7 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" ./poisson2d.solution\n", - "Job <25213> is submitted to default queue <batch>.\n", + "Job <25682> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", @@ -2094,7 +2095,7 @@ " 800, 0.249524\n", " 
900, 0.249464\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 1.3176 s, 2 GPUs: 0.6777 s, speedup: 1.94, efficiency: 97.22%\n" + "4096x4096: 1 GPU: 1.3210 s, 2 GPUs: 0.6750 s, speedup: 1.96, efficiency: 97.86%\n" ] } ], @@ -2115,12 +2116,12 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", - "Job <25214> is submitted to default queue <batch>.\n", + "Job <25683> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==94705== PGPROF is profiling process 94705, command: ./poisson2d.solution 10\n", - "==94707== PGPROF is profiling process 94707, command: ./poisson2d.solution 10\n", - "==94707== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.1.pgprof\n", + "==80646== PGPROF is profiling process 80646, command: ./poisson2d.solution 10\n", + "==80644== PGPROF is profiling process 80644, command: ./poisson2d.solution 10\n", + "==80646== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.0.pgprof\n", "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", "Calculate reference solution and time serial execution.\n", @@ -2128,8 +2129,8 @@ "Parallel execution.\n", " 0, 0.250000\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0117 s, speedup: 1.92, efficiency: 96.05%\n", - "==94705== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.0.pgprof\n", + "4096x4096: 1 GPU: 0.0227 s, 2 GPUs: 0.0120 s, speedup: 1.89, efficiency: 94.65%\n", + "==80644== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.1.pgprof\n", "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task5.NP2.?.pgprof .\n", "tar -cvzf pgprof.poisson2d.Task5.solution.tar.gz poisson2d.solution.Task5.NP2.?.pgprof\n", "poisson2d.solution.Task5.NP2.0.pgprof\n", @@ -2205,30 +2206,30 @@ " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", - " <td>1.2915</td>\n", - " <td>1.02,</td>\n", - " <td>101.63%</td>\n", + " <td>1.3004</td>\n", + " <td>1.01,</td>\n", + " <td>101.04%</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", - " <td>0.6742</td>\n", - " <td>1.96,</td>\n", - " <td>98.08%</td>\n", + " <td>0.6705</td>\n", + " <td>1.95,</td>\n", + " <td>97.67%</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>4</td>\n", - " <td>0.3801</td>\n", - " <td>3.47,</td>\n", - " <td>86.66%</td>\n", + " <td>0.3879</td>\n", + " <td>3.41,</td>\n", + " <td>85.14%</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>6</td>\n", - " <td>0.2733</td>\n", - " <td>4.80,</td>\n", - " <td>80.04%</td>\n", + " <td>0.2745</td>\n", + " <td>4.81,</td>\n", + " <td>80.25%</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", @@ -2236,10 +2237,10 @@ ], "text/plain": [ " GPUs time [s] speedup efficiency\n", - "0 1 1.2915 1.02, 101.63%\n", - "1 2 0.6742 1.96, 98.08%\n", - "2 4 0.3801 3.47, 86.66%\n", - "3 6 0.2733 4.80, 80.04%" + "0 1 1.3004 1.01, 101.04%\n", + "1 2 0.6705 1.95, 97.67%\n", + "2 4 0.3879 3.41, 85.14%\n", + "3 6 0.2745 4.81, 80.25%" ] }, "execution_count": 31, @@ -2282,7 +2283,35 @@ "exercise": "solution" }, 
"source": [ - "## Solution 6:<a name=\"solution6\"></a> TODO\n", + "## Solution 6:<a name=\"solution6\"></a>\n", + "\n", + "\n", + "The most important part here is to get an `nvshmem_ptr` pointing to the symmetric `d_A` allocation of your top and bottom neighbor.\n", + "```C\n", + "real * restrict d_Atop = (real *)nvshmem_ptr(d_A, top);\n", + "real * restrict d_Abottom = (real *)nvshmem_ptr(d_A, bottom);\n", + "```\n", + "\n", + "When updating `A` from Anew make sure to also update `A` on your top and bottom neighbor if you are at the boundary:\n", + "```C\n", + "#pragma acc parallel loop present(A, Anew) deviceptr(d_Atop, d_Abottom) async\n", + "for (int iy = iy_start; iy < iy_end; iy++) {\n", + " for (int ix = ix_start; ix < ix_end; ix++) {\n", + " A[iy * nx + ix] = Anew[iy * nx + ix];\n", + " if(iy == iy_start){// this also needs to go to the lower halo region of my upper neighbor\n", + " d_Atop[iy_end_top * nx + ix] = Anew[iy * nx + ix];\n", + " }\n", + " if(iy == iy_end -1){// this also needs to go to the upper halo region of my bottom neighbor\n", + " d_Abottom[(iy_start_bottom - 1) * nx + ix] = Anew[iy * nx + ix];\n", + " }\n", + " }\n", + "}\n", + "```\n", + "\n", + "We can then remove the explicit `nvhsmem_put` calls on completely. But remember to still keep the barrier.\n", + "```C\n", + "nvshmemx_barrier_all_on_stream((cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));\n", + "````\n", "\n", "\n", "\n", @@ -2361,23 +2390,23 @@ " 110, #pragma acc loop gang /* blockIdx.x */\n", " 111, #pragma acc loop vector(128) /* threadIdx.x */\n", " 111, Loop is parallelizable\n", - " 159, Generating update device(rhs[nx*iy_start:nx*(iy_end-iy_start)],A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", - " 160, Generating present(A[:],rhs[:],Anew[:])\n", + " 158, Generating update device(rhs[nx*iy_start:nx*(iy_end-iy_start)],A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 159, Generating present(A[:],rhs[:],Anew[:])\n", " Generating Tesla code\n", - " 165, #pragma acc loop gang /* blockIdx.x */\n", - " 166, #pragma acc loop vector(128) /* threadIdx.x */\n", - " 170, Generating implicit reduction(max:error)\n", - " 166, Loop is parallelizable\n", - " 176, Generating present(Anew[:],A[:])\n", + " 162, #pragma acc loop gang /* blockIdx.x */\n", + " 163, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 167, Generating implicit reduction(max:error)\n", + " 163, Loop is parallelizable\n", + " 174, Generating present(Anew[:],A[:])\n", " Generating Tesla code\n", - " 179, #pragma acc loop gang /* blockIdx.x */\n", - " 181, #pragma acc loop vector(128) /* threadIdx.x */\n", - " 181, Loop is parallelizable\n", - " 192, Generating present(A[:])\n", + " 177, #pragma acc loop gang /* blockIdx.x */\n", + " 179, #pragma acc loop vector(128) /* threadIdx.x */\n", + " 179, Loop is parallelizable\n", + " 190, Generating present(A[:])\n", " Generating Tesla code\n", - " 195, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", - " 205, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", - " 224, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" + " 193, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */\n", + " 203, Generating update self(A[nx*(iy_start-1):nx*((iy_end-iy_start)+2)])\n", + " 219, Generating exit data delete(rhs[:1],Aref[:1],A[:1],Anew[:1])\n" ] } ], @@ -2398,7 +2427,7 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs 
\"-gpu\" ./poisson2d.solution\n", - "Job <25219> is submitted to default queue <batch>.\n", + "Job <25688> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", @@ -2426,8 +2455,7 @@ " 800, 0.249524\n", " 900, 0.249464\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 1.3157 s, 2 GPUs: 0.6533 s, speedup: 2.01, efficiency: 100.70%\n", - "MPI time: 0.0000 s, inter GPU BW: inf GiB/s\n" + "4096x4096: 1 GPU: 1.3196 s, 2 GPUs: 0.6641 s, speedup: 1.99, efficiency: 99.34%\n" ] } ], @@ -2448,12 +2476,12 @@ "output_type": "stream", "text": [ "bsub -W 60 -nnodes 1 -Is -P TRN003 jsrun -n 1 -c 1 -g ALL_GPUS -a 2 -c ALL_CPUS -d cyclic -b packed:7 --smpiargs \"-gpu\" pgprof -f --cpu-profiling off --openmp-profiling off --annotate-mpi openmpi -o /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.%q{OMPI_COMM_WORLD_RANK}.pgprof ./poisson2d.solution 10\n", - "Job <25220> is submitted to default queue <batch>.\n", + "Job <25689> is submitted to default queue <batch>.\n", "<<Waiting for dispatch ...>>\n", "<<Starting on login1>>\n", - "==95445== PGPROF is profiling process 95445, command: ./poisson2d.solution 10\n", - "==95446== PGPROF is profiling process 95446, command: ./poisson2d.solution 10\n", - "==95445== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.1.pgprof\n", + "==81382== PGPROF is profiling process 81382, command: ./poisson2d.solution 10\n", + "==81383== PGPROF is profiling process 81383, command: ./poisson2d.solution 10\n", + "==81382== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.1.pgprof\n", "WARN: IB HCA and GPU are not connected to a PCIe switch so IB performance can be limited depending on the CPU generation \n", "Jacobi relaxation Calculation: 4096 x 4096 mesh\n", "Calculate reference solution and time serial execution.\n", @@ -2461,9 +2489,8 @@ "Parallel execution.\n", " 0, 0.250000\n", "Num GPUs: 2.\n", - "4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0116 s, speedup: 1.94, efficiency: 96.85%\n", - "MPI time: 0.0000 s, inter GPU BW: inf GiB/s\n", - "==95446== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.0.pgprof\n", + "4096x4096: 1 GPU: 0.0225 s, 2 GPUs: 0.0118 s, speedup: 1.91, efficiency: 95.50%\n", + "==81383== Generated result file: /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.0.pgprof\n", "mv /gpfs/wolf/trn003/scratch/mathiasw//poisson2d.solution.Task6.NP2.?.pgprof .\n", "tar -cvzf pgprof.poisson2d.Task6.solution.tar.gz poisson2d.solution.Task6.NP2.?.pgprof\n", "poisson2d.solution.Task6.NP2.0.pgprof\n", @@ -2539,30 +2566,30 @@ " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", - " <td>1.2869</td>\n", - " <td>1.02,</td>\n", - " <td>102.05%</td>\n", + " <td>1.2964</td>\n", + " <td>1.01,</td>\n", + " <td>101.26%</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", - " <td>0.6574</td>\n", - " <td>1.99,</td>\n", - " <td>99.26%</td>\n", + " <td>0.6714</td>\n", + " <td>1.94,</td>\n", + " <td>96.87%</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>4</td>\n", - " <td>0.3670</td>\n", - " <td>3.59,</td>\n", - " <td>89.71%</td>\n", + " <td>0.3810</td>\n", + " <td>3.46,</td>\n", + " <td>86.47%</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>6</td>\n", - " <td>0.2450</td>\n", - " <td>5.37,</td>\n", - " <td>89.42%</td>\n", + " <td>0.2641</td>\n", + " 
<td>4.87,</td>\n",
+    "      <td>81.16%</td>\n",
+    "    </tr>\n",
     "  </tbody>\n",
     "</table>\n",
@@ -2570,10 +2597,10 @@
 ],
 "text/plain": [
 "   GPUs  time [s]  speedup  efficiency\n",
-    "0     1    1.2869    1.02,     102.05%\n",
-    "1     2    0.6574    1.99,      99.26%\n",
-    "2     4    0.3670    3.59,      89.71%\n",
-    "3     6    0.2450    5.37,      89.42%"
+    "0     1    1.2964    1.01,     101.26%\n",
+    "1     2    0.6714    1.94,      96.87%\n",
+    "2     4    0.3810    3.46,      86.47%\n",
+    "3     6    0.2641    4.87,      81.16%"
 ]
 },
 "execution_count": 36,
@@ -2601,7 +2628,7 @@
 "exercise": "solution"
 },
 "source": [
-    "The missing of device copies can be seen in the profiler, e.g. as shown below.\n",
+    "The absence of device copies can be seen in the profiler, e.g. as shown below. Only kernels remain, running mostly back-to-back and interrupted only by the global reduction.\n",
    "\n",
    "\n",
    "\n",
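+    "For completeness, below is a minimal, self-contained sketch of the NVSHMEM lifecycle this solution relies on: bootstrapping NVSHMEM on top of MPI, allocating symmetric memory, obtaining a direct load/store pointer to a neighbor, and synchronizing on a CUDA stream. Names such as `top_pe` and the buffer size are illustrative only, not taken from the solution; the actual implementation is in `poisson2d.solution.c`.\n",
+    "```C\n",
+    "#include <mpi.h>\n",
+    "#include <cuda_runtime.h>\n",
+    "#include <nvshmem.h>\n",
+    "#include <nvshmemx.h>\n",
+    "\n",
+    "int main(int argc, char **argv)\n",
+    "{\n",
+    "    MPI_Init(&argc, &argv);\n",
+    "\n",
+    "    /* Bootstrap NVSHMEM on top of the existing MPI communicator. */\n",
+    "    MPI_Comm mpi_comm = MPI_COMM_WORLD;\n",
+    "    nvshmemx_init_attr_t attr;\n",
+    "    attr.mpi_comm = &mpi_comm;\n",
+    "    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);\n",
+    "\n",
+    "    int mype = nvshmem_my_pe();\n",
+    "    int npes = nvshmem_n_pes();\n",
+    "\n",
+    "    /* Symmetric allocation: every PE allocates a buffer of the same size. */\n",
+    "    double *d_A = (double *)nvshmem_malloc(4096 * 4096 * sizeof(double));\n",
+    "\n",
+    "    /* Direct load/store pointer to the neighbor's copy of d_A; valid when\n",
+    "       the peer GPU is reachable via NVLink/PCIe, otherwise NULL. */\n",
+    "    int top_pe = (mype + npes - 1) % npes;\n",
+    "    double *d_A_top = (double *)nvshmem_ptr(d_A, top_pe);\n",
+    "    /* ... kernels can now store halo rows straight into d_A_top ... */\n",
+    "\n",
+    "    /* Synchronize all PEs on the (default) CUDA stream, then drain it. */\n",
+    "    nvshmemx_barrier_all_on_stream(0);\n",
+    "    cudaStreamSynchronize(0);\n",
+    "\n",
+    "    nvshmem_free(d_A);\n",
+    "    nvshmem_finalize();\n",
+    "    MPI_Finalize();\n",
+    "    return 0;\n",
+    "}\n",
+    "```\n",
+    "\n",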