Supercomputing 2019 Tutorial "Application Porting and Optimization on GPU-Accelerated POWER Architectures", November 18th 2019
This notebook contains the output for the solutions.
The solutions are described in the solution section. Please navigate to the corresponding directory to find the solution profiles and sources.
-ta=tesla:managed
Please select your language choice (C or FORTRAN) below by making sure your choice is uncommented and the other language is commented out. Then execute the cell by pressing Shift+Enter.
# select language here
LANGUAGE='C'
#LANGUAGE='FORTRAN'
## You should not touch the remaining code in the cell
import os.path
import pandas
try: rootdir
except NameError: rootdir = None
if(not rootdir):
    rootdir=%pwd
basedir=os.path.join(rootdir,LANGUAGE)
basedirC=os.path.join(rootdir,'C')
print ("You selected {} for the exercises.".format(LANGUAGE))
def checkdir(dir):
    d=%pwd
    assert(d.endswith(dir) or d.endswith(dir+'p') or d.endswith(dir+'m')), "Please make sure to cd to the right directory first."
def cleanall():
    # clean up everything -- use with care
    for t in range(7):
        d='%s/task%i'%(basedir,t)
        %cd $d
        !make clean
#cleanall()
%cd $basedir/task0
Below are suggested solutions. This is only a short description of each solution; the poisson2d.solution.(c|F03) files linked below contain the full source code. If you want to run or profile the solutions, feel free to duplicate the cells for the tasks and change the make target to the *.solution ones.
#pragma acc parallel loop
for (int ix = ix_start; ix < ix_end; ix++)
{
    #pragma acc loop
    for( int iy = iy_start; iy < iy_end; iy++ )
    {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                                  + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
        error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
    }
}
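Note that error has to be reduced with a max operation across all loop iterations. The PGI compiler usually detects this implicitly (see the -Minfo=accel output), but it can also be stated explicitly. A hedged variant of the same loop nest with the reduction spelled out (not necessarily how the solution source writes it):

#pragma acc parallel loop reduction(max:error)
for (int ix = ix_start; ix < ix_end; ix++)
{
    #pragma acc loop reduction(max:error)
    for( int iy = iy_start; iy < iy_end; iy++ )
    {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                                  + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
        error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
    }
}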
You can compile, run and profile the solution with the next cells. After the profiling has finished, the output file poisson2d.solution.pgprof can be downloaded from here: C Version / Fortran Version.
%cd $basedir/task0
checkdir('task0')
!make poisson2d.solution
checkdir('task0')
!make run.solution
checkdir('task0')
!make profile.solution
Swap the ix and iy loops to make sure that ix is the fastest running index, so that consecutive threads access consecutive memory locations and loads/stores are coalesced.
#pragma acc parallel loop
for (int iy = iy_start; iy < iy_end; iy++)
{
    for( int ix = ix_start; ix < ix_end; ix++ )
    {
        Anew[iy*nx+ix] = -0.25 * (rhs[iy*nx+ix] - ( A[iy*nx+ix+1] + A[iy*nx+ix-1]
                                                  + A[(iy-1)*nx+ix] + A[(iy+1)*nx+ix] ));
        error = fmaxr( error, fabsr(Anew[iy*nx+ix]-A[iy*nx+ix]));
    }
}
You can compile, run and profile the solution with the next cells. After the profiling has finished, the output file poisson2d.solution.pgprof can be downloaded from here: C Version / Fortran Version.
%cd $basedir/task1
checkdir('task1')
!make poisson2d.solution
checkdir('task1')
!make run.solution
checkdir('task1')
!make profile.solution
For the Global Memory Load/Store Efficiency, the make profile command also generated a CSV file that you can import and view with the cell below. If you are working purely in a terminal, you can view the same output by running pgprof -i poisson2d.efficiency.solution.pgprof.
data_frame_solution = pandas.read_csv('poisson2d.solution.efficiency.csv', sep=',')
data_frame_solution
Set the GPU used by the rank using #pragma acc set device_num
//Initialize MPI and determine rank and size
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
#pragma acc set device_num( rank )
real* restrict const A = (real*) malloc(nx*ny*sizeof(real));
real* restrict const Aref = (real*) malloc(nx*ny*sizeof(real));
real* restrict const Anew = (real*) malloc(nx*ny*sizeof(real));
real* restrict const rhs = (real*) malloc(nx*ny*sizeof(real));
Apply domain decomposition
// Ensure correctness if ny%size != 0
int chunk_size = ceil( (1.0*ny)/size );
int iy_start = rank * chunk_size;
int iy_end = iy_start + chunk_size;
// Do not process boundaries
iy_start = max( iy_start, 1 );
iy_end = min( iy_end, ny - 1 );
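As a purely illustrative example of the clipping (the actual grid size in the exercise may differ): if ny were 2048 and size were 6, chunk_size = ceil(2048/6) = 342, so rank 0 would process iy = 1 ... 341 (its iy_start clipped from 0 to 1) and rank 5 would process iy = 1710 ... 2046 (its iy_end clipped from 2052 to ny-1 = 2047).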
Exchange data
//Periodic boundary conditions
int top    = (rank == 0) ? (size-1) : rank-1;
int bottom = (rank == (size-1)) ? 0 : rank+1;
#pragma acc host_data use_device( A )
{
    double start_mpi = MPI_Wtime();
    //1. Send row iy_start (first modified row) to top, receive lower boundary (iy_end) from bottom
    MPI_Sendrecv( A+iy_start*nx+ix_start,     (ix_end-ix_start), MPI_REAL_TYPE, top   , 0,
                  A+iy_end*nx+ix_start,       (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    //2. Send row (iy_end-1) (last modified row) to bottom, receive upper boundary (iy_start-1) from top
    MPI_Sendrecv( A+(iy_end-1)*nx+ix_start,   (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,
                  A+(iy_start-1)*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, top   , 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    mpi_time += MPI_Wtime() - start_mpi;
}
You can compile, run and profile the solution with the next cells. After the profiling has completed, download the tarball containing the profiles (pgprof.Task2.solution.poisson2d.tar.gz) with the File Browser.
Then you can import them into pgprof / nvvp using the Import option in the File menu. Remember to use the Multiple processes option in the import wizard.
%cd $basedir/task2
checkdir('task2')
!make poisson2d.solution
checkdir('task2')
!NP=2 make run.solution
checkdir('task2')
!NP=2 make profile.solution
You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell.
checkdir('task2')
!NP=1 make run.solution | grep speedup > scale.out
!NP=2 make run.solution | grep speedup >> scale.out
!NP=4 make run.solution | grep speedup >> scale.out
!NP=6 make run.solution | grep speedup >> scale.out
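# Assuming the output format of the run target: columns 5, 7, 10 and 12 of the
# 'speedup' line hold the GPU count, runtime, speedup and parallel efficiency
# (matching the renamed columns below).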
data_frameS2 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)
!rm scale.out
data_frameS2b=data_frameS2.iloc[:,[5,7,10,12]].copy()
data_frameS2b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})
Update the boundaries first.
#pragma acc parallel loop present(A,Anew)
for( int ix = ix_start; ix < ix_end; ix++ )
{
    A[(iy_start)*nx+ix] = Anew[(iy_start)*nx+ix];
    A[(iy_end-1)*nx+ix] = Anew[(iy_end-1)*nx+ix];
}
Start the interior loop asynchronously so that it can overlap with the MPI communication, and wait for its completion at the end.
#pragma acc parallel loop present(A,Anew) async
for (int iy = iy_start+1; iy < iy_end-1; iy++)
{
    for( int ix = ix_start; ix < ix_end; ix++ )
    {
        A[iy*nx+ix] = Anew[iy*nx+ix];
    }
}
//Periodic boundary conditions
int top    = (rank == 0) ? (size-1) : rank-1;
int bottom = (rank == (size-1)) ? 0 : rank+1;
#pragma acc host_data use_device( A )
{
    double start_mpi = MPI_Wtime();
    //1. Send row iy_start (first modified row) to top, receive lower boundary (iy_end) from bottom
    MPI_Sendrecv( A+iy_start*nx+ix_start,     (ix_end-ix_start), MPI_REAL_TYPE, top   , 0,
                  A+iy_end*nx+ix_start,       (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    //2. Send row (iy_end-1) (last modified row) to bottom, receive upper boundary (iy_start-1) from top
    MPI_Sendrecv( A+(iy_end-1)*nx+ix_start,   (ix_end-ix_start), MPI_REAL_TYPE, bottom, 0,
                  A+(iy_start-1)*nx+ix_start, (ix_end-ix_start), MPI_REAL_TYPE, top   , 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    mpi_time += MPI_Wtime() - start_mpi;
}
#pragma acc wait
You can compile, run and profile the solution with the next cells. After the profiling has completed, download the tarball containing the profiles (pgprof.Task2.solution.poisson2d.tar.gz) with the File Browser.
Then you can import them into pgprof / nvvp using the Import option in the File menu. Remember to use the Multiple processes option in the import wizard.
%cd $basedir/task3
checkdir('task3')
!make poisson2d.solution
checkdir('task3')
!NP=2 make run.solution
checkdir('task3')
!NP=2 make profile.solution
You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell.
checkdir('task3')
!NP=1 make run.solution | grep speedup > scale.out
!NP=2 make run.solution | grep speedup >> scale.out
!NP=4 make run.solution | grep speedup >> scale.out
!NP=6 make run.solution | grep speedup >> scale.out
data_frameS3 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)
!rm scale.out
data_frameS3b=data_frameS3.iloc[:,[5,7,10,12]].copy()
data_frameS3b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})
The overlap of compute and communication can be seen in the profiler, e.g. as shown below.
First, include NVSHMEM headers
#include <nvshmem.h>
#include <nvshmemx.h>
and initialize NVSHMEM with MPI
MPI_Comm mpi_comm = MPI_COMM_WORLD;
nvshmemx_init_attr_t attr;
attr.mpi_comm = &mpi_comm;
nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);
Allocate device memory and map it to the host allocation for OpenACC
real *d_A = (real *)nvshmem_malloc(nx * ny * sizeof(real));
map(A, d_A, nx * ny * sizeof(real));
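The map() call is a small helper from the tutorial sources rather than a standard API. A minimal sketch of what it might look like, assuming it simply wraps the OpenACC acc_map_data runtime routine (the actual helper may differ):

#include <openacc.h>

// Hypothetical helper: associate the existing host allocation with the
// NVSHMEM symmetric device buffer so that OpenACC clauses such as present(A)
// resolve to d_A instead of creating a separate device copy.
static void map(void *host_ptr, void *device_ptr, size_t bytes)
{
    acc_map_data(host_ptr, device_ptr, bytes);
}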
Calculate the right locations on the remote GPUs and communicate data
// Periodic boundary conditions
int top = (rank == 0) ? (size - 1) : rank - 1;
int bottom = (rank == (size - 1)) ? 0 : rank + 1;
int iy_start_top = top * chunk_size;
int iy_end_top = iy_start_top + chunk_size;
// Do not process boundaries
iy_start_top = max(iy_start_top, 1);
iy_end_top = min(iy_end_top, ny - 1);
int iy_start_bottom = bottom * chunk_size;
int iy_end_bottom = iy_start_bottom + chunk_size;
// Do not process boundaries
iy_start_bottom = max(iy_start_bottom, 1);
iy_end_bottom = min(iy_end_bottom, ny - 1);
// Halo exchange
#pragma acc host_data use_device(A)
{
    double start_mpi = MPI_Wtime();
    nvshmem_double_put((double *)(A + iy_end_top * nx + ix_start),
                       (double *)(A + iy_start * nx + ix_start), (ix_end - ix_start), top);
    nvshmem_double_put((double *)(A + (iy_start_bottom - 1) * nx + ix_start),
                       (double *)(A + (iy_end - 1) * nx + ix_start), (ix_end - ix_start),
                       bottom);
    nvshmem_barrier_all();
    mpi_time += MPI_Wtime() - start_mpi;
}
Finally, remember to deallocate:
nvshmem_free(d_A);
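If map() wraps acc_map_data as sketched above, the host pointer would typically also be unmapped before the symmetric buffer is freed, e.g.:

// Hypothetical, assuming the acc_map_data-based helper above
acc_unmap_data(A);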
You can compile, run and profile the solution with the next cells. After the profiling has completed, download the tarball containing the profiles (pgprof.Task4.solution.poisson2d.tar.gz) with the File Browser.
Then you can import them into pgprof / nvvp using the Import option in the File menu. Remember to use the Multiple processes option in the import wizard.
%cd $basedir/task4
checkdir('task4')
!make poisson2d.solution
checkdir('task4')
!NP=2 make run.solution
checkdir('task4')
!NP=2 make profile.solution
You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell.
checkdir('task4')
!NP=1 make run.solution | grep speedup > scale.out
!NP=2 make run.solution | grep speedup >> scale.out
!NP=4 make run.solution | grep speedup >> scale.out
!NP=6 make run.solution | grep speedup >> scale.out
data_frameS4 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)
!rm scale.out
data_frameS4b=data_frameS4.iloc[:,[5,7,10,12]].copy()
data_frameS4b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})
The communication using NVSHMEM and the barrier executed as a kernel on the device can be seen in the profiler, e.g. as shown below.
Basically, all kernels in the while loop can use the async keyword; please take a look at the solution source code. They will all use the OpenACC default async queue. To also place the halo exchange in the same queue, use:
#pragma acc host_data use_device(A)
{
    nvshmemx_double_put_on_stream(
        (double *)(A + iy_end_top * nx + ix_start),
        (double *)(A + iy_start * nx + ix_start), (ix_end - ix_start), top,
        (cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));
    nvshmemx_double_put_on_stream(
        (double *)(A + (iy_start_bottom - 1) * nx + ix_start),
        (double *)(A + (iy_end - 1) * nx + ix_start), (ix_end - ix_start), bottom,
        (cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));
}
nvshmemx_barrier_all_on_stream((cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));
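For reference, one of the kernels inside the while loop might then look like the following (a sketch of the A <- Anew copy with the async clause; see the solution source for the exact loop bounds and clauses used):

#pragma acc parallel loop present(A, Anew) async
for (int iy = iy_start; iy < iy_end; iy++)
{
    for( int ix = ix_start; ix < ix_end; ix++ )
    {
        A[iy*nx+ix] = Anew[iy*nx+ix];
    }
}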
Finally, when copying out data, make sure to wait for all device computation to complete first:
#pragma acc update self(A [(iy_start - 1) * nx:((iy_end - iy_start) + 2) * nx]) wait
You can compile, run and profile the solution with the next cells. After the profiling has completed, download the tarball containing the profiles (pgprof.Task5.solution.poisson2d.tar.gz) with the File Browser.
Then you can import them into pgprof / nvvp using the Import option in the File menu. Remember to use the Multiple processes option in the import wizard.
%cd $basedir/task5
checkdir('task5')
!make poisson2d.solution
checkdir('task5')
!NP=2 make run.solution
checkdir('task5')
!NP=2 make profile.solution
You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell.
checkdir('task5')
!NP=1 make run.solution | grep speedup > scale.out
!NP=2 make run.solution | grep speedup >> scale.out
!NP=4 make run.solution | grep speedup >> scale.out
!NP=6 make run.solution | grep speedup >> scale.out
data_frameS5 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)
!rm scale.out
data_frameS5b=data_frameS5.iloc[:,[5,7,10,12]].copy()
data_frameS5b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})
The asynchronous execution and execution in the same stream can be seen in the profiler, e.g. as shown below.
The most important part here is to get an nvshmem_ptr pointing to the symmetric d_A allocation of your top and bottom neighbors.
real * restrict d_Atop = (real *)nvshmem_ptr(d_A, top);
real * restrict d_Abottom = (real *)nvshmem_ptr(d_A, bottom);
When updating A from Anew, make sure to also update A on your top and bottom neighbors if you are at the boundary:
#pragma acc parallel loop present(A, Anew) deviceptr(d_Atop, d_Abottom) async
for (int iy = iy_start; iy < iy_end; iy++) {
    for (int ix = ix_start; ix < ix_end; ix++) {
        A[iy * nx + ix] = Anew[iy * nx + ix];
        if (iy == iy_start) { // this also needs to go to the lower halo region of my upper neighbor
            d_Atop[iy_end_top * nx + ix] = Anew[iy * nx + ix];
        }
        if (iy == iy_end - 1) { // this also needs to go to the upper halo region of my bottom neighbor
            d_Abottom[(iy_start_bottom - 1) * nx + ix] = Anew[iy * nx + ix];
        }
    }
}
We can then remove the explicit nvshmem_put calls completely. But remember to still keep the barrier:
nvshmemx_barrier_all_on_stream((cudaStream_t)acc_get_cuda_stream(acc_get_default_async()));
You can compile, run and profile the solution with the next cells. After the profiling has completed, download the tarball containing the profiles (pgprof.Task6.solution.poisson2d.tar.gz) with the File Browser.
Then you can import them into pgprof / nvvp using the Import option in the File menu. Remember to use the Multiple processes option in the import wizard.
%cd $basedir/task6
checkdir('task6')
!make poisson2d.solution
checkdir('task6')
!NP=2 make run.solution
checkdir('task6')
!NP=2 make profile.solution
You can do a simple scaling run for up to all 6 GPUs in the node by executing the next cell.
checkdir('task6')
!NP=1 make run.solution | grep speedup > scale.out
!NP=2 make run.solution | grep speedup >> scale.out
!NP=4 make run.solution | grep speedup >> scale.out
!NP=6 make run.solution | grep speedup >> scale.out
data_frameS6 = pandas.read_csv('scale.out', delim_whitespace=True, header=None)
!rm scale.out
data_frameS6b=data_frameS6.iloc[:,[5,7,10,12]].copy()
data_frameS6b.rename(columns={5:'GPUs', 7: 'time [s]', 10:'speedup', 12:'efficiency'})
The absence of device copies can be seen in the profiler, e.g. as shown below. There are only kernels running mostly back-to-back, interrupted only by the global reduction.