Supercomputing 2019 Tutorial "Application Porting and Optimization on GPU-Accelerated POWER Architectures", November 18th 2019
Like the first task of this tutorial, this task is primarily designed to be worked through as an interactive Jupyter Notebook. However, everything can also be done using an SSH connection to Ascent (or any other POWER9 computer) in your terminal.
When using Jupyter, this Notebook will guide you through the steps. Note that if you execute a cell multiple times while optimizing the code, the output will be replaced. You can, however, duplicate the cell you want to execute and keep its output; check the Edit menu above.
You will always find links to a file browser of the corresponding task subdirectory, direct links to the source files you will need to edit, and the profiling output you need to open locally.
If you want, you can also get a terminal in your browser; just open it via the »New Launcher« button (+).
The tasks are placed in directories named Task[1-3].
Makefile targets are provided to cover everything from compiling to running and profiling. Please take a look at the cells containing the make calls; they also serve as a guide for the non-interactive version of this description.
We are using some very recent compiler features and therefore use GCC 9.2.0. It should already be in your environment. Let's check!
!gcc --version
This session comes with multiple tasks, each one to be found in the respective sub-directory Task[1-3]. In each of these directories you will also find Makefiles that are set up so that you can compile and submit all necessary tasks.
Please choose from the tasks below.
Task 1: Improve performance of the CPU Jacobi solver with compiler flags such as -Ofast and profile-directed feedback. Learn about compiler annotations.
Task 2: Improve performance of the CPU Jacobi solver with software prefetching. Some compilers such as IBM XL define flags that can be used to modify the aggressiveness of the hardware prefetcher. Learn to modify the DSCR value through XL and study the impact on application performance.
Task 3: Parallelize the CPU Jacobi solver and determine the right binding to be used for optimal performance.
For all tasks we have defined the following make targets.
poisson2d: builds the poisson2d binary (default)
run: runs poisson2d with default parameters

The goal of this task is to understand the different options available to optimize the performance of the CPU Jacobi solver.
Your task is to improve the performance of the solver by comparing the -O3 and -Ofast flags, applying profile-directed feedback, and studying compiler annotations, as detailed in the subsections below.

First, change the working directory to Task1.
%cd Task1
-Ofast vs. -O3

We are to compare the performance of the binary compiled with -Ofast optimization and with -O3 optimization. As in the previous task, we use a Makefile for compilation. The Makefile targets poisson2d_O3 and poisson2d_Ofast are already prepared.
TASK: Add -O3 as the optimization flag for the poisson2d_O3 target by using the corresponding CFLAGS definition. There are notes relating to Task 1 in the header of the Makefile. Compile the code using make as indicated below and run with the Make targets run, run_perf and run_perf_recrep.
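For orientation, here is a minimal sketch of what the edit could look like; the variable name CFLAGS_O3 and its exact placement are assumptions, the notes in the header of the prepared Makefile mark the actual spot.

# hypothetical sketch of the Task 1 A edit (names may differ in the prepared Makefile)
CFLAGS_O3 = -O3    # optimization flags used to build the poisson2d_O3 target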
!make poisson2d_O3
!make run
Let's have a look at the output of the Makefile target run_perf. It invokes the Linux perf tool to print details of the number of instructions executed and the number of cycles taken by POWER9 to execute the program. Feel free to add further counters to this call to perf.
!make run_perf
Next we run the Makefile target run_perf_recrep, which prints the top routines of the application in terms of hotness by using a combination of perf record ./app and perf report.
# run_perf_recrep displays the top hot routines
!make run_perf_recrep
TASK: Now add the optimization flag -Ofast to the CFLAGS for target poisson2d_Ofast. Compile the program with the target poisson2d_Ofast and run and analyse it as before with run, run_perf and run_perf_recrep.
What difference do you see?
!make poisson2d_Ofast
!make run
Again, run a perf-instrumented version:
!make run_perf
Generate the list of top routines in terms of hotness:
!make run_perf_recrep
If perf is unavailable to you on other machines, you can also study the disassembly with objdump: objdump -lSd ./poisson2d > poisson2d.dis (feel free to experiment with this in the Notebook as well, just prefix the command with a ! to execute it).
Depending on the application requirements, if high precision of the results is not mandatory, one can compile an application with -Ofast, which enables the -ffast-math option. This implements math functions in a relaxed manner, very similar to how general mathematical expressions are implemented, and avoids the overhead of calling a function from the math library. Comparing the files, you will see that the -Ofast binary implements the fmax function natively using instructions available in the hardware, whereas the -O3 binary makes a library call to compute fmax to follow the stricter IEEE requirements for accuracy.
For the first level of optimization we see that -Ofast cut the execution time of the -O3 binary by almost half.
We can optimize the performance further by using profile-directed feedback optimization.
To compile using profile-directed feedback with the GCC compiler we need to build the application in three stages:
Step 1 is achieved by compiling the binary with the corresponding flag, -fprofile-generate. In our case, we need to specify an output location, which should be $(SC19_DIR_SCRATCH).
Step 2 consists of a usual, albeit shorter, run of the instrumented binary. The run can be very short, though the parameters need to be representative of the actual run. After the binary has run, an output file (with file extension .gcda) is written to the directory specified during compilation.
For Step 3, the binary is once again compiled, but this time using the .gcda profile just generated. The corresponding flag is -fprofile-use, which we set to $(SC19_DIR_SCRATCH) as well.
In our Makefile at hand, we have already prepared the steps for you in the form of two targets:
poisson2d_train: will compile the binary with profile-directed feedback instrumentation
poisson2d_ref: will take a generated profile and compile a new, optimized binary
By using dependencies between these two targets, a profile run is launched.
TASK: Edit the Makefile and add the -fprofile-* flags to the CFLAGS of poisson2d_train and poisson2d_ref as outlined in the file.
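As a minimal sketch, assuming the variable names CFLAGS_TRAIN and CFLAGS_REF (the TODOs in the prepared Makefile name the real variables), the two flag sets could look like this:

# hypothetical sketch: step 1 writes the .gcda profile to scratch, step 3 reads it back
CFLAGS_TRAIN = -Ofast -fprofile-generate=$(SC19_DIR_SCRATCH)
CFLAGS_REF   = -Ofast -fprofile-use=$(SC19_DIR_SCRATCH)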
After that, you may launch them with the following cells (gen_profile is a meta-target and uses poisson2d_train and poisson2d_ref). If you need to clean the generated profile, you may use make clean_profile.
!make gen_profile
If the previous cell executed correctly, you now have your optimized executable. Let's see if it is even faster than before!
!make run
Let's also measure instructions and cycles
!make run_perf
What is your speed-up? Feel free to run with larger problem sizes (mesh; iterations).
Most compilers provide an option to emit annotations or remarks. These remarks summarize in detail which optimizations were performed and the source locations where they were applied. There are also options that indicate which optimizations were missed and why they could not be performed.
To generate compiler annotations using GCC, one uses -fopt-info-all. If you only want to see the missed optimizations, use the option -fopt-info-missed instead of -fopt-info-all. See also the GCC documentation regarding this flag.
TASK: Have a look at the CFLAGS of the Makefile target poisson2d_Ofast_info. Add the flag -fopt-info-all to the list of flags. This will print optimisation information to stdout. If you would rather print this information to a file, use, for example, -fopt-info-all=$(SC19_DIR_SCRATCH)/filename.
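A sketch of the adapted flags, assuming the variable name CFLAGS_OFAST_INFO (filename is a placeholder of your choice):

# hypothetical sketch for the poisson2d_Ofast_info target
CFLAGS_OFAST_INFO = -Ofast -fopt-info-all
# alternatively, write the remarks to a file instead of stdout:
# CFLAGS_OFAST_INFO = -Ofast -fopt-info-all=$(SC19_DIR_SCRATCH)/filename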
!make poisson2d_Ofast_info
Let's compare this with the output during compilation when using profile-directed feedback from Task 1 B.
TASK: Adapt the CFLAGS of poisson2d_ref_info to include -fopt-info-all and the profile input of -fprofile-use=… here. (Be advised: long output!)
!make poisson2d_ref_info
Comparing the annotations generated for a plain -Ofast optimization level with those generated at -Ofast plus profile-directed feedback, we observe that many more optimizations are possible due to the profile information. For instance, you will see annotations such as
poisson2d.c:114:25: optimized: loop unrolled 3 times (header execution count 436550)
The execution count indicates the dynamic execution count of the node at runtime. This information determines which paths are hotter and subsequently facilitates additional optimizations.
But first, let's change the directory to that of Task 2.
%cd ../Task2
TASK: Look at the Makefile and work on the TODOs.
Build and run the -Ofast-optimised binary and note down the performance in terms of cycles, seconds, and L3 misses. This is our baseline!
Edit the Makefile to add the option for software prefetching (-fprefetch-loop-arrays); have a look at the TODO close to CFLAGS_GCC. Compare the performance of -Ofast with and without software prefetching!
!make clean
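For orientation, a sketch of the Makefile edit; the variable name CFLAGS_GCC_PREF is an assumption, the TODO close to CFLAGS_GCC marks the real spot.

# hypothetical sketch: separate flag set for the prefetch-enabled build (poisson2d_pref)
CFLAGS_GCC_PREF = -Ofast -fprefetch-loop-arrays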
!make poisson2d CC=gcc
!make run
!make l3missstats
!make poisson2d_pref CC=gcc
!make run
!make l3missstats
TASK: Repeat the experiment with the -O3 flag. Have a look at the Makefile and the outlined TODO. There is a position to easily adapt -Ofast to -O3!
!make poisson2d CC=gcc -B
!make run
!make l3missstats
!make poisson2d_pref CC=gcc -B
!make run
!make l3missstats
Do you notice a difference in the impact at different optimization levels? At which optimization level does software prefetching help the most?
Compilation of the -Ofast binary with the software prefetching flag causes the compiler to generate the dcb* instructions that prefetch memory values into L3.
TASK: Run $(SC19_SUBMIT_CMD) objdump -lSd on each binary file (-O3 and -Ofast, each with and without prefetch). Look for instructions beginning with dcb. At what optimization levels does the compiler generate software prefetching instructions?
!make CC=gcc -B poisson2d_pref
!objdump -lSd ./poisson2d > poisson2d.dis
!grep dcb poisson2d.dis
This task requires the IBM XL compiler. It should already be in your environment.
We saw the impact of software prefetching in the previous subsection. In certain cases, tuning the hardware prefetcher through compiler options can also help improve performance. In this exercise we will look at compiler options that can be used to modify the DSCR value, which controls the aggressiveness of prefetching and can also be used to turn off hardware prefetching.
The IBM XL compiler has an option -qprefetch=dscr=<val> that can be used for this purpose. Compiling with -qprefetch=dscr=1 turns off the prefetcher. One can give various values such as -qprefetch=dscr=4, -qprefetch=dscr=7, etc. to control the aggressiveness of prefetching.
For this exercise we use make CC=xlc_r to illustrate the performance impact.
TASK: Generate an XL-compiled binary using the following cells. After you have generated a baseline, start editing the Makefile: add -qprefetch=dscr=1 to the CFLAGS, rebuild the application and note the performance. Which one is faster?
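A sketch of the flag change, assuming a dedicated variable CFLAGS_XL_DSCR and an existing CFLAGS_XL definition (both names are assumptions; the prepared Makefile marks the actual spot):

# hypothetical sketch: dscr=1 turns the hardware prefetcher off for the poisson2d_dscr build
CFLAGS_XL_DSCR = $(CFLAGS_XL) -qprefetch=dscr=1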
In general, applications benefit from the default setting of the hardware DSCR register (-qprefetch=dscr=0). However, certain applications also benefit from prefetching being turned off.
Note that the best DSCR value is highly application-specific: a value that works well for application A may not help application B.
Measure performance of the application compiled with XL at default DSCR value
!make CC=xlc_r -B poisson2d
!make run
Measure performance of the application compiled with XL with the hardware prefetcher turned off via DSCR
!make poisson2d_dscr CC=xlc_r -B
!make run
Does the hardware prefetcher help this application? How much impact do you see when you turn it off?
We add OpenMP shared-memory parallelism to the application. We also study the effect of binding the threads to certain cores on the resulting application performance. We do this study for both the GCC and XL compilers in order to learn about the appropriate options that need to be used.
First, we need to change the directory to that of Task 3. For Task 3 we modify poisson2d.c to invoke an exact copy of the main Jacobi loop, called poisson2d_reference. We parallelize only the main loop, not poisson2d_reference. The speedup is the performance gain of the main loop as compared to the reference loop.
%cd ../Task3
Task: Please add the correct OpenMP directives to poisson2d.c and the compilation flags in the Makefile to enable OpenMP with the GCC and XL compilers.
Edit poisson2d.c to add OpenMP parallelism. The pragmas in question are #pragma omp parallel for (and once it's #pragma omp parallel for reduction(max:error) – can you guess where?).
Edit the Makefile. For GCC, we need to add -fopenmp and the application needs to be linked with -lgomp; add both to CFLAGS_GCC. For XL, we need to add -qsmp=omp to the list of compilation flags (see CFLAGS_XL).
Afterwards, compile and run the application with the following commands.
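For orientation, a sketch of the flag additions; whether they are appended with += or added to the existing definitions depends on the prepared Makefile.

# hypothetical sketch of the OpenMP-related additions
CFLAGS_GCC += -fopenmp -lgomp   # enable OpenMP for GCC and link against libgomp
CFLAGS_XL  += -qsmp=omp         # enable OpenMP for XL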
!make poisson2d CC=gcc
The command to submit a job to the batch system is prepared in the environment variable $SC19_SUBMIT_CMD; use it together with eval. The following cell shows how to invoke the application using the batch system.
!eval $SC19_SUBMIT_CMD ./poisson2d 1000 1000 1000
In order to run the parallel application, we need to set the number of threads using OMP_NUM_THREADS.
What is the best performance you can reach by setting the number of threads via OMP_NUM_THREADS=N, with N being the number of threads? Feel free to play around with the command in the following cell, using 1 thread as an example.
We added --bind none to prevent jsrun, the job launcher on Ascent, from imposing its own binding options. Also, we use -c ALL_CPUS to make all CPUs on the compute nodes available to you.
!eval OMP_NUM_THREADS=N $SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000
Different CPU architectures and models come with different configurations of cores. The configuration plays an important role in the run time of the application. We need to optimize for it!
There are tools that can be used to determine the configuration of the processor. Among them are:
lscpu: can be used to determine the number of sockets, the number of cores, and the number of threads. It gives a very good overview and is available on most Linux systems.
ppc64_cpu --smt: specifically for POWER, this tool gives information about the number of simultaneous threads running per core (SMT, Simultaneous Multi-Threading).
Run ppc64_cpu --smt to find out about the threading configuration of Ascent!
!eval $SC19_SUBMIT_CMD ppc64_cpu --smt
There are more sources of information available:
/proc/cpuinfo: holds information about virtual cores, including model and clock speed. Available on most Linux systems. Usually used together with cat.
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list: holds information about thread siblings for a given CPU core (cpu0 in this case). Use it to find out which thread is mapped to which core.
!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
!$$SC19_SUBMIT_CMD cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
There are various OpenMP environment variables that hold across compilers to specify the binding of threads to cores. See, for instance, the OMP_PLACES environment variable. There is also a GNU-specific variable which can be used to control affinity: GOMP_CPU_AFFINITY. Setting GOMP_CPU_AFFINITY only works for GCC binaries, but it internally serves the same function as setting OMP_PLACES.
Task: Run the application enabled with OpenMP from Part A with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.
Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.
What's your maximum speedup?
aff="{X},{Y},{Z},{A}"
!eval OMP_DISPLAY_ENV=true OMP_PLACES=$aff OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep "OMP_PLACES\|speedup"
aff="X,Y,Z,A"
!eval OMP_DISPLAY_ENV=true GOMP_CPU_AFFINITY=$aff OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 100 | grep "OMP_PLACES\|speedup"
Great!
If you still have time: The same experiments can be repeated with the IBM XL compiler.
The corresponding compiler flag to enable OpenMP parallelism with XL is -qsmp=omp.
Task: In the Makefile, add the OpenMP flag and generate XL binaries with OpenMP, then run the application with various numbers of threads and note the performance speedup.
!make CC=xlc_r -B run
Run the parallel application with varying numbers of threads (OMP_NUM_THREADS) and note the performance improvement.
!eval OMP_NUM_THREADS=N $SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000
Now we repeat the exercise of using the right binding of threads for the XL binary. OMP_PLACES applies to the XL binary as well, since it is an OpenMP variable. GOMP_CPU_AFFINITY is specific to GCC binaries, so it cannot be used to set the binding here.
Task: Run the XL application enabled with OpenMP with different binding configurations. Make sure to at least run a) binding all threads to a single core and b) binding threads to different cores.
Adapt the following command with your configuration – or follow along accordingly in the non-interactive version of the Notebook.
We are mixing Python with Bash (!) here, so don't get confused (because of this, if we want to use Bash environment variables, we need to use two $$).
What's your maximum speedup?
for affinity in ["{X},{Y},{Z},{A}", "{P},{Q},{R},{S}"]:
print("Affinity: {}".format(affinity))
!eval OMP_DISPLAY_ENV=true OMP_PLACES=$affinity OMP_NUM_THREADS=4 $$SC19_SUBMIT_CMD -c ALL_CPUS --bind none ./poisson2d 1000 1000 1000 | grep "OMP_PLACES\|speedup"
Likewise, we see a higher speedup when we bind the threads to different cores rather than to a single core. This hands-on illustrates that, apart from compiler-level tuning, system-level tuning is equally important to obtain performance improvements.