    PDC Summer School: OpenMP Lab Assignment

    Overview

    The goal of these exercises is to familiarize you with the OpenMP environment and to write your first parallel programs with OpenMP. We will also measure the code's performance and investigate race conditions and false sharing. This laboratory contains five exercises, each with step-by-step instructions below.

    To run your code, you first need to generate an executable. It is very important that you include the compiler flag telling the compiler that you are going to use OpenMP. If you forget the flag, the compiler will happily ignore all the OpenMP directives and create an executable that runs in serial. Different compilers use different flags, but many follow the convention of the GNU compilers and accept the OpenMP flag -fopenmp.

    To compile your C OpenMP code using gcc, therefore, use

    gcc -O2 -fopenmp -o myprog.x myprog.c -lm

    For Fortran, it is recommended to use the Intel compiler:

    module load i-compilers
    ifort -O2 -qopenmp -o myprog.x myprog.f90 -lm

    To run your code, you will need an allocation (for example, an interactive one):

    salloc -N 1 -t 4:00:00 -A <name-of-allocation> --reservation=<name-of-reservation>

    To set the number of threads, you need to set the OpenMP environment variable:

    export OMP_NUM_THREADS=<number-of-threads>

    To run an OpenMP code on a computing node:

    srun -n 1 ./name_exec

    Exercise 1 - OpenMP Hello World: get familiar with the OpenMP environment

    Concepts: Parallel regions, parallel, thread ID

    Here we are going to implement our first OpenMP program. Expected knowledge includes a basic understanding of the OpenMP environment: how to compile an OpenMP program, how to set the number of OpenMP threads, and how to retrieve the thread ID at runtime.

    Your code using 4 threads should behave similarly to:

    Input:

    srun -n 1 ./hello

    Output:

    Hello World from Thread 3
    Hello World from Thread 0
    Hello World from Thread 2
    Hello World from Thread 1

    Tasks and questions to be addressed

    1. Write a C/Fortran code that makes each OpenMP thread print "Hello World from Thread X!" with X = thread ID.
    2. How do you change the number of threads?
    3. How many different ways are there to change the number of threads? Which ones are they?
    4. How can you make the output ordered from thread 0 to thread 3?

    Hints:

    • Remember to include the OpenMP header file omp.h in C, or use the omp_lib module in Fortran.
    • Retrieve the ID of the thread with omp_get_thread_num() in C, or OMP_GET_THREAD_NUM() in Fortran.
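
    As a starting point, a minimal C sketch of such a program (the file name hello.c used below is just a placeholder):

        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            /* Every thread in the team executes the body of the parallel region. */
            #pragma omp parallel
            {
                int tid = omp_get_thread_num(); /* thread ID: 0 .. number of threads - 1 */
                printf("Hello World from Thread %d\n", tid);
            }
            return 0;
        }

    Compile it with gcc -O2 -fopenmp -o hello.x hello.c, then export OMP_NUM_THREADS=4 and run srun -n 1 ./hello.x as described above.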

    Exercise 2 - Parallel load/stores using pragma omp parallel for

    Concepts: Parallel, default data environment, runtime library calls

    Here we consider the parallelization of a widely used computational pattern, namely adding an array and a scaled array. Serial versions of this task are provided: stream-triad.c / stream-triad.f90

    This implementation executes the benchmarked kernel repeatedly to improve the time measurements.

    Tasks and questions to be addressed

    1. Create a parallel version of the programs using a parallel construct: #pragma omp parallel for. In addition to a parallel construct, you might need some runtime library routines:
      • int omp_get_max_threads() to get the maximum number of threads
      • int omp_get_thread_num() to get thread ID
      • double omp_get_wtime() to get the time in seconds since a fixed point in the past
      • omp_set_num_threads() to set the number of threads to be used
    2. Run the parallel code and measure the execution time with 1, 2, 4, 12, and 24 threads for different array lengths N. Record the timings.
    3. Produce a plot showing execution time as a function of array length for different numbers of threads.
    4. How large does N have to be before using 2 threads becomes more beneficial than using a single thread?
    5. How large does N need to be so that the arrays no longer fit into the L3 cache?
    6. Compare results for large N and 8 threads using different settings of OMP_PROC_BIND and reason about the observed performance differences.
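
    One possible way to parallelize the triad kernel with a combined parallel worksharing loop is sketched below. The array names a, b, c, the scalar, the array length N, and the repetition count NTIMES are assumptions made for illustration; the provided stream-triad.c defines its own variables and sizes.

        #include <stdio.h>
        #include <stdlib.h>
        #include <omp.h>

        int main(void)
        {
            const size_t N = 1 << 24;  /* array length (assumed; vary it in the exercise) */
            const int NTIMES = 10;     /* repetitions to improve the time measurement */
            const double scalar = 3.0;

            double *a = malloc(N * sizeof(double));
            double *b = malloc(N * sizeof(double));
            double *c = malloc(N * sizeof(double));
            for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

            double start = omp_get_wtime();
            for (int k = 0; k < NTIMES; k++)
            {
                /* The iterations of the loop are divided among the threads of the team. */
                #pragma omp parallel for
                for (size_t i = 0; i < N; i++)
                    a[i] = b[i] + scalar * c[i];
            }
            double elapsed = omp_get_wtime() - start;

            printf("a[N-1] = %f, threads = %d, N = %zu, time = %f s\n",
                   a[N - 1], omp_get_max_threads(), N, elapsed);
            free(a); free(b); free(c);
            return 0;
        }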

    Exercise 3 - Parallel calculation of π using pragma omp parallel

    Concepts: Parallel, default data environment, runtime library calls

    Here we are going to implement a first parallel version of the pi.c / pi.f90 code to calculate the value of π using the parallel construct. The figure below shows the numerical technique we are going to use to calculate π.

    [Figure PI_integral: midpoint-rule rectangles used to approximate π]

    Mathematically, we know that

        \pi = \int_0^1 \frac{4}{1 + x^2} \, dx

    We can approximate the integral as a sum of rectangles

        \pi \approx \sum_{i=0}^{N-1} F(x_i) \, \Delta

    where each rectangle has width \Delta = 1/N and height F(x_i) = 4 / (1 + x_i^2), evaluated at the midpoint x_i = (i + 0.5)\Delta of interval i.

    A simple serial C code to calculate π is the following:

        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            unsigned long nsteps = 1 << 27; /* around 10^8 steps */
            double dx = 1.0 / nsteps;

            double pi = 0.0;
            double start_time = omp_get_wtime();

            unsigned long i;
            for (i = 0; i < nsteps; i++)
            {
                double x = (i + 0.5) * dx;  /* midpoint of interval i */
                pi += 1.0 / (1.0 + x * x);
            }
            pi *= 4.0 * dx;

            double run_time = omp_get_wtime() - start_time;
            printf("pi = %.10f computed in %f seconds\n", pi, run_time);
            return 0;
        }

    Tasks and questions to be addressed

    1. Create a parallel version of the pi.c / pi.f90 program using a parallel construct: #pragma omp parallel. Pay close attention to shared versus private variables. In addition to a parallel construct, you might need some runtime library routines:
      • int omp_get_max_threads() to get the maximum number of threads
      • int omp_get_thread_num() to get thread ID
      • double omp_get_wtime() to get the time in seconds since a fixed point in the past
      • omp_set_num_threads() to set the number of threads to be used
    2. Run the parallel code and measure the execution time with 1, 2, 4, 8, 12, and 24 threads. Record the timings.
    3. How does the execution time change as you vary the number of threads? Is it what you expected? If not, why do you think that is?
    4. Is there any technique you have heard of in class to improve the scalability of this approach? How would you implement it?

    Hints:

    • Use a parallel construct: #pragma omp parallel.
    • Divide loop iterations between threads (use the thread ID and the number of threads).
    • Create an accumulator for each thread to hold partial sums that you can later combine to generate the global sum.
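
    A minimal sketch following these hints is shown below. The fixed-size array of partial sums (with the assumed bound MAX_THREADS) and the round-robin division of iterations are illustrative choices, not the only possibility.

        #include <stdio.h>
        #include <omp.h>

        #define MAX_THREADS 32   /* assumed upper bound on the number of threads */

        int main(void)
        {
            unsigned long nsteps = 1 << 27;
            double dx = 1.0 / nsteps;
            double partial[MAX_THREADS];        /* one accumulator per thread */
            int nthreads = 1;

            double start_time = omp_get_wtime();
            #pragma omp parallel
            {
                int id = omp_get_thread_num();
                int nt = omp_get_num_threads();
                if (id == 0) nthreads = nt;     /* remember the actual team size */

                double sum = 0.0;               /* private accumulator for this thread */
                for (unsigned long i = id; i < nsteps; i += nt)  /* round-robin split */
                {
                    double x = (i + 0.5) * dx;
                    sum += 1.0 / (1.0 + x * x);
                }
                partial[id] = sum;  /* adjacent elements may still cause false sharing */
            }

            double pi = 0.0;
            for (int t = 0; t < nthreads; t++)
                pi += partial[t];
            pi *= 4.0 * dx;

            printf("pi = %.10f computed in %f seconds\n", pi,
                   omp_get_wtime() - start_time);
            return 0;
        }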

    Exercise 4 - Calculate π using critical and atomic directives

    Concepts: parallel region, synchronization, critical, atomic

    Here we are going to implement a second and a third parallel version of the pi.c / pi.f90 code to calculate the value of π using the critical and atomic directives.

    Tasks and questions to be addressed

    1. Create two new parallel versions of the pi.c / pi.f90 program using the parallel construct #pragma omp parallel and a) #pragma omp critical b) #pragma omp atomic.
    2. Run the two new parallel codes and measure the execution time with 1, 2, 4, 8, 16, and 32 threads. Record the timings in a table.
    3. What would happen if you had not used critical or atomic to protect the shared variable?
    4. How does the execution time change as you vary the number of threads? Is it what you expected?
    5. Do the two versions of the code differ in performance? If so, what do you think is the reason?

    Hints:

    • We can use a shared variable pi that is updated concurrently by different threads. However, each update needs to be protected with a critical section or an atomic access.
    • Place the critical or atomic directive immediately before the update pi += 1.0 / (1.0 + x * x).
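
    A minimal sketch of the critical variant is shown below; the atomic variant is obtained by replacing the critical directive with #pragma omp atomic. The manual round-robin division of iterations mirrors the previous exercise and is only one possible choice.

        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            unsigned long nsteps = 1 << 27;
            double dx = 1.0 / nsteps;
            double pi = 0.0;                /* shared accumulator */

            double start_time = omp_get_wtime();
            #pragma omp parallel
            {
                int id = omp_get_thread_num();
                int nt = omp_get_num_threads();

                for (unsigned long i = id; i < nsteps; i += nt)
                {
                    double x = (i + 0.5) * dx;
                    /* Protect the update of the shared variable.  For the atomic
                       version, replace the directive below with "#pragma omp atomic". */
                    #pragma omp critical
                    pi += 1.0 / (1.0 + x * x);
                }
            }
            pi *= 4.0 * dx;

            printf("pi = %.10f computed in %f seconds\n", pi,
                   omp_get_wtime() - start_time);
            return 0;
        }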

    Exercise 5 - Calculate π with a loop and a reduction

    Concepts: worksharing, parallel loop, schedule, reduction

    Here we are going to implement a fourth parallel version of the pi.c / pi.f90 code to calculate the value of π using omp for and a reduction operation.

    Tasks and questions to be addressed

    1. Create a new parallel version of the pi.c / pi.f90 program using a worksharing loop construct (#pragma omp for inside a parallel region, or the combined #pragma omp parallel for) together with a reduction clause.
    2. Run the new parallel code and measure the execution time for 1, 2, 4, 8, 12, and 24 threads. Record the timings in a table. Then change the schedule to dynamic and guided and measure the execution time again for 1, 2, 4, 8, 12, and 24 threads.
    3. Which schedule provides the best performance? What is the reason for that?
    4. Which is the fastest parallel implementation of the pi.c / pi.f90 program? What is the reason for it being the fastest? What would be an even faster implementation?

    Hints:

    • To change the schedule, you can either set the environment variable with export OMP_SCHEDULE=type, where type can be static, dynamic, or guided (this takes effect for loops declared with schedule(runtime)), or specify the schedule directly in the source code as omp parallel for schedule(type).
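
    A minimal sketch of the reduction version is shown below. The schedule(runtime) clause is one possible choice: it defers the schedule selection to the OMP_SCHEDULE environment variable; writing schedule(static), schedule(dynamic), or schedule(guided) directly in the clause works as well.

        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            unsigned long nsteps = 1 << 27;
            double dx = 1.0 / nsteps;
            double pi = 0.0;

            double start_time = omp_get_wtime();
            /* Each thread gets a private copy of pi; the copies are combined with "+"
               at the end of the loop.  schedule(runtime) lets OMP_SCHEDULE pick the
               schedule (static, dynamic, or guided). */
            #pragma omp parallel for reduction(+:pi) schedule(runtime)
            for (unsigned long i = 0; i < nsteps; i++)
            {
                double x = (i + 0.5) * dx;
                pi += 1.0 / (1.0 + x * x);
            }
            pi *= 4.0 * dx;

            printf("pi = %.10f computed in %f seconds\n", pi,
                   omp_get_wtime() - start_time);
            return 0;
        }

    For example, export OMP_SCHEDULE=dynamic before running to measure the dynamic schedule with this version.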