Commit 6aa34171 authored by Xin Li

updated number of threads (up to 16)

parent 20e08ca6
@@ -48,7 +48,7 @@ For the 2022 PDC Summer School the reservation name is ``summer-<date>``, where
An environment variable specifying the number of threads should also be set:
```
-export OMP_NUM_THREADS=128
+export OMP_NUM_THREADS=16
```
Then the ``srun`` command is used to launch an OpenMP application:
@@ -57,7 +57,7 @@ Then the ``srun`` command is used to launch an OpenMP application:
srun -n 1 ./example.x
```
-In this example we will start one task with 128 threads.
+In this example we will start one task with 16 threads.
It is important to use the `srun` command since otherwise the job will run on the login node.
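To verify that the thread setting is picked up on the compute node, a minimal OpenMP program along the lines of the sketch below can be launched with the commands above (compiling ``example.c`` into ``example.x`` is an assumption for illustration, not part of the exercise material):
```
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Each thread reports its ID; the team size should match
       the value of OMP_NUM_THREADS exported before srun. */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```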
@@ -72,7 +72,7 @@ In this exercise, we explore parallel performance, which refers to the computational speed-up
### Tasks and questions to be addressed
-1) Measure run time $\Delta$*T*<sub>*n*</sub> for *n* = 1, 2, ..., 24 threads and calculate the speed-up.
+1) Measure run time $\Delta$*T*<sub>*n*</sub> for *n* = 1, 2, ..., 16 threads and calculate the speed-up.
2) Is it linear? If not, why?
3) Finally, is the obtained speed-up acceptable?
4) Try to increase the space discretization (M,N) and see if it affects the speed-up.
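For task 1 above, the speed-up on *n* threads is typically computed as *S*<sub>*n*</sub> = $\Delta$*T*<sub>1</sub> / $\Delta$*T*<sub>*n*</sub>; ideal (linear) scaling corresponds to *S*<sub>*n*</sub> = *n*, and any deviation from that line is what tasks 2 and 3 ask about.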
@@ -86,7 +86,7 @@ This implementation performs repeated execution of the benchmarked kernel to mak
- ``int omp_get_thread_num()`` to get thread ID
- ``double omp_get_wtime()`` to get the time in seconds since a fixed point in the past
- ``omp_set_num_threads()`` to set the number of threads to be used
-2) Run the parallel code and take the execution time with 1, 2, 4, 12, 24 threads for different array length ``N``. Record the timing.
+2) Run the parallel code and take the execution time with 1, 2, 4, 8, 16 threads for different array length ``N``. Record the timing.
3) Produce a plot showing execution time as a function of array length for different numbers of threads.
4) How large does ``N`` have to be before using 2 threads becomes more beneficial than a single thread?
5) How large does ``N`` need to be so that the arrays no longer fit into the L3 cache?
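As a rough illustration of how the runtime functions listed above fit together (the kernel, the array length ``N``, and the requested thread count here are placeholders, not the benchmark code of the exercise):
```
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const int N = 1000000;                  /* placeholder array length */
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    omp_set_num_threads(4);                 /* request 4 threads */

    double t0 = omp_get_wtime();            /* start wall-clock timer */
    #pragma omp parallel
    {
        printf("thread %d started\n", omp_get_thread_num());
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = a[i] + 2.0 * b[i];       /* placeholder kernel */
    }
    double t1 = omp_get_wtime();            /* stop wall-clock timer */

    printf("Elapsed time: %f s\n", t1 - t0);
    free(a);
    free(b);
    return 0;
}
```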
@@ -135,7 +135,7 @@ A simple serial C code to calculate $\pi$ is the following:
- ``int omp_get_thread_num()`` to get thread ID
- ``double omp_get_wtime()`` to get the time in seconds since a fixed point in the past
- ``omp_set_num_threads()`` to set the number of threads to be used
-2) Run the parallel code and take the execution time with 1, 2, 4, 8, 12, 24 threads. Record the timing.
+2) Run the parallel code and take the execution time with 1, 2, 4, 8, 16 threads. Record the timing.
3) How does the execution time change as the number of threads varies? Is it what you expected? If not, why do you think that is?
4) Is there a technique you have heard of in class to improve the scalability of this implementation? How would you implement it?
@@ -145,7 +145,7 @@ Hints:
- Divide loop iterations between threads (use the thread ID and the number of threads).
- Create an accumulator for each thread to hold partial sums that you can later combine to generate the global sum.
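A minimal sketch of this SPMD pattern, assuming the serial [pi.c](pi.c) uses the common midpoint-rule integration of $4/(1+x^2)$ over [0,1] (the step count, the ``MAX_THREADS`` limit, and the simple accumulator array are illustrative choices, not the reference solution):
```
#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 16

static long num_steps = 100000000;    /* illustrative number of steps */

int main(void)
{
    double sum[MAX_THREADS] = {0.0};  /* one partial sum per thread,
                                         assumes at most MAX_THREADS threads */
    double step = 1.0 / (double)num_steps;
    int nthreads = 1;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;

        /* Cyclic distribution of the loop iterations over the threads */
        for (long i = id; i < num_steps; i += nthrds) {
            double x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }

    /* Combine the per-thread partial sums into the global result */
    double pi = 0.0;
    for (int i = 0; i < nthreads; i++)
        pi += sum[i] * step;

    printf("pi = %.12f\n", pi);
    return 0;
}
```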
-## Exercise 3 - Calculate $\pi$ using critical and atomic directives
+## Exercise 4 - Calculate $\pi$ using critical and atomic directives
_Concepts: parallel region, synchronization, critical, atomic_
@@ -164,7 +164,7 @@ Hints:
- We can use a shared variable for $\pi$ that is updated concurrently by different threads. However, this variable needs to be protected with a critical section or an atomic update.
- Place a ``critical`` or an ``atomic`` directive immediately before the update ``pi += step`` (try both variants).
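One possible shape of that protected update, sketched under the same assumptions as above (per-thread partial sums, illustrative step count); this is not the reference solution:
```
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;    /* illustrative number of steps */

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;                  /* shared accumulator */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double partial = 0.0;         /* private per-thread partial sum */

        for (long i = id; i < num_steps; i += nthrds) {
            double x = (i + 0.5) * step;
            partial += 4.0 / (1.0 + x * x);
        }

        /* Protect the update of the shared variable: an atomic update
           (shown) or a critical section both work here. */
        #pragma omp atomic
        pi += partial * step;
    }

    printf("pi = %.12f\n", pi);
    return 0;
}
```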
-## Exercise 4 - Calculate &pi; with a loop and a reduction
+## Exercise 5 - Calculate &pi; with a loop and a reduction
_Concepts: worksharing, parallel loop, schedule, reduction_
@@ -173,7 +173,7 @@ Here we are going to implement a fourth parallel version of the [pi.c](pi.c) / [
### Tasks and questions to be addressed
1) Create a new parallel version of the [pi.c](pi.c) / [pi.f90](pi.f90) program using the ``#pragma omp for`` worksharing construct and a ``reduction`` operation.
-2) Run the new parallel code and take the execution time for 1, 2, 4, 8, 12, 24 threads. Record the timing in a table. Change the schedule to dynamic and guided and measure the execution time for 1, 2, 4, 8, 12, 24 threads.
+2) Run the new parallel code and take the execution time for 1, 2, 4, 8, 16 threads. Record the timing in a table. Change the schedule to dynamic and guided and measure the execution time for 1, 2, 4, 8, 16 threads.
3) Which scheduling provides the best performance? What is the reason for that?
4) Which is the fastest parallel implementation of the pi.c / pi.f90 program? Why is it the fastest? What would be an even faster implementation of the pi.c / pi.f90 program?
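A minimal sketch of the worksharing-plus-reduction variant, again assuming the midpoint-rule integration and an illustrative step count; swap the ``schedule`` clause to time the static, dynamic, and guided variants:
```
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;    /* illustrative number of steps */

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    /* Iterations are divided among threads by the runtime; the private
       partial sums are combined automatically by the reduction clause. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    double t1 = omp_get_wtime();

    printf("pi = %.12f  (%.3f s)\n", step * sum, t1 - t0);
    return 0;
}
```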