diff --git a/README.md b/README.md index 6565aed7d9d4924a5c85a0c24ac63ae335c9449b..44a7b24c3cf803e96b2c824217b79afb4b3deae8 100644 --- a/README.md +++ b/README.md @@ -1,30 +1,36 @@ -**Instructions and hints on how to run for the MPI course** +# PDC Summer School: General Instructions for the MPI Labs -# Where to run +## Where to run -The exercises will be run on PDC's CRAY XC-40 system [Beskow](https://www.pdc.kth.se/hpc-services/computing-systems): +The exercises will be run on PDC's cluster [Tegner](https://www.pdc.kth.se/hpc-services/computing-systems/tegner-1.737437): ``` -beskow.pdc.kth.se +tegner.pdc.kth.se ``` -# How to login +## How to login -To access PDC's cluster you should use your laptop and the Eduroam or KTH Open wireless networks. +To access PDC's systems you need an account at PDC. Check the [instructions for obtaining an account](https://www.pdc.kth.se/support/documents/getting_access/get_access.html#apply-via-pdc-webpage). -[Instructions on how to connect from various operating systems](https://www.pdc.kth.se/support/documents/login/login.html). +Once you have an account, you can follow the [instructions on how to connect from various operating systems](https://www.pdc.kth.se/support/documents/login/login.html). +For details on the Kerberos-based authentication environment, please check the [Kerberos commands documentation](https://www.pdc.kth.se/support/documents/login/login.html#general-information-about-kerberos). -# More about the environment on Beskow +## More about the environment on Tegner -The Cray automatically loads several [modules](https://www.pdc.kth.se/support/documents/run_jobs/job_scheduling.html#accessing-software) at login. +Software that is not available by default needs to be loaded as a [module](https://www.pdc.kth.se/support/documents/run_jobs/job_scheduling.html#accessing-software). Use ``module avail`` to get a list of available modules. The following modules are of interest for these lab exercises: -- Heimdal - [Kerberos commands](https://www.pdc.kth.se/support/documents/login/login.html#general-information-about-kerberos) -- OpenAFS - [AFS commands](https://www.pdc.kth.se/support/documents/data_management/afs.html) -- SLURM - [batch jobs](https://www.pdc.kth.se/support/documents/run_jobs/queueing_jobs.html) and [interactive jobs](https://www.pdc.kth.se/support/documents/run_jobs/run_interactively.html) -- Programming environment - [Compilers for software development](https://www.pdc.kth.se/support/documents/software_development/development.html) +- **UPDATE** Different versions of the GNU compiler suite (``gcc/*``) +- **UPDATE** Different versions of the Intel compiler suite (``i-compilers/*``) -# Compiling MPI programs on Beskow +For more information see the [software development documentation page](https://www.pdc.kth.se/support/documents/software_development/development.html). + +Home directories are provided through an OpenAFS service. See the [AFS data management page](https://www.pdc.kth.se/support/documents/data_management/afs.html) for more information. + +To use the Tegner compute nodes you have to submit [SLURM batch jobs](https://www.pdc.kth.se/support/documents/run_jobs/queueing_jobs.html) or run [SLURM interactive jobs](https://www.pdc.kth.se/support/documents/run_jobs/run_interactively.html). + + +## Compiling MPI programs on Tegner By default the cray compiler is loaded into your environment.
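As a quick check that the toolchain works, an MPI source file can be compiled directly with the compiler wrappers. This is only a sketch: ``cc`` is the C wrapper shown later in this document, ``ftn`` is the usual Fortran counterpart, and ``hello_mpi.c``/``hello_mpi.f90`` are the lab source files.

```
cc  hello_mpi.c   -o hello_mpi.x   # C wrapper adds the MPI include and link flags automatically
ftn hello_mpi.f90 -o hello_mpi.x   # Fortran wrapper does the same
```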
In order to use another compiler you have to swap compiler modules: @@ -65,7 +71,7 @@ A workaround is to add a compiler flag: cc -D_Float128=__float128 source.c ``` -# Running MPI programs on Beskow +## Running MPI programs First it is necessary to book a node for interactive use: @@ -79,10 +85,10 @@ You might also need to specify a reservation by adding the flag Then the srun command is used to launch an MPI application: ``` -srun -n 32 ./example.x +srun -n 24 ./example.x ``` -In this example we will start 32 MPI tasks (there are 32 cores per node on the Beskow nodes). +In this example we will start 24 MPI tasks (there are 24 cores per node on the Tegner Thin nodes). If you do not use srun and try to start your program on the login node then you will get an error similar to @@ -93,9 +99,8 @@ MPID_Init(123).......: channel initialization failed MPID_Init(461).......: PMI2 init failed: 1 ``` - -# MPI Exercises +## MPI Exercises - MPI Lab 1: [Program Structure and Point-to-Point Communication in MPI](lab1/README.md) - MPI Lab 2: [Collective and Non-Blocking Communication](lab2/README.md) -- MPI Lab 3: [Advanced Topics](lab3/README.md) +- MPI Lab 3: [Advanced Topics](lab3/README.md) \ No newline at end of file diff --git a/lab1/README.md b/lab1/README.md index 73f863f7fa6ace0a39af4e6e49564f97b3f95a4b..2460afd61fe50e03d00399e931f2a8f3930b92a4 100644 --- a/lab1/README.md +++ b/lab1/README.md @@ -1,6 +1,8 @@ -# Overview +# PDC Summer School 2021: MPI Lab 1 -In this lab, you will gain familiarity with MPI program structure, and point-to-point communication by working with venerable programs such as "Hello, World", calculation of π, the game of life, and parallel search. +## Introduction + +In this lab, you will gain familiarity with MPI program structure, and point-to-point communication by working with venerable programs such as "Hello, World", calculation of $\pi$, the game of life, and parallel search. ### Goals @@ -8,123 +10,100 @@ Get familiar with MPI program structure, and point-to-point communication by wri ### Duration -Three hours - +3 hours -# Source Codes +### Source Codes - Hello, World: Serial C and Fortran ([hello_mpi.c](hello_mpi.c) and [hello_mpi.f90](hello_mpi.f90)) - Send data across all processes : No source provided -- Calculation of π: Serial C and Fortran ([pi_serial.c](pi_serial.c) and [pi_serial.f90](pi_serial.f90)) +- Calculation of $\pi$: Serial C and Fortran ([pi_serial.c](pi_serial.c) and [pi_serial.f90](pi_serial.f90)) - Parallel Search: Serial C and Fortran ([parallel_search-serial.c](parallel_search-serial.c) and [parallel_search-serial.f90](parallel_search-serial.f90)), input file ([b.data](b.data)), and output file ([reference.found.data](reference.found.data)) - Game of Life: Serial C and Fortran ([game_of_life-serial.c](game_of_life-serial.c) and [game_of_life-serial.f90](game_of_life-serial.f90)) -# Preparation +### Preparation -In preparation for this lab, read the [instructions on logging in to PDC](https://www.pdc.kth.se/support/documents/login/login.html), -which will help you get going on Beskow. +In preparation for this lab, read the "General Instructions for the MPI Labs". -# Exercise 1: Run "Hello, World" +## Exercise 1: Run "Hello, World" -[Compile](https://www.pdc.kth.se/support/documents/software_development/development.html) -and run the "Hello, World" program found in the lecture. Make sure you understand how each processors prints its rank as well as the total number of processors in the communicator MPI_COMM_WORLD. 
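For reference, a minimal MPI "Hello, World" in C is sketched below. It is only an illustration of the expected program structure; the version from the lecture may differ in its details.

```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello, World from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down MPI */
    return 0;
}
```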
+Compile and run the "Hello, World" program found in the lecture. Make sure you understand how each processor prints its rank as well as the total number of processors in the communicator ``MPI_COMM_WORLD``. - -# Exercise 2: Send data across all processes +## Exercise 2: Send data across all processes (broadcast) Write a program that takes data from process zero and sends it to all of the other processes. That is, process i should receive the data and send it to process i+1, until the last process is reached. - +*Figure 1. Broadcast* -Assume that the data consists of a single integer. For simplicity set the value for the first process directly in the code. You may want to use MPI_Send and MPI_Recv in your solution. + +Assume that the data consists of a single integer. For simplicity set the value for the first process directly in the code. You may want to use ``MPI_Send`` and ``MPI_Recv`` in your solution. -# Exercise 3: Find π Using P2P Communication (Master/Worker) +## Exercise 3: Find $\pi$ using P2P communication (master/worker) -The given program calculates π using an integral approximation. Take the serial version of the program and modify it to run in parallel. +The given program calculates $\pi$ using an integral approximation. Take the serial version of the program and modify it to run in parallel. -First familiarize yourself with the way the serial program works. How does it calculate π? +First familiarize yourself with the way the serial program works. How does it calculate $\pi$? Hint: look at the program comments. How does the precision of the calculation depend on DARTS and ROUNDS, the number of approximation steps? -Hint: edit DARTS to have various input values from 10 to 10000. What do you think will happen to the precision with which we calculate π when we split up the work among the nodes? +Hint: edit DARTS to have various input values from 10 to 10000. What do you think will happen to the precision with which we calculate $\pi$ when we split up the work among the nodes? Now parallelize the serial program. Use only the six basic MPI calls. Hint: As the number of darts and rounds is hard coded then all workers already know it, but each worker should calculate how many are in its share of the DARTS so it does its share of the work. When done, each worker sends its partial sum back to the master, which receives them and calculates the final sum. - -# Exercise 4: Use P2P in "Parallel Search" +## Exercise 4: Use P2P communication for doing "Parallel Search" In this exercise, you learn about the heart of MPI: point-to-point message-passing routines in both their blocking and non-blocking forms as well as the various modes of communication. -Your task is to parallelize the "Parallel Search" problem. In the parallel search problem, the program should find all occurrences of a certain integer, which will be called the target. It should then write the target value, the indices and the number of occurences to an output file. In addition, the program should read both the target value and all the array elements from an input file. +Your task is to parallelize the "Parallel Search" problem. In the parallel search problem, the program should find all occurrences of a certain integer, which will be called the target. It should then write the target value, the indices and the number of occurrences to an output file. In addition, the program should read both the target value and all the array elements from an input file.
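One possible shape for this master/worker decomposition is sketched below in C. It is only an illustration: the array is filled in place instead of being read from [b.data](b.data), all names are made up, and it assumes at least two ranks with the array size divisible by the number of workers.

```
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 300   /* illustrative array size, assumed divisible by (size - 1) */

int main(int argc, char *argv[])
{
    int rank, size, target = 62;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* run with at least two ranks */

    int chunk = N / (size - 1);                  /* rank 0 is the master, the rest are workers */

    if (rank == 0) {
        int b[N];
        for (int i = 0; i < N; i++) b[i] = i % 100;   /* stand-in for reading b.data */
        for (int w = 1; w < size; w++)                /* hand each worker its slice */
            MPI_Send(&b[(w - 1) * chunk], chunk, MPI_INT, w, 0, MPI_COMM_WORLD);
        for (int w = 1; w < size; w++) {              /* collect the per-worker counts */
            int nfound;
            MPI_Recv(&nfound, 1, MPI_INT, w, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("worker %d found %d occurrences of %d\n", w, nfound, target);
        }
    } else {
        int *mychunk = malloc(chunk * sizeof(int));
        int nfound = 0;
        MPI_Recv(mychunk, chunk, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; i++)
            if (mychunk[i] == target) nfound++;       /* global index would be (rank - 1) * chunk + i */
        MPI_Send(&nfound, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
        free(mychunk);
    }

    MPI_Finalize();
    return 0;
}
```

In the actual exercise each worker also has to report the indices of the hits and write its results to its own output file, which leads directly to the I/O hint below.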
Hint: One issue that comes up when parallelizing a serial code is handling I/O. As you can imagine, having multiple processes writing to the same file at the same time can produce useless results. A simple solution is to have each process write to an output file named with its rank. Output to these separate files removes the problem. Here is how to do that in C and Fortran: The C version is quite straightforward. ``` -sprintf(outfilename,"found.data_%d",myrank); +sprintf(outfilename, "found.data_%d", myrank); outfile = fopen(outfilename,"w") ; ``` The Fortran version is similar, but working with strings is not something normally done in Fortran. ``` -write(rankchar,'(i4.4)') myrank +write(rankchar, '(i4.4)') myrank outfilename="found.data_" // rankchar -open(unit=11,file=outfilename) +open(unit=11, file=outfilename) ``` +## Exercise 5: Use P2P communication for the "Game of Life" -# Exercise 5: Use P2P in "Game of Life" - -In this exercise, you continue learning about point-to-point message-passing routines in MPI. -After completing this exercise, you should be able to write the real parallel MPI code to solve the Game of Life. +In this exercise, you continue learning about point-to-point message-passing routines in MPI. After completing this exercise, you should be able to write the real parallel MPI code to solve the Game of Life. [Here is some background on the "Game of Life"](Game_of_life.md), in case you are new to the problem. -[Here is some background on the "Game of Life"](Game_of_life.md), in case you're new to the problem. +To start this exercise, add the initialization and finalization routines to the serial "Game of Life" code. This will effectively duplicate the exact same calculation on each processor. In order to show that the code is performing as expected, add statements to print overall size, and the rank of the local process. Don't forget to add the MPI header file. -To start this exercise, add the initialization and finalization routines to the serial "Game of Life" code. This will effectly duplicate the exact same calculation on each processor. In order to show that the code is performing as expected, add statements to print overall size, and the rank of the local process. Don't forget to add the MPI header file. - - -**Domain Decomposition** +### Domain Decomposition In order to truly run the "Game of Life" program in parallel, we must set up our domain decomposition, i.e., divide the domain into chunks and send one chunk to each processor. In the current exercise, we will limit ourselves to two processors. If you are writing your code in C, divide the domain with a horizontal line, so the upper half will be processed on one processor and the lower half on a different processor. If you are using Fortran, divide the domain with a vertical line, so the left half goes to one processor and the right half to another. Hint: Although this can be done with different kinds of sends and receives, use blocking sends and receives for the current problem. We have chosen the configuration described above because in C arrays, rows are contiguous, and in Fortran columns are contiguous. This approach allows the specification of the initial array location and the number of words in the send and receive routines. -One issue that you need to consider is that of internal domain boundaries. Figure 1 shows the "left-right" domain decomposition described above. Each cell needs information from all adjacent cells to determine its new state. 
 With domain decomposition, some of the required cells no longer are available on the local processor. A common way to tackle this problem is through the use of ghost cells. In the current example, a column of ghost cells is added to the right side of the left domain, and a column is also added to the left side of the right domain (shown in Figure 2). After each time step, the ghost cells are filled by passing the appropriate data from the other processor. You may want to refer to the figure in the -[background on the "Game of Life"](Game_of_life.md) to see how to fill the other ghost cells. +One issue that you need to consider is that of internal domain boundaries. Figure 2 shows the "left-right" domain decomposition described above. Each cell needs information from all adjacent cells to determine its new state. With domain decomposition, some of the required cells no longer are available on the local processor. A common way to tackle this problem is through the use of ghost cells. In the current example, a column of ghost cells is added to the right side of the left domain, and a column is also added to the left side of the right domain (shown in Figure 3). After each time step, the ghost cells are filled by passing the appropriate data from the other processor. You may want to refer to the figure in the [background on the "Game of Life"](Game_of_life.md) to see how to fill the other ghost cells. -Figure 1. Left-right domain decomposition. +*Figure 2. Left-right domain decomposition* <img src="lr_decomp.jpg" alt="Figure 1" width="400px"/> -Figure 2. Ghost cells. +*Figure 3. Ghost cells* <img src="ghost.jpg" alt="Figure 2" width="400px"/> -**Your Challenge** +### Your Challenge Implement the domain decomposition described above, and add message passing to the ghost cells. Don't forget to divide the domain using a horizontal line for C and a vertical line for Fortran. In a subsequent lesson we will examine domain decomposition in the opposite direction. +## Acknowledgment -# Solutions - -The solutions will be made available at the end of the lab. -# Acknowledgment -The examples in this lab are provided for educational purposes by -[National Center for Supercomputing Applications](http://www.ncsa.illinois.edu/), -(in particular their [Cyberinfrastructure Tutor](http://www.citutor.org/)), -[Lawrence Livermore National Laboratory](https://computing.llnl.gov/) and -[Argonne National Laboratory](http://www.mcs.anl.gov/). Much of the LLNL MPI materials comes from the -[Cornell Theory Center](http://www.cac.cornell.edu/). -We would like to thank them for allowing us to develop the material for machines at PDC. -You might find other useful educational materials at these sites. - +The examples in this lab are provided for educational purposes by [National Center for Supercomputing Applications](http://www.ncsa.illinois.edu/) (in particular their [Cyberinfrastructure Tutor](http://www.citutor.org/)), [Lawrence Livermore National Laboratory](https://computing.llnl.gov/) and [Argonne National Laboratory](http://www.mcs.anl.gov/). Much of the LLNL MPI material comes from the [Cornell Theory Center](http://www.cac.cornell.edu/). We would like to thank them for allowing us to develop the material for machines at PDC. You might find other useful educational materials at these sites.
\ No newline at end of file diff --git a/lab2/README.md b/lab2/README.md index 061bfc59830743937e0d5c2f394121926a3748b0..df84236619e0b6b10db59fcb4ee324824c65705e 100644 --- a/lab2/README.md +++ b/lab2/README.md @@ -1,79 +1,67 @@ -# Overview +# PDC Summer School 2021: MPI Lab 2 -In this lab, you'll get familiar with MPI's Collection Communication routines, using them on programs you previously wrote with point-to-point calls. You'll also explore non-blocking behavior. +## Introduction + +In this lab you will get more familiar with MPI I/O and MPI performance measurements. ### Goals -Get familar with MPI Collective Communication routines and non-blocking calls +Get experience in MPI I/O as well as MPI performance. ### Duration -Three hours - - -# Source Codes - -- Calculation of π: Serial C and Fortran ([pi_serial.c](pi_serial.c) and [pi_serial.f90](pi_serial.f90)) -- Send data across all processes : No source provided -- Parallel Search: Serial C and Fortran ([parallel_search-serial.c](parallel_search-serial.c) and [parallel_search-serial.f90](parallel_search-serial.f90)), - input file ([b.data](b.data)), and output file ([reference.found.data](reference.found.data)) -- Game of Life: Serial C and Fortran ([game_of_life-serial.c](game_of_life-serial.c) and [game_of_life-serial.f90](game_of_life-serial.f90)) - -# Preparation - -In preparation for this lab, read the [general instructions](../README.md) which will help you get going on Beskow. - -# Exercise 1: Calculate π Using Collectives - -Calculates π using a "dartboard" algorithm. If you're unfamiliar with this algorithm, checkout the Wikipedia page on -[Monte Carlo Integration](http://en.wikipedia.org/wiki/Monte_Carlo_Integration) or -*Fox et al.(1988) Solving Problems on Concurrent Processors, vol. 1, page 207.* - -Hint: All processes should contribute to the calculation, with the master averaging the values for π. Consider using `mpi_reduce` to collect results. - +3 hours -# Exercise 2: Send data across all processes using Non-Blocking +### Source Codes -Take the code for sending data across all processes from the MPI Lab 1, and have each node add one to the number received, print out the result, and send the results on. +- MPI I/O. Serial hello world in C and Fortran ([hello_mpi.c](hello_mpi.c) and [hello_mpi.f90](hello_mpi.f90)) +- MPI Derived types and I/O. Serial STL file reader in C and Fortran ([mpi_derived_types.c](mpi_derived_types.c) and [mpi_derived_types.f90](mpi_derived_types.f90) +- MPI Latency: C and Fortran ([mpi_latency.c](mpi_latency.c) and [mpi_latency.f90](mpi_latency.f90)) +- MPI Bandwidth : C and Fortran ([mpi_bandwidth.c](mpi_bandwidth.c) and [mpi_bandwidth.f90](mpi_bandwidth.f90)) +- MPI Bandwidth Non-Blocking: C and Fortran ([mpi_bandwidth-nonblock.c](mpi_bandwidth-nonblock.c) and [mpi_bandwidth-nonblock.f90](mpi_bandwidth-nonblock.f90)) -### Use Proper Synchronization +### Preparation -For the case where you want to use proper synchronization, you'll want to do a non-blocking receive, add one, print, then a non-blocking send. The result should be `1 - 2 - 3 - 4 - 5 ...` +In preparation for this lab, read the "General Instructions for the MPI Labs". -### Try without Synchronization: Detect Race Conditions +## Exercise 1: MPI I/O -To see what happens without synchronization, leave out the `wait`. +MPI I/O is used so that results can be written to the same file in parallel. 
Take the serial hello world programs and modify them so that instead of writing the output to screen the output is written to a file using MPI I/O. -# Exercise 3: Find π Using Non-Blocking Communications +The simplest solution is likely to be for you to create a character buffer, and then use the ``MPI_File_write_at`` function. -Use a non-blocking send to try to overlap communication and computation. Take the code from Exercise 1 as your starting point. +## Exercise 2: MPI I/O and derived types -# Exercise 4: Implement the "Parallel Search" and "Game of Life" Using Collectives +Take the serial stl reader and modify it such that the stl file is read (and written) in parallel using collective MPI I/O. Use derived types such that the file can be read/written with a maximum of 3 I/O operations per read and write. -In almost every MPI program there are instances where all the processors in a communicator need to perform some sort of data transfer or calculation. These "collective communication" routines are the subject of this exercise and the "Parallel Search" and "Game of Life" programs are no exception. +The simplest solution is likely to create a derived type for each triangle, and then use the ``MPI_File_XXXX_at_all`` function. A correct solution will have the same MD5 hash for both stl models (input and output), unless the order of the triangles has been changed. -### Your First Challenge +``` +md5sum out.stl data/sphere.stl +822aba6dc20cc0421f92ad50df95464c out.stl +822aba6dc20cc0421f92ad50df95464c data/sphere.stl +``` -Modify your previous "Parallel Search" code to change how the master first sends out the target and subarray data to the slaves. Use the MPI broadcast routines to give each slave the target. Use the MPI scatter routine to give all processors a section of the array ``b`` it will search. +## Exercise 3: Bandwidth and latency between nodes -Hint: When you use the standard MPI scatter routine you will see that the global array ``b`` is now split up into four parts and the master process now has the first fourth of the array to search. So you should add a search loop (similar to the workers') in the master section of code to search for the target and calculate the average and then write the result to the output file. This is actually an improvement in performance since all the processors perform part of the search in parallel. +Use ``mpi_wtime`` to compute latency and bandwidth with the bandwidth and latency codes listed above. -### Your Second Challenge +For this exercise you should compare different setups where (a) both MPI ranks are on the same node, e.g. -Modify your previous "Game of Life" code to use `mpi_reduce` to compute the total number of live cells, rather than individual sends and receives. +``` +salloc -N 1 --ntasks-per-node=2 -A <project> -t 00:05:00 +srun -n 2 ./mpi_latency.x +``` -# Solutions +or on separate nodes, e.g. -The solutions will be made available at the end of the lab. +``` +salloc -N 2 --ntasks-per-node=1 -A <project> -t 00:05:00 +srun -n 2 ./mpi_latency.x +``` -# Acknowledgment +Compare the different results and reason about the observed values. -The examples in this lab are provided for educational purposes by -[National Center for Supercomputing Applications](http://www.ncsa.illinois.edu/), -(in particular their [Cyberinfrastructure Tutor](http://www.citutor.org/)), -[Lawrence Livermore National Laboratory](https://computing.llnl.gov/) and -[Argonne National Laboratory](http://www.mcs.anl.gov/). 
Much of the LLNL MPI materials comes from the -[Cornell Theory Center](http://www.cac.cornell.edu/). -We would like to thank them for allowing us to develop the material for machines at PDC. -You might find other useful educational materials at these sites. +## Acknowledgment +The examples in this lab are provided for educational purposes by [National Center for Supercomputing Applications](http://www.ncsa.illinois.edu/) (in particular their [Cyberinfrastructure Tutor](http://www.citutor.org/)), [Lawrence Livermore National Laboratory](https://computing.llnl.gov/) and [Argonne National Laboratory](http://www.mcs.anl.gov/). Much of the LLNL MPI material comes from the [Cornell Theory Center](http://www.cac.cornell.edu/). We would like to thank them for allowing us to develop the material for machines at PDC. You might find other useful educational materials at these sites. \ No newline at end of file