Commit acfe0f20 authored by Ilya Zhukov

Merge branch 'main' into 'main'

Updated in response to agreed ER: comments

See merge request !2
parents 8a70c3a8 fe7fd19d
Pipeline #188551 passed
@@ -47,7 +47,7 @@ The procedure is documented below for some popular choices:
- [OpenSSH](#generating-a-key-pair-with-openssh) - a popular choice on GNU/Linux, macOS, and other Unix-like operating systems
- [PuTTY](#generating-a-key-pair-with-putty) - a popular choice on Windows
Multi-Factor Authentication (MFA) is available through JuDoor, but is currently opt-in.
### Generating a key pair with OpenSSH
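The detailed steps follow below; as a hedged quick reference (the key type and file name shown here are illustrative and may differ from the exact recommendation in the full documentation), generating a key pair with OpenSSH typically looks like this:
```
# Generate an ed25519 key pair protected by a passphrase; the public key ends up in ~/.ssh/id_ed25519.pub
$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519
```
The file `~/.ssh/id_ed25519.pub` contains the public key that you will upload in a later step.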
@@ -109,7 +109,15 @@ Now copy the contents of the field *Public key for pasting into OpenSSH authoriz
![Key generation with PuTTY](putty_key_generator.png)
There is currently a known issue with the Windows implementation of OpenSSH. If you see the error message
```
Corrupted MAC on input.
ssh_dispatch_run_fatal: Connection to x.x.x.x port 22: message authentication code incorrect
```
while trying to log in, please follow the guidance [here][CorruptMAC].
[CorruptMAC]: https://apps.fz-juelich.de/jsc/hps/jureca/known-issues.html#mac-algorithm-related-ssh-connection-issues-from-windows
### Uploading the public key
@@ -194,16 +194,18 @@ $ module spider LAMMPS/24Dec2020
```
The problem is that LAMMPS is only available in toolchains which include ParaStationMPI.
We could simply reload the MPI module rather than the entire toolchain, but this can sometimes have unintended consequences: what is loaded by a `module load` command is not necessarily unloaded when swapping modules.
For this reason, we also do not recommend using the `module unload` command, although it is available.
Instead, we recommend unloading (almost) all modules and starting with a fresh environment, using `module purge`:
```
$ module purge
$ module load GCC
$ module load ParaStationMPI
$ module load LAMMPS
```
The `module` command is part of the Lmod software package.
It comes with its own help document, which you can access by running `module help`, and a [user guide is available online](https://lmod.readthedocs.io).
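For day-to-day work, a handful of Lmod commands cover most needs. A brief sketch (the module name `LAMMPS` is just an example):
```
$ module list            # show the modules currently loaded in this session
$ module avail           # list modules visible with the currently loaded toolchain
$ module spider LAMMPS   # search all modules, including other toolchain combinations
$ module help            # Lmod's built-in help
```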
@@ -139,4 +139,4 @@ The extract compressed files from the archive you can use the same way as for un
$ tar -xvf <archive name>.tar.gz
```
If you have access to `$ARCHIVE`, data placed there is migrated to tape for long-term storage. Tape drives are relatively slow, and retrieving a file requires retrieving the specific tape it is stored on. It is therefore strongly recommended that you bundle your files into a tar archive before placing them in `$ARCHIVE`; this allows much more efficient retrieval if you need the files later.
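As a hedged sketch (the directory and archive names are placeholders), archiving a results directory before moving it to `$ARCHIVE` could look like this:
```
# Bundle and compress the directory into a single file, then copy that one file to the archive file system.
$ tar -czvf results_2024.tar.gz results_2024/
$ cp results_2024.tar.gz $ARCHIVE/
```
Retrieving and unpacking the archive later then touches a single file on tape instead of many individual ones.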
@@ -2,9 +2,11 @@
sidebar_position: 9
---
# Using GPUs
All systems at JSC have nodes which are accelerated by General Purpose Graphics Processing Units (GPGPUs, or just GPUs).
In this section, we will discuss the basic aspects of using them, inspecting them during execution, and assigning them to particular MPI tasks, and say a little about the network architecture, which can be important for efficient usage.
Since the GPUs are all made by NVIDIA, using them is accomplished through their [CUDA SDK](https://docs.nvidia.com/cuda/).
CUDA is available as a module:
@@ -12,9 +14,14 @@ CUDA is available as a module:
$ module load CUDA
```
This example is executed on the JUWELS Booster module.
To demonstrate how to compile and run a program that uses GPUs, we will use one of the examples included in CUDA.
To do this, we must additionally load the compiler `NVHPC` and the MPI implementation `ParaStationMPI`.
We also load a settings module that ensures our MPI implementation is properly set up to use CUDA.
The samples directory of the CUDA installation has a number of example codes you can play and learn with.
Here we compile an example that uses a combination of C++, MPI and CUDA code.
We load the necessary modules, navigate into our individual user directories for this project, download the sample codes from NVIDIA using git, and finally change into the downloaded folder and build the software:
```
$ module load NVHPC ParaStationMPI MPI-settings/CUDA
@@ -29,7 +36,6 @@ mkdir -p ../../../bin/x86_64/linux/release
cp simpleMPI ../../../bin/x86_64/linux/release
```
There should now be an executable called `simpleMPI` inside the `simpleMPI` directory.
To run the program, use `srun` like before:
@@ -40,25 +46,34 @@ Running on 4 nodes
Average of square roots is: 0.667305
PASSED
```
In this command, `-A` indicates the account that the compute time is taken from.
`-p` indicates the partition: a specific queue in the cluster, either a "normal" one like `batch`, or one for particular uses (such as development) or particular resources (such as more RAM, or GPUs).
A partition that contains nodes equipped with GPUs must be specified: `-p develgpus` for JUWELS and JUSUF, `-p dc-gpu-devel` for JURECA, or `-p develbooster` for JUWELS Booster.
You must also specify the number of GPUs you want the command run within `srun` to have access to; `--gres gpu:4` makes 4 GPUs available to it.
It can sometimes be useful to set this differently, for example with `--gres gpu:1` if you want to run multiple independent tasks, each on a separate GPU of a node, or on JUSUF, which only has a single GPU per node in its GPU partition.
`-N` indicates the number of nodes and `-n` the number of tasks, as before.
`./simpleMPI` runs the program we compiled, from the directory we are currently in.
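For reference, a full invocation along these lines might look like the following sketch; the account `training2410` follows the other examples in this tutorial, and you should adjust the account, partition, and node and task counts to your own project and system:
```
$ srun -A training2410 -p develbooster -N 1 -n 4 --gres gpu:4 ./simpleMPI
```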
:::info
**Note:** In this output, *nodes* means *MPI tasks*. The developers of this code implicitly assumed that there is exactly one MPI task with one GPU per node.
:::
## GPU Inspection During Execution
LLview is a feature-rich monitoring tool that we recommend you familiarise yourself with; it extracts most of the data relevant for common monitoring use cases with little effort on the user side.
Nevertheless, logging into the compute nodes during job execution is easy and comfortable and, in some cases, necessary.
In the following bash session on the JUWELS Booster, a job is started through `srun` in the background of the login node session (the `&` at the end of the command).
Just hit Enter after this line to get the normal command prompt back.
The job simply waits for 600 seconds (10 minutes); it does no useful work and only serves to demonstrate accessing a node of a running job.
We then use `sgoto` to access a specific node during execution and show the usage of the GPUs on that node with `nvidia-smi`.
This command merely serves as an example and can be exchanged for anything you would like to do on the compute node during job execution.
Afterwards we log out from the compute node with `exit`, bring the `srun` command from the background to the foreground with `fg`, and cancel it by hitting `CTRL-C` a couple of times until the normal command line is available again.
If you want to try this example yourself, remember to change the `sgoto` command to use the appropriate job ID, followed by a 0 (indicating the first, and in this case only, node in the job).
```
$ srun -N 1 -n 1 -t 00:10:00 -A training2410 -p develbooster --gres=gpu:4 sleep 600 &
@@ -108,17 +123,17 @@ srun: forcing job termination
srun: Job step aborted: Waiting up to 6 seconds for job step to finish.
```
`sgoto` takes the job ID as the first argument and the node number within the job as the second argument, where counting starts at 0.
`nvidia-smi` prints some useful information about available GPUs on a node, like temperature, memory usage, currently running processes and power consumption.
## GPU Affinity
On systems with more than one GPU per node, a choice presents itself: which GPU should be visible to which application task(s)?
This is controlled through the environment variable `CUDA_VISIBLE_DEVICES`, which can be set to a comma separated list of integers identifying devices to be visible to a task.
You can manually define this variable before running your tasks with `srun` if the pinning is going to be the same for every task.
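As a minimal sketch of that case (`my_gpu_program` is a placeholder, not part of this tutorial), exporting the variable once before `srun` applies the same setting to every task:
```
# Expose only the first two GPUs to every task of the following job steps.
$ export CUDA_VISIBLE_DEVICES=0,1
$ srun -N 1 -n 2 ./my_gpu_program
```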
Let us investigate this further with a practical example.
First, we prepare a device query example (remembering to reload the modules from the first example if you are completing this in a different session).
```
$ cd $PROJECT_training2410/$USER/cuda-samples/Samples/1_Utilities/deviceQueryDrv/
@@ -131,11 +146,12 @@ cp deviceQueryDrv ../../../bin/x86_64/linux/release
This will create the executable `deviceQueryDrv`.
During the execution of `deviceQueryDrv` all visible CUDA devices are queried.
The following sbatch script `gpuAffinityTest.sbatch`, written for the JUWELS Booster, executes the helper bash script `gpuAffinityHelper.bash`, which in turn executes `deviceQueryDrv`.
We do it this way because we want to collect information from several commands inside each task, run in parallel.
```sh
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --time=00:01:00
#SBATCH --partition=develbooster
@@ -144,10 +160,10 @@ The following sbatch script `gpuAffinityTest.sbatch` written for the JUWELS Boos
module load CUDA NVHPC ParaStationMPI MPI-settings/CUDA
srun bash gpuAffinityHelper.bash
```
The helper script `gpuAffinityHelper.bash` is needed to print the environment variable `CUDA_VISIBLE_DEVICES` for every MPI task that is started.
```sh
#!/bin/bash
@@ -158,11 +174,13 @@ echo "MPI task" $SLURM_PROCID "with CUDA_VISIBLE_DEVICES =" $CUDA_VISIBLE_DEVICE
./deviceQueryDrv
```
The automatically set environment variable `SLURM_PROCID` contains the current MPI task ID.
Defining the environment variable `CUDA_VISIBLE_DEVICES` is left to you.
By uncommenting the commented line within `gpuAffinityHelper.bash`, `CUDA_VISIBLE_DEVICES` can be defined manually for every task.
This allows you to specify which GPUs are visible to which MPI tasks.
For the moment, leave it commented out.
Execute this example with `ntasks=1` in `gpuAffinityTest.sbatch` and study the output file.
```
MPI task 0 with CUDA_VISIBLE_DEVICES = 0,1,2,3
@@ -193,14 +211,17 @@ MPI task 0 with CUDA_VISIBLE_DEVICES = 0,1,2,3
Result = PASS
```
Note the value of `CUDA_VISIBLE_DEVICES` at the beginning: for this single MPI task, all 4 GPUs are visible.
We can also see the bus IDs of all GPUs on this node, which can be useful information but is not important for this tutorial.
At the end of the file you can also see the successful interconnectivity tests between the GPUs.
If you do not manually define the environment variable `CUDA_VISIBLE_DEVICES` yourself, `srun` will provide a default:
- for jobs with a single task (`-n 1`), all devices will be visible to that task (`CUDA_VISIBLE_DEVICES=0,1,2,3`)
- for all other jobs, only a single device will be visible per task, with the same device being visible to multiple tasks if there are more tasks than GPUs
Each task will only have access to the GPUs listed in the `CUDA_VISIBLE_DEVICES` set for that specific task.
By varying the number of tasks in the scripts above, you can study the behaviour of the GPU pinning and confirm this default.
If this default is not suited to your needs you can uncomment the line
@@ -209,21 +230,20 @@ export CUDA_VISIBLE_DEVICES=<comma-separated list of visible gpus>
```
and define `CUDA_VISIBLE_DEVICES` as you wish.
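As a hedged sketch of one common pattern (not part of the original scripts), the device can be derived from the node-local task ID that Slurm provides in `SLURM_LOCALID`, so that local task 0 sees GPU 0, local task 1 sees GPU 1, and so on:
```sh
# Inside gpuAffinityHelper.bash: pin each task on a node to a different single GPU.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
```
This only works as intended while there are at least as many GPUs per node as tasks per node.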
## Network Architecture Study
The JUWELS Booster provides a network infrastructure that allows direct data exchange between the GPUs, which can accelerate a workload.
These GPUs have their own on-board memory and are directly connected to the high-performance network (other nodes, storage, etc.).
![Network connection scheme of JUWELS Booster](network_connection_scheme_juwels_booster.png)
With this in mind, it becomes clear that the traditional data exchange between two GPUs, with an intermediate hop through host memory, will lead to less-than-ideal performance.
[CUDA-awareness](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/) of an MPI implementation is vital for good data exchange performance between GPUs.
On JSC supercomputing resources, CUDA-aware MPI implementations such as `ParaStationMPI` and `OpenMPI` are preinstalled.
To enable CUDA-awareness, you need to load the module `MPI-settings/CUDA`.
CUDA-awareness enables passing a pointer to data on the GPU directly to an MPI directive.
The following example `mpiBroadcasting.cpp` performs three different measurements of the speed of data exchange using the MPI directive `MPI_Bcast`.
`MPI_Bcast` broadcasts data from one MPI process to all other MPI processes in a communicator.
In the source code below, data is first exchanged between host memories.
Secondly, data is exchanged between GPUs with an intermediate hop through host memory.
@@ -311,20 +331,20 @@ int main(int argc, char *argv[])
}
```
The initial broadcasts are needed to let the network establish connections between the MPI tasks, so it does not make sense to measure them.
Some MPI implementations set up network connections between MPI tasks only at the first data exchange, to avoid setting up connections that are never required.
`MPI_Barrier` directs all MPI tasks to wait until all data has been broadcast.
As a result, three times are measured and printed: host to host, GPU to GPU via the host, and direct GPU to GPU.
This example is executed on 2 nodes with 4 tasks on every node, where each task occupies one GPU.
:::warning
Note that we switch compilers at this stage, compared to the previous instructions in this chapter.
Use the same modules for compilation which you are planning to use for execution.
:::
```
$ module load NVHPC CUDA OpenMPI
$ mpicxx -O0 -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart -lcuda mpiBroadcasting.cpp
@@ -336,9 +356,11 @@ Broadcasting to all GPUs took 2.625439 seconds.
The parameter `-O0` deactivates any optimizations performed by the compiler. This is needed because a powerful compiler could determine at compile time that the same data is initialized for all tasks and then sent around, which could lead to the MPI directives being removed at compile time and to extremely small but erroneous time measurements.
The other parts of this command supply the include path and libraries on which `mpiBroadcasting.cpp` depends.
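For reference, a launch matching the setup described above (2 nodes, 4 tasks per node, one GPU per task) might look like the following sketch; since `mpicxx` was invoked without `-o`, the executable is `a.out`, and the account and partition names again follow the earlier examples and may differ for you:
```
$ srun -A training2410 -p develbooster -N 2 -n 8 --gres gpu:4 ./a.out
```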
The data exchange directly from one GPU to another GPU is the fastest.
Furthermore, the CPUs on the JUWELS Booster nodes have relatively low compute performance, to avoid too much overhead and unnecessary power consumption.
These nodes are intentionally designed such that as much of the workload and data exchange as possible is performed by the GPUs.
You can study the source code and play around with this setup.
This will give you valuable insights on how to develop your own software for execution on the JUWELS Booster.
@@ -355,4 +377,4 @@ If you want more details, you can find the documentation for our various systems
The [CUDA SDK](https://docs.nvidia.com/cuda/) documentation gives you detailed information about how to develop CUDA code.
There are also excellent articles on the web for learning CUDA, like [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/) or [An Introduction to CUDA-Aware MPI](https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/).
The JSC regularly offers [CUDA courses for HPC](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses), which are an ideal starting point to get into the topic.