Commit b7eb6c70 authored by Ilya Zhukov

Modify training projects

parent 34993f84
Pipeline #188244 passed
@@ -20,7 +20,7 @@ This is the case if
 - you have successfully [applied for computing time](https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/apply-for-computing-time) during one of our calls for project proposals and are now the principal investigator (PI) of your own project, or
 - you have gained access to a project either by being invited by the PI or project administrator (PA) or by being granted access upon requesting to join a project through JuDoor.
-We have created a computing time project for this course with a project ID of `training2334`.
+We have created a computing time project for this course with a project ID of `training2410`.
 To join the project, log in to [JuDoor](https://judoor.fz-juelich.de) and click *Join a project* under the *Projects* heading.
 Enter the project ID and, if you want to, a message to remind the PI/PA (one of the instructors) why you should be allowed to join the project.
 Afterwards the PI/PA will be automatically informed about your join request and can add you to the different systems available in the project.
...
@@ -32,7 +32,7 @@ The compute time used for one job will be accounted by the following formula:
 Jobs that run on nodes equipped with GPUs are charged in the same way.
 Independent of the usage of the GPUs the available cores on the host CPU node are taken into account.
-Detailed information of each job can be found in KontView which is accessible via the button 'show extended statistics' for each project in [JuDoor](https://judoor.fz-juelich.de/projects/training2334/).
+Detailed information of each job can be found in KontView which is accessible via the button 'show extended statistics' for each project in [JuDoor](https://judoor.fz-juelich.de/projects/training2410/).
 Alternatively, you can execute the following command on the login nodes to query your CPU quota usage: `jutil user cpuquota`.
 Further information can be found in the "Accounting" chapter of the corresponding [System Documentation][System Documentation].
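For a rough idea of what a job costs in these terms, here is a minimal back-of-the-envelope sketch. The numbers (2 nodes, 128 cores per node, 3 hours of wall-clock time) are made up for illustration and this is not the exact accounting formula quoted above:
```
$ # illustration only: a job that held 2 full nodes with 128 cores each for 3 hours
$ nodes=2; cores_per_node=128; walltime_hours=3
$ echo $(( nodes * cores_per_node * walltime_hours ))
768
```
Under these assumptions, roughly 768 core-hours would be charged to the project budget, regardless of how many of the available cores the job actually kept busy.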
...
@@ -30,10 +30,10 @@ For brevity's sake, one can also make one of the projects the "active project" a
 This can also be done through the `jutil` command:
 ```
-$ jutil env activate -p training2334 -A training2334
+$ jutil env activate -p training2410 -A training2410
 ```
-Now `training2334` is the active project.
+Now `training2410` is the active project.
 Any computational jobs will be accounted against its budget and the special file system locations associated with it can be reached through certain environment variables.
 More about that in the next section.
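Once a budget has been activated like this, the `-A` option can be omitted from the Slurm commands shown later in this material, because the active budget is charged automatically. A brief sketch (the reservation name is the one used in the hands-on sections below; replace `YYYYMMDD` as described there):
```
$ jutil env activate -p training2410 -A training2410
$ srun --reservation hands-on-YYYYMMDD hostname   # no -A needed, the active budget is charged
```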
@@ -58,22 +58,22 @@ At least two directories are created for each project:
 Data projects have access to other storage locations, e.g. the tape based `ARCHIVE` for long term storage of results.
-The path of these directories is available as the value of environment variables of the form `<directory>_<project>`, e.g. `PROJECT_training2334` or `SCRATCH_training2334`.
+The path of these directories is available as the value of environment variables of the form `<directory>_<project>`, e.g. `PROJECT_training2410` or `SCRATCH_training2410`.
 If you have activated a project in the previous section, you will also have environment variables that are just `PROJECT` and `SCRATCH` that point to the respective directories of the active project.
-Print the contents of `PROJECT_training2334` and `PROJECT`:
+Print the contents of `PROJECT_training2410` and `PROJECT`:
 ```
-$ printenv PROJECT_training2334
-/p/project/training2334
+$ printenv PROJECT_training2410
+/p/project/training2410
 $ printenv PROJECT
-/p/project/training2334
+/p/project/training2410
 ```
 Change into that directory and see what is already there:
 ```
-$ cd $PROJECT_training2334
+$ cd $PROJECT_training2410
 $ ls
 ```
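The project directory is shared by all members of the project, so it is good practice to work inside a personal subdirectory named after your user name; the CUDA examples later in this material assume exactly that layout. A minimal sketch:
```
$ mkdir -p $PROJECT_training2410/$USER
$ cd $PROJECT_training2410/$USER
```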
...
@@ -37,14 +37,14 @@ If no resources are currently allocated, `srun` can infer from its command line
 After the associated commands have been run, the resources are relinquished and running further commands will have to ask for resources again.
 This one-shot mode can be useful when you want to interactively run a few quick jobs with varying sets of resources allocated for them.
 Run the `hostname` command, which simply prints the name of the machine it is running on, to see how `srun` runs commands on nodes other than the login nodes.
-On JURECA and JUSUF, use this command (Important: do not forget to replace `YYYYMMDD`, where `YYYY`, `MM`, and `DD` are the current year, month, and day in the Gregorian calendar, e.g. `20231121`):
+On JURECA and JUSUF, use this command (Important: do not forget to replace `YYYYMMDD`, where `YYYY`, `MM`, and `DD` are the current year, month, and day in the Gregorian calendar, e.g. `20240522`):
 ```
 $ hostname
 jrlogin09.jureca
-$ srun -A training2334 --reservation hands-on-YYYYMMDD hostname
+$ srun -A training2410 --reservation hands-on-YYYYMMDD hostname
 srun: job 3472578 queued and waiting for resources
 srun: job 3472578 has been allocated resources
 jrc0454
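If you are unsure whether the course reservation exists on the machine you are logged in to, or how its name is spelled, you can ask Slurm directly; `scontrol` is a standard Slurm command and not specific to this course:
```
$ scontrol show reservation | grep hands-on
```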
@@ -58,7 +58,7 @@ To submit to JUWELS Cluster, you want to be logged in to the Cluster login nodes
 ```
 $ hostname
 jwlogin02.juwels
-$ srun -A training2334 --reservation hands-on-cluster-YYYYMMDD hostname
+$ srun -A training2410 --reservation hands-on-cluster-YYYYMMDD hostname
 srun: job 9792359 queued and waiting for resources
 srun: job 9792359 has been allocated resources
 jwc06n213.juwels
@@ -69,7 +69,7 @@ To submit to JUWELS Booster, you want to be logged in to the Booster login nodes
 ```
 $ hostname
 jwlogin24.juwels
-$ srun -A training2334 --reservation hands-on-booster-YYYYMMDD --gres gpu:4 hostname
+$ srun -A training2410 --reservation hands-on-booster-YYYYMMDD --gres gpu:4 hostname
 srun: job 4575092 queued and waiting for resources
 srun: job 4575092 has been allocated resources
 jwb0053.juwels
@@ -85,7 +85,7 @@ $ srun <srun options...> <program> <program options...>
 Above we have seen four `srun` options:
-- `-A` (short for `--account`) to charge the resources consumed by the computation to the budget allotted to this course (if you have used `jutil env activate -A training2334` earlier on, you do not need this).
+- `-A` (short for `--account`) to charge the resources consumed by the computation to the budget allotted to this course (if you have used `jutil env activate -A training2410` earlier on, you do not need this).
 :::info
@@ -115,7 +115,7 @@ For the `<program>` we used `hostname` with no arguments of its own.
 To run more parallel instances of a program, increase the number of Slurm *tasks* using the `-n` option to `srun`:
 ```
-$ srun --label -A training2334 --reservation hands-on-cluster-YYYYMMDD -n 10 hostname
+$ srun --label -A training2410 --reservation hands-on-cluster-YYYYMMDD -n 10 hostname
 srun: job 3472812 queued and waiting for resources
 srun: job 3472812 has been allocated resources
 8: jwc00n002.juwels
@@ -141,7 +141,7 @@ Note also the `--label` option to `srun` which prefixes every line of output by
 Running more tasks than will fit on a single node will allocate two nodes and split the tasks between nodes:
 ```
-$ srun --label -A training2334 --reservation hands-on-cluster-YYYYMMDD -n 100 hostname
+$ srun --label -A training2410 --reservation hands-on-cluster-YYYYMMDD -n 100 hostname
 srun: job 3473040 queued and waiting for resources
 srun: job 3473040 has been allocated resources
 0: jwc00n007.juwels
@@ -157,7 +157,7 @@ Running over multiple nodes without intending to is also likely to degrade perfo
 You can now also use `srun` to run the `hellompi` program introduced in the previous section on deploying custom software:
 ```
-$ srun -A training2334 --reservation hands-on-cluster-YYYYMMDD -n 5 ./hellompi
+$ srun -A training2410 --reservation hands-on-cluster-YYYYMMDD -n 5 ./hellompi
 srun: job 3471349 queued and waiting for resources
 srun: job 3471349 has been allocated resources
 hello from process 4 of 5
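While a job is queued or running, you can check its state with `squeue`, another standard Slurm command that is not specific to this course:
```
$ squeue -u $USER   # list your own pending and running jobs
```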
@@ -197,7 +197,7 @@ However, since the number of CPU cores is always rounded up to the next multiple
 Using the `-N` command line argument, you can request a number of nodes from the resource manager (remember to specify `--gres gpu:4` for JUWELS Booster):
 ```
-$ salloc -A training2334 --reservation hands-on-cluster-YYYYMMDD -N 1
+$ salloc -A training2410 --reservation hands-on-cluster-YYYYMMDD -N 1
 salloc: Pending job allocation 3475519
 salloc: job 3475519 queued and waiting for resources
 salloc: job 3475519 has been allocated resources
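`salloc` then drops you into a shell (typically still on the login node) in which the allocation is active; `srun` commands issued from that shell run inside the allocation and no longer need the `-A` or `--reservation` options, and leaving the shell releases the nodes. A brief sketch of such a session, assuming the allocation above:
```
$ srun -n 4 hostname   # runs on the allocated compute node
$ exit                 # relinquish the allocation
```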
@@ -272,7 +272,7 @@ And enter the following script:
 ```sh
 #!/bin/bash
-#SBATCH --account=training2334
+#SBATCH --account=training2410
 #SBATCH --reservation=hands-on-cluster-YYYYMMDD
 #SBATCH --nodes=2
 #SBATCH --cpus-per-task=1
@@ -337,7 +337,7 @@ By default, Slurm assumes that the processes you create are single threaded and
 Allocate a node for playing around with this mechanism:
 ```
-$ salloc -A training2334 --reservation hands-on-cluster-YYYYMMDD -N 1
+$ salloc -A training2410 --reservation hands-on-cluster-YYYYMMDD -N 1
 salloc: Pending job allocation 3499694
 salloc: job 3499694 queued and waiting for resources
 salloc: job 3499694 has been allocated resources
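Inside this allocation you can experiment with `--cpus-per-task`. One way to make its effect visible, sketched here with the standard Linux tool `taskset` (the exact core numbers you see will depend on the node and on Slurm's binding settings), is to print each task's CPU affinity:
```
$ srun -n 2 --cpus-per-task=4 bash -c 'taskset -cp $$'
```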
...
@@ -18,9 +18,9 @@ The samples directory of the CUDA installation has a number of example codes you
 ```
 $ module load NVHPC ParaStationMPI MPI-settings/CUDA
-$ cd $PROJECT_training2334/$USER
+$ cd $PROJECT_training2410/$USER
 $ git clone https://github.com/NVIDIA/cuda-samples.git
-$ cd $PROJECT_training2334/$USER/cuda-samples/Samples/0_Introduction/simpleMPI
+$ cd $PROJECT_training2410/$USER/cuda-samples/Samples/0_Introduction/simpleMPI
 $ make
 /p/software/jurecadc/stages/2024/software/psmpi/5.9.2-1-NVHPC-23.7-CUDA-12/bin/mpicxx -I../../../Common -o simpleMPI_mpi.o -c simpleMPI.cpp
 /p/software/jurecadc/stages/2024/software/CUDA/12/bin/nvcc -ccbin g++ -I../../../Common -m64 --threads 0 --std=c++11 -Xcompiler -fPIE -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o simpleMPI.o -c simpleMPI.cu
@@ -34,7 +34,7 @@ There should now be an executable called `simpleMPI` inside the `simpleMPI` dire
 To run the program, use `srun` like before:
 ```
-$ srun -A training2334 -p <gpu partition> --gres gpu:4 -N 1 -n 4 ./simpleMPI
+$ srun -A training2410 -p <gpu partition> --gres gpu:4 -N 1 -n 4 ./simpleMPI
 [...]
 Running on 4 nodes
 Average of square roots is: 0.667305
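The `<gpu partition>` placeholder has to be replaced with the name of an actual GPU partition on the system you are using; which partitions exist can be listed with the standard Slurm command `sinfo`, for example:
```
$ sinfo -s   # one-line summary per partition
```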
@@ -61,7 +61,7 @@ After logging into the compute node, through `sgoto`, we show the usage of the G
 Afterwards we log out from the compute node, bring the backgrounded `srun` command to the foreground with `fg`, and cancel it by hitting `CTRL-C` a couple of times until the normal command line is available.
 ```
-$ srun -N 1 -n 1 -t 00:10:00 -A training2334 -p develbooster --gres=gpu:4 sleep 600 &
+$ srun -N 1 -n 1 -t 00:10:00 -A training2410 -p develbooster --gres=gpu:4 sleep 600 &
 [1] 25114
 srun: job 5535332 queued and waiting for resources
 srun: job 5535332 has been allocated resources
@@ -102,7 +102,7 @@ Thu May 12 08:49:34 2022
 $ exit
 logout
 $ fg
-srun -N 1 -n 1 -t 00:10:00 -A training2334 -p develbooster --gres=gpu:4 sleep 500
+srun -N 1 -n 1 -t 00:10:00 -A training2410 -p develbooster --gres=gpu:4 sleep 500
 ^Csrun: sending Ctrl-C to StepId=5535332.0
 srun: forcing job termination
 srun: Job step aborted: Waiting up to 6 seconds for job step to finish.
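If you have put several commands into the background like this, the shell built-in `jobs` lists them together with the job numbers that `fg` accepts (e.g. `fg %1`):
```
$ jobs
```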
@@ -121,7 +121,7 @@ Let us investigate further on this with a practical example.
 First, we prepare a device query example.
 ```
-$ cd $PROJECT_training2334/$USER/cuda-samples/Samples/1_Utilities/deviceQueryDrv/
+$ cd $PROJECT_training2410/$USER/cuda-samples/Samples/1_Utilities/deviceQueryDrv/
 $ make
 /p/software/jurecadc/stages/2024/software/CUDA/12/bin/nvcc -ccbin g++ -I../../../Common -m64 --threads 0 --std=c++11 -gencode arch=compute_50,code=compute_50 -o deviceQueryDrv.o -c deviceQueryDrv.cpp
 /p/software/jurecadc/stages/2024/software/CUDA/12/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_50,code=compute_50 -o deviceQueryDrv deviceQueryDrv.o -L/p/software/jurecadc/stages/2024/software/CUDA/12/lib64/stubs -lcuda
@@ -140,7 +140,7 @@ The following sbatch script `gpuAffinityTest.sbatch` written for the JUWELS Boos
 #SBATCH --time=00:01:00
 #SBATCH --partition=develbooster
 #SBATCH --gres=gpu:4
-#SBATCH -A training2334
+#SBATCH -A training2410
 module load CUDA NVHPC ParaStationMPI MPI-settings/CUDA
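A common way to give every MPI rank its own GPU, which is what a GPU affinity test probes, is a small wrapper script that maps the node-local task ID to one GPU before starting the actual program. The following is only a sketch of that idea; the file name `select_gpu.sh` is made up here and this is not the content of `gpuAffinityTest.sbatch`:
```
#!/bin/bash
# Hypothetical wrapper: expose exactly one GPU to each task, chosen by the
# node-local task ID that Slurm sets for every task it launches.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```
It would then be used as, for instance, `srun -n 4 ./select_gpu.sh ./deviceQueryDrv`.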
@@ -328,7 +328,7 @@ It is worth mentioning that you should use the same modules for compilation w
 ```
 $ module load NVHPC CUDA OpenMPI
 $ mpicxx -O0 -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart -lcuda mpiBroadcasting.cpp
-$ srun -N 2 -n 8 -t 01:00:00 -A training2334 -p booster --gres=gpu:4 ./a.out
+$ srun -N 2 -n 8 -t 01:00:00 -A training2410 -p booster --gres=gpu:4 ./a.out
 Broadcasting to all host memories took 4.526835 seconds.
 Broadcasting to all GPUs took 7.481972 seconds with intermediate copy to host memory.
 Broadcasting to all GPUs took 2.625439 seconds.
...