diff --git a/docs/access.md b/docs/access.md
index 1b4507bd4c721ad438ebd576bdf42413f150cc0d..3dcc71790c5fcfd7fe284da64d3ac0c37323da1f 100644
--- a/docs/access.md
+++ b/docs/access.md
@@ -20,7 +20,7 @@ This is the case if
 - you have successfully [applied for computing time](https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/apply-for-computing-time) during one of our calls for project proposals and are now the principal investigator (PI) of your own project, or
 - you have gained access to a project either by being invited by the PI or project administrator (PA) or by being granted access upon requesting to join a project through JuDoor.
 
-We have created a computing time project for this course with a project ID of `trainingFIXME`.
+We have created a computing time project for this course with a project ID of `training2436`.
 To join the project, log in to [JuDoor](https://judoor.fz-juelich.de) and click *Join a project* under the *Projects* heading.
 Enter the project ID and, if you want to, a message to remind the PI/PA (one of the instructors) why you should be allowed to join the project.
 Afterwards the PI/PA will be automatically informed about your join request and can add you to the different systems available in the project.
diff --git a/docs/budgeting.md b/docs/budgeting.md
index 217505c4b3f2348d3e1fd30ddc43e110b8332fe8..0efad9f9951c0cd2ca4d20c615e4a843bd016293 100644
--- a/docs/budgeting.md
+++ b/docs/budgeting.md
@@ -31,7 +31,7 @@ The compute time used for one job will be accounted by the following formula:
 Jobs that run on nodes equipped with GPUs are charged in the same way.
 Independent of the usage of the GPUs the available cores on the host CPU node are taken into account.
 
-Detailed information of each job can be found in KontView which is accessible via the button 'show extended statistics' for each project in [JuDoor](https://judoor.fz-juelich.de/projects/trainingFIXME/).
+Detailed information of each job can be found in KontView which is accessible via the button 'show extended statistics' for each project in [JuDoor](https://judoor.fz-juelich.de/projects/training2436/).
 Alternatively, you can execute the following command on the login nodes to query your CPU quota usage: `jutil user cpuquota`.
 Further information can be found in the "Accounting" chapter of the corresponding [System Documentation](./useful-links.md#system-documentation).
 
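To get a feel for the numbers behind the budgeting hunk above, a back-of-the-envelope sketch can help. It is illustrative only: it assumes the accounted compute time scales with allocated cores times wall-clock time (the exact formula and the per-system core counts are in the Accounting chapter of the system documentation), and it ends with the `jutil user cpuquota` query already mentioned in `docs/budgeting.md`.

```sh
# Rough estimate only -- the authoritative formula is in the system documentation.
nodes=2
cores_per_node=128      # placeholder; use the core count of the machine you run on
walltime_hours=6
echo "Estimated charge: $(( nodes * cores_per_node * walltime_hours )) core-hours"

# Check how much of the project budget is left (run on a login node):
jutil user cpuquota
```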
diff --git a/docs/environment.md b/docs/environment.md
index d119de8dda388813b930852362400405a3c235c9..e24a62faee5fe07011caa114b13c0963b24e7fed 100644
--- a/docs/environment.md
+++ b/docs/environment.md
@@ -30,10 +30,10 @@ For brevity's sake, one can also make one of the projects the "active project" a
 This can also be done through the `jutil` command:
 
 ```
-$ jutil env activate -p trainingFIXME -A trainingFIXME
+$ jutil env activate -p training2436 -A training2436
 ```
 
-Now `trainingFIXME` is the active project.
+Now `training2436` is the active project.
 Any computational jobs will be accounted against its budget and the special file system locations associated with it can be reached through certain environment variables.
 More about that in the next section.
 
@@ -58,22 +58,22 @@ At least two directories are created for each project:
 Data projects have access to other storage locations, e.g. the tape based `ARCHIVE` for long term storage of results.
 
-The path of these directories is available as the value of environment variables of the form `<directory>_<project>`, e.g. `PROJECT_trainingFIXME` or `SCRATCH_trainingFIXME`.
+The path of these directories is available as the value of environment variables of the form `<directory>_<project>`, e.g. `PROJECT_training2436` or `SCRATCH_training2436`.
 If you have activated a project in the previous section, you will also have environment variables that are just `PROJECT` and `SCRATCH` that point to the respective directories of the active project.
 
-Print the contents of `PROJECT_trainingFIXME` and `PROJECT`:
+Print the contents of `PROJECT_training2436` and `PROJECT`:
 
 ```
-$ printenv PROJECT_trainingFIXME
-/p/project1/trainingFIXME
+$ printenv PROJECT_training2436
+/p/project1/training2436
 
 $ printenv PROJECT
-/p/project1/trainingFIXME
+/p/project1/training2436
 ```
 
 Change into that directory and see what is already there:
 
 ```
-$ cd $PROJECT_trainingFIXME
+$ cd $PROJECT_training2436
 $ ls
 ```
 
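As a small illustration of the environment variables touched in the hunk above, a personal working directory can be set up along the following lines. This is a sketch, assuming membership in `training2436` has already been granted and that the per-user subdirectory layout used later in `docs/using-gpus.md` is wanted on the scratch file system as well.

```sh
# Activate the course project, then create and enter a per-user directory
# on its scratch file system ($SCRATCH is set by the activation).
jutil env activate -p training2436 -A training2436
mkdir -p "$SCRATCH/$USER"
cd "$SCRATCH/$USER"
pwd
```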
diff --git a/docs/running-jobs.md b/docs/running-jobs.md
index 2832e699bdad71646bbc5d52cbf2615edd72d51d..40f4a150e5de4b21ef211787a198571e53136793 100644
--- a/docs/running-jobs.md
+++ b/docs/running-jobs.md
@@ -47,7 +47,7 @@ Do not forget to replace `YYYYMMDD`, where `YYYY` and `MM` and `DD` are the curr
 ```
 $ hostname
 jrlogin09.jureca
-$ srun -A trainingFIXME --reservation hands-on-YYYYMMDD hostname
+$ srun -A training2436 --reservation hands-on-YYYYMMDD hostname
 srun: job 3472578 queued and waiting for resources
 srun: job 3472578 has been allocated resources
 jrc0454
@@ -61,7 +61,7 @@ To submit to JUWELS Cluster, you want to be logged in to the Cluster login nodes
 ```
 $ hostname
 jwlogin02.juwels
-$ srun -A trainingFIXME --reservation hands-on-cluster-YYYYMMDD hostname
+$ srun -A training2436 --reservation hands-on-cluster-YYYYMMDD hostname
 srun: job 9792359 queued and waiting for resources
 srun: job 9792359 has been allocated resources
 jwc06n213.juwels
@@ -72,7 +72,7 @@ To submit to JUWELS Booster, you want to be logged in to the Booster login nodes
 ```
 $ hostname
 jwlogin24.juwels
-$ srun -A trainingFIXME --reservation hands-on-booster-YYYYMMDD --gres gpu:4 hostname
+$ srun -A training2436 --reservation hands-on-booster-YYYYMMDD --gres gpu:4 hostname
 srun: job 4575092 queued and waiting for resources
 srun: job 4575092 has been allocated resources
 jwb0053.juwels
@@ -88,7 +88,7 @@ $ srun <srun options...> <program> <program options...>
 
 Above we have seen four `srun` options:
 
-- `-A` (short for `--account`) to charge the resources consumed by the computation to the budget allotted to this course (if you have used `jutil env activate -A trainingFIXME` earlier on, you do not need this).
+- `-A` (short for `--account`) to charge the resources consumed by the computation to the budget allotted to this course (if you have used `jutil env activate -A training2436` earlier on, you do not need this).
 
 :::info
 
@@ -118,7 +118,7 @@ For the `<program>` we used `hostname` with no arguments of its own.
 To run more parallel instances of a program, increase the number of Slurm *tasks* using the `-n` option to `srun`:
 
 ```
-$ srun --label -A trainingFIXME --reservation hands-on-cluster-YYYYMMDD -n 10 hostname
+$ srun --label -A training2436 --reservation hands-on-cluster-YYYYMMDD -n 10 hostname
 srun: job 3472812 queued and waiting for resources
 srun: job 3472812 has been allocated resources
 8: jwc00n002.juwels
@@ -144,7 +144,7 @@ Note also the `--label` option to `srun` which prefixes every line of output by
 Running more tasks than will fit on a single node will allocate two nodes and split the tasks between nodes:
 
 ```
-$ srun --label -A trainingFIXME --reservation hands-on-cluster-YYYYMMDD -n 100 hostname
+$ srun --label -A training2436 --reservation hands-on-cluster-YYYYMMDD -n 100 hostname
 srun: job 3473040 queued and waiting for resources
 srun: job 3473040 has been allocated resources
 0: jwc00n007.juwels
@@ -160,7 +160,7 @@ Running over multiple nodes without intending to is also likely to degrade perfo
 You can now also use `srun` to run the `hellompi` program introduced in the previous section on deploying custom software:
 
 ```
-$ srun -A trainingFIXME --reservation hands-on-cluster-YYYYMMDD -n 5 ./hellompi
+$ srun -A training2436 --reservation hands-on-cluster-YYYYMMDD -n 5 ./hellompi
 srun: job 3471349 queued and waiting for resources
 srun: job 3471349 has been allocated resources
 hello from process 4 of 5
@@ -204,7 +204,7 @@ However, since the number of CPU cores is always rounded up to the next multiple
 Using the `-N` command line argument, you can request a number of nodes from the resource manager (remember to specify `--gres gpu:4` for JUWELS Booster):
 
 ```
-$ salloc -A trainingFIXME --reservation hands-on-cluster-YYYYMMDD -N 1
+$ salloc -A training2436 --reservation hands-on-cluster-YYYYMMDD -N 1
 salloc: Pending job allocation 3475519
 salloc: job 3475519 queued and waiting for resources
 salloc: job 3475519 has been allocated resources
@@ -281,7 +281,7 @@ And enter the following script:
 
 ```sh
 #!/bin/bash
-#SBATCH --account=trainingFIXME
+#SBATCH --account=training2436
 #SBATCH --reservation=hands-on-cluster-YYYYMMDD
 #SBATCH --nodes=2
 #SBATCH --cpus-per-task=1
@@ -360,7 +360,7 @@ By default, Slurm assumes that the processes you create are single threaded and
 Allocate a node for playing around with this mechanism:
 
 ```
-$ salloc -A trainingFIXME --reservation hands-on-cluster-YYYYMMDD -N 1
+$ salloc -A training2436 --reservation hands-on-cluster-YYYYMMDD -N 1
 salloc: Pending job allocation 3499694
 salloc: job 3499694 queued and waiting for resources
 salloc: job 3499694 has been allocated resources
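The `srun` lines changed above can also be collected into a batch script. The following is a minimal sketch, using only options that already appear in `docs/running-jobs.md`; the reservation name, node count, and task count are placeholders to adapt. It submits the same `hostname` test with `sbatch` instead of running it interactively.

```sh
#!/bin/bash
#SBATCH --account=training2436
#SBATCH --reservation=hands-on-cluster-YYYYMMDD   # placeholder, as above
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
#SBATCH --output=hostname-%j.out

# One line of output per task, prefixed with the task number.
srun --label hostname
```

Submitting it with `sbatch <script>` writes the task output to `hostname-<jobid>.out`.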
diff --git a/docs/using-gpus.md b/docs/using-gpus.md
index eca7092922022e5a2e345f24af04f93b70344a5b..f91ebd53b3892e43723308c6521badb2016bcbc3 100644
--- a/docs/using-gpus.md
+++ b/docs/using-gpus.md
@@ -25,9 +25,9 @@ We load the necessary modules, navigate into our individual user directories for
 
 ```
 $ module load NVHPC ParaStationMPI MPI-settings/CUDA
-$ cd $PROJECT_trainingFIXME/$USER
+$ cd $PROJECT_training2436/$USER
 $ git clone https://github.com/NVIDIA/cuda-samples.git
-$ cd $PROJECT_trainingFIXME/$USER/cuda-samples/Samples/0_Introduction/simpleMPI
+$ cd $PROJECT_training2436/$USER/cuda-samples/Samples/0_Introduction/simpleMPI
 $ make
 /p/software/jurecadc/stages/2024/software/psmpi/5.9.2-1-NVHPC-23.7-CUDA-12/bin/mpicxx -I../../../Common -o simpleMPI_mpi.o -c simpleMPI.cpp
 /p/software/jurecadc/stages/2024/software/CUDA/12/bin/nvcc -ccbin g++ -I../../../Common -m64 --threads 0 --std=c++11 -Xcompiler -fPIE -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o simpleMPI.o -c simpleMPI.cu
@@ -40,7 +40,7 @@ There should now be an executable called `simpleMPI` inside the `simpleMPI` dire
 To run the program, use `srun` like before:
 
 ```
-$ srun -A trainingFIXME -p <gpu partition> --gres gpu:4 -N 1 -n 4 ./simpleMPI
+$ srun -A training2436 -p <gpu partition> --gres gpu:4 -N 1 -n 4 ./simpleMPI
 [...]
 Running on 4 nodes
 Average of square roots is: 0.667305
@@ -76,7 +76,7 @@ Afterwards we log out from the compute node with `exit`, put the executed `srun`
 If you want to try this example yourself, remember to change the sgoto command to the appropriate JobID, followed by a 0 (indicating the first, and in this case only, node in the job).
 
 ```
-$ srun -N 1 -n 1 -t 00:10:00 -A trainingFIXME -p develbooster --gres=gpu:4 sleep 600 &
+$ srun -N 1 -n 1 -t 00:10:00 -A training2436 -p develbooster --gres=gpu:4 sleep 600 &
 [1] 25114
 srun: job 5535332 queued and waiting for resources
 srun: job 5535332 has been allocated resources
@@ -117,7 +117,7 @@ Thu May 12 08:49:34 2022
 $ exit
 logout
 $ fg
-srun -N 1 -n 1 -t 00:10:00 -A trainingFIXME -p develbooster --gres=gpu:4 sleep 500
+srun -N 1 -n 1 -t 00:10:00 -A training2436 -p develbooster --gres=gpu:4 sleep 500
 ^Csrun: sending Ctrl-C to StepId=5535332.0
 srun: forcing job termination
 srun: Job step aborted: Waiting up to 6 seconds for job step to finish.
@@ -142,7 +142,7 @@ Let us investigate further on this with a practical example.
 First, we prepare a device query example, (remembering to reload the modules from the first example if you are completing this in a different session).
 
 ```
-$ cd $PROJECT_trainingFIXME/$USER/cuda-samples/Samples/1_Utilities/deviceQueryDrv/
+$ cd $PROJECT_training2436/$USER/cuda-samples/Samples/1_Utilities/deviceQueryDrv/
 make
 /p/software/jurecadc/stages/2024/software/CUDA/12/bin/nvcc -ccbin g++ -I../../../Common -m64 --threads 0 --std=c++11 -gencode arch=compute_50,code=compute_50 -o deviceQueryDrv.o -c deviceQueryDrv.cpp
 /p/software/jurecadc/stages/2024/software/CUDA/12/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_50,code=compute_50 -o deviceQueryDrv deviceQueryDrv.o -L/p/software/jurecadc/stages/2024/software/CUDA/12/lib64/stubs -lcuda
@@ -162,7 +162,7 @@ We perform this in this manner, as we wish to get information from multiple comm
 #SBATCH --time=00:01:00
 #SBATCH --partition=develbooster
 #SBATCH --gres=gpu:4
-#SBATCH -A trainingFIXME
+#SBATCH -A training2436
 
 module load CUDA NVHPC ParaStationMPI MPI-settings/CUDA
 
@@ -354,7 +354,7 @@ Use the same modules for compilation which you are planning to use for execution
 ```
 $ module load NVHPC CUDA OpenMPI
 $ mpicxx -O0 -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart -lcuda mpiBroadcasting.cpp
-$ srun -N 2 -n 8 -t 01:00:00 -A trainingFIXME -p booster --gres=gpu:4 ./a.out
+$ srun -N 2 -n 8 -t 01:00:00 -A training2436 -p booster --gres=gpu:4 ./a.out
 Broadcasting to all host memories took 4.526835 seconds.
 Broadcasting to all GPUs took 7.481972 seconds with intermediate copy to host memory.
 Broadcasting to all GPUs took 2.625439 seconds.
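Before running the CUDA samples built above, it can be worth a quick check that a GPU allocation actually exposes the expected devices. The snippet below is a sketch that reuses only options already shown in `docs/using-gpus.md` (`-A`, `-p develbooster`, `--gres=gpu:4`) together with the standard `nvidia-smi -L` listing; the partition and time limit are examples to adapt.

```sh
# List the GPUs visible inside a one-node allocation on the Booster
# development partition; expect one line per allocated GPU.
srun -A training2436 -p develbooster --gres=gpu:4 -N 1 -n 1 -t 00:02:00 nvidia-smi -L
```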