Helmholtz GPU Hackathon 2023
This repository holds the documentation for the Helmholtz GPU Hackathon 2023 at Jülich Supercomputing Centre (Forschungszentrum Jülich).
For additional info, please write to Andreas Herten (a.herten@fz-juelich.de) on Slack or email.
Sign-Up
Please use JuDoor to sign up for our training project, training2310: https://judoor.fz-juelich.de/projects/join/training2310
Make sure to accept the usage agreement for JURECA DC and JUWELS Booster.
Please upload your SSH key to the system via JuDoor. The key needs to be restricted to accept access only from a specific source, as specified through the from clause. Please have a look at the associated documentation (SSH Access and Key Upload).
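As an illustration, a key entry with a from clause in authorized_keys (uploaded through JuDoor) might look like the following; the IP ranges and the key itself are placeholders and need to be replaced with your own values:

```
from="93.199.0.0/16,10.0.0.0/8" ssh-ed25519 AAAAC3Nza...rest-of-your-key user@laptop
```

The from clause restricts logins with this key to connections originating from the listed addresses or networks.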
HPC Systems
We are primarily using JURECA DC for the Hackathon, a system with 768 NVIDIA A100 GPUs. As an optional alternative, the JUWELS Booster system with its 3600 A100 GPUs and stronger node-to-node interconnect (4×200 Gbit/s, compared to 2×200 Gbit/s for JURECA DC) can also be used. The focus should be on JURECA DC, though.
For the system documentation, see the following websites:
Access
After successfully uploading your key through JuDoor, you should be able to access JURECA DC via
ssh user1@jureca.fz-juelich.de
The hostname for JUWELS Booster is juwels-booster.fz-juelich.de.
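For convenience, an entry in your local ~/.ssh/config can shorten the command; this is a sketch using the hostname above, with the user name and key path as placeholders:

```
Host jureca
    HostName jureca.fz-juelich.de
    User user1
    IdentityFile ~/.ssh/id_ed25519
```

With this in place, `ssh jureca` suffices.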
An alternative way of accessing the systems is through Jupyter JSC, JSC's Jupyter-based web portal, available at https://jupyter-jsc.fz-juelich.de. Sessions should generally be launched on the login nodes. The portal also offers Xpra, a great alternative to X forwarding; it is well-suited for running the Nsight tools!
Environment
On the systems, different directories are accessible to you. To set environment variables according to a project, call the following snippet after logging in:
jutil env activate -p training2310 -A training2310
This will, for example, make the directory $PROJECT available, which you can use to store data. Your $HOME is not a good place for data storage, as it is severely limited! Use $PROJECT (or $SCRATCH, see the documentation on Available File Systems).
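A quick sanity check after activating the project environment; the echoed paths are system-specific and shown here only as an indication of what to expect:

```
jutil env activate -p training2310 -A training2310
echo $PROJECT   # project directory for data storage
echo $SCRATCH   # scratch file system for large, temporary data
```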
Different software can be loaded into the environment via environment modules, using the module command. To see the available compilers (the first level of a toolchain), type module avail.
The most relevant modules are
- Compiler: GCC (with additional CUDA), NVHPC
- MPI: ParaStationMPI, OpenMPI (make sure to have loaded MPI-settings/CUDA as well)
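As an example, a typical sequence to set up a CUDA-aware GCC toolchain might look like the following; exact module versions depend on the software stage currently installed on the system:

```
module load GCC               # compiler (first level of the toolchain)
module load CUDA              # CUDA toolkit
module load ParaStationMPI    # MPI implementation
module load MPI-settings/CUDA # enable CUDA-awareness for MPI
```

Substitute NVHPC for GCC/CUDA, or OpenMPI for ParaStationMPI, as needed.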
Containers
JSC supports containers through Apptainer (previously: Singularity) on the HPC systems. The details are covered in a dedicated article in the systems documentation. Access is subject to accepting a dedicated license agreement (because of special treatment regarding support) on JuDoor.
Once access is granted (check your groups), Docker containers can be imported and executed similarly to the following example:
$ apptainer pull tf.sif docker://nvcr.io/nvidia/tensorflow:20.12-tf1-py3
$ srun -n 1 --pty apptainer exec --nv tf.sif python3 myscript.py
Batch System
The JSC systems use a special flavor of Slurm as the workload manager (PSSlurm). Most of the vanilla Slurm commands are available, with some Jülich-specific additions. An overview of Slurm is available in the corresponding documentation, which also gives example job scripts and interactive commands: https://apps.fz-juelich.de/jsc/hps/jureca/batchsystem.html
Please account your jobs to the training2310 project, either by setting the according environment variable with the jutil command above, or by manually adding -A training2310 to your batch jobs.
Different partitions are available (see documentation for limits):
- dc-gpu: All GPU-equipped nodes
- dc-gpu-devel: Some nodes available for development
For the days of the Hackathon, reservations will be in place to accelerate scheduling of jobs. The reservation names will be announced here closer to the event.
X forwarding can sometimes be a challenge; please consider using Xpra in your browser through Jupyter JSC!
Etc
Previous Documentation
More (although slightly outdated) documentation is available from the 2021 Hackathon in the corresponding JSC GitLab Hackathon docu branch.
PDFs
See the directory ./pdf/ for PDF versions of the documentation.