Getting started with Deep Learning on Supercomputers
This repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers available at the Jülich Supercomputing Centre (JSC) for deep learning based projects. It is assumed that the reader is proficient in the frameworks used throughout this tutorial, namely Tensorflow (including its Keras API) and Horovod.
Note: This tutorial is by no means intended as an introduction to deep learning, or to any of the above-mentioned frameworks. If you are interested in educational resources for beginners, please visit this page.
Announcements
- April 26, 2021: The tutorial has been updated to use Tensorflow 2. Also, code samples and datasets that are no longer relevant, e.g., those for Caffe, have been removed.
- November 28, 2019: Slides and code samples for the "Deep Learning on Supercomputers" talk, given as part of the "Introduction to the programming and usage of the supercomputer resources at Jülich" course, are now available in the course_material directory.
- November 22, 2019: Samples for Caffe are no longer supported on JURECA due to system-wide MVAPICH2 module changes.
- November 18, 2019: The horovod_data_distributed directory has been added; it contains code samples that illustrate proper data-distributed training with Horovod, i.e., a distribution mechanism where the training data is distributed instead of epochs. Further information is available in the directory-local README.md.
- September 02, 2019: Even though PyTorch is available as a system-wide module on the JSC supercomputers, all PyTorch examples have been removed from this tutorial. This is because the tutorial developers are not currently working with PyTorch, and are therefore not in a position to provide support for PyTorch-related issues.
- August 23, 2019:
  - Tensorflow and Keras examples (with and without Horovod) are now fully functional on JUWELS as well.
  - Python 2 support has been removed from the tutorial for all frameworks except Caffe.
Table of contents
- A word regarding the code samples
- Changes made to support loading of pre-downloaded datasets
- Applying for user accounts on supercomputers
- Logging on to the supercomputers
- Cloning the repository
- Running a sample
- Distributed training
- Credits
1. A word regarding the code samples
Samples for each framework are available in the correspondingly named directory.
Each such directory typically contains at least one code sample, which trains a
simple artificial neural network on the canonical MNIST hand-written digit
classification task. Moreover, job submission scripts are included for all the
supercomputers on which this tutorial has been tested. The job scripts will
hopefully make it easier to figure out which modules to load. Finally, a
README.md
file contains further information about the contents of the
directory.
Disclaimer: The samples are not intended to serve as examples of optimized code, nor do they represent programming best practices.
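For orientation, the sketch below shows a minimal Keras network for the MNIST classification task; it is purely illustrative and does not reproduce the repository's actual sample code (the layer sizes and training settings here are arbitrary choices).

```python
# Purely illustrative sketch; the samples in this repository differ in details
# such as data loading (see the next section) and job configuration.
import tensorflow as tf

# Load MNIST via the Keras downloader; the repository samples instead load a
# locally stored copy, as described in the next section.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected network for the 10-class digit classification task.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```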
2. Changes made to support loading of pre-downloaded datasets
It is worth mentioning that all the code samples were taken from the
corresponding framework's official samples/tutorials repository, as
practitioners are likely familiar with these (links to the original code samples
are included in the directory-local README.md
). However, the original examples
are designed to automatically download the required dataset in a
framework-defined directory. This is not feasible on supercomputers, as compute nodes do not have access to the Internet. Therefore,
the samples have been slightly modified to load data from the datasets
directory included in this repository; specific code changes, at least for now,
have been marked by comments prefixed with the [HPCNS]
tag. For more
information see the README.md
available in the datasets
directory.
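As an illustration of the kind of change this involves, the sketch below loads MNIST directly from a local copy of the dataset; the path and file name used here are assumptions made for the example, so please consult the datasets directory's README.md for the repository's actual layout.

```python
# [HPCNS] Illustrative sketch only; the actual path and file names in the
# repository's "datasets" directory may differ (see datasets/README.md).
import os
import numpy as np

# [HPCNS] Hypothetical location of the pre-downloaded MNIST archive within
# the cloned repository, instead of a framework-managed download directory.
mnist_path = os.path.join('..', 'datasets', 'mnist', 'mnist.npz')

# Load the archive directly from disk; no Internet access is required,
# which is essential on compute nodes.
with np.load(mnist_path) as data:
    x_train, y_train = data['x_train'], data['y_train']
    x_test, y_test = data['x_test'], data['y_test']
```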
3. Applying for user accounts on supercomputers
In case you do not already have an account on your supercomputer of interest, please refer to the instructions available here, as you will need to apply for computing time before an account is created for you.
4. Logging on to the supercomputers
Note: From here on it is assumed that you already have an account on your required supercomputer.
Note: This tutorial is supported for the following supercomputers: JURECA, JUWELS, JUWELS Booster, and JUSUF.
The following steps are required to log in (more information: JURECA, JUWELS, JUSUF).
For the purpose of this tutorial, we will assume that our system of interest is JURECA. If you intend to use a different system, you can simply replace the system name in the commands below; the procedure is precisely the same for all machines.
- Use SSH to login:
  ssh -i ~/.ssh/<keyfile> <username>@jureca.fz-juelich.de
- Upon successful login, activate your project environment:
  jutil env activate -p <name of compute project> -A <name of budget>
  Note: To view a list of all project and budget names available to you, please use the following command: jutil user projects -o columns. Each name under the column titled "project" has a corresponding type under the column titled "project-type". All projects with "project-type" "C" are compute projects, and can be used in the <name of compute project> field for the command above. The <name of budget> field should then contain the corresponding name under the "budgets" column. Please click here for more information.
- Change to the project directory:
  cd $PROJECT
You should be in your project directory at this point. As the project directory is shared with other project members, it is recommended to create a new directory with your username, and change to that directory. If you'd like to clone this repository elsewhere, please change to the directory of your choice.
5. Cloning the repository
In order to store the datasets within the repository, we use Git LFS. This makes cloning the repository slightly different; please follow the instructions below. To learn more about Git LFS, click here.
- Load the Git LFS module:
  module load git-lfs
- Initialize Git LFS:
  git lfs install
- Clone the repository, including the datasets:
  git lfs clone https://gitlab.version.fz-juelich.de/hpc4ns/dl_on_supercomputers.git
6. Running a sample
Let us consider a scenario where you would like to run the keras_mnist.py
sample available in the tensorflow
directory. This sample trains a CNN on
MNIST using Tensorflow's Keras API. The following steps can be used to run the sample:
- Change directory to the repository root:
  cd dl_on_supercomputers
- Change to the tensorflow sub-directory:
  cd tensorflow
- Submit the job to run the sample:
  sbatch jureca_job.sh
That's it; this is all you need for job submission. If you'd like to receive
email notifications regarding the status of the job, add the following statement
to the "SLURM job configuration" block in the jureca_job.sh
script (replace
<your email address here>
with your email address).
#SBATCH --mail-user=<your email address here>
Output from the job is available in the error
and output
files, as specified
in the job configuration.
Note: The job scripts for all systems are almost exactly the same, except for the --partition value, because partition names vary from system to system. Nevertheless, for each system, this tutorial uses
the corresponding development partition, e.g., dc-gpu-devel
on JURECA. This is
because jobs are often (but not always) scheduled faster on this partition than
the production partition. However, resources in the development partitions are
limited (as described in: JURECA,
JUWELS,
and JUSUF).
Therefore, it is highly recommended that users familiarize themselves with the
limitations, and use the production partition for all production use, as well as
when developing/testing with more resources than are available on the
development partition.
7. Distributed training
Horovod provides a simple and efficient solution for training artificial neural networks on multiple GPUs across multiple nodes in a cluster. It can be used with Tensorflow (some other frameworks are supported as well). Since this tutorial primarily concerns distributed training, only code samples that utilize Horovod are included.
Please note that Horovod currently only supports a distribution strategy where the entire model is replicated on every GPU. It is the data that is distributed across the GPUs. If you are interested in model-parallel training, where the model itself can be split and distributed, a different solution is required. We hope to add a sample for model-parallel training at a later time.
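The sketch below illustrates this data-parallel pattern using Horovod's Keras API; it is a minimal, assumption-laden example (the model, dataset handling, and hyperparameters are illustrative) and not the repository's actual sample code.

```python
# Minimal data-parallel sketch with Horovod and the Keras API (TF2). This
# illustrates the distribution mechanism described above; it is not the
# repository's actual sample code.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each process to a single GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Every process holds a full replica of the model; only the data is split.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
dataset = dataset.shard(hvd.size(), hvd.rank()).shuffle(10000).batch(128)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Wrap the optimizer so gradients are averaged across all processes; the
# learning rate is scaled by the number of processes, a common convention.
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Broadcast the initial weights from rank 0 so all replicas start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(dataset, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Such a program is typically launched with one process per GPU (e.g., via MPI); on the systems covered here, the provided job scripts take care of that step when the job is submitted.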
8. Credits
- Created by: Fahad Khalid (SLNS/HPCNS, JSC)
- Installation of modules on JURON: Andreas Herten (HPCNS, JSC)
- Installation of modules on JURECA: Damian Alvarez (JSC), Rajalekshmi Deepu (SLNS/HPCNS, JSC)
- Review/suggestions/testing: Kai Krajsek (SLNS/HPCNS, JSC), Tabea Kirchner (SLNS/HPCNS, JSC), Susanne Wenzel (INM-1)