Getting started with Deep Learning on Supercomputers
This repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers available at the Jülich Supercomputing Centre (JSC) for deep learning-based projects. It is assumed that the reader is proficient in one or more of the following frameworks:
- TensorFlow
- Keras
- Horovod
- Caffe (limited support)
Note: This tutorial is by no means intended as an introduction to deep learning, or to any of the above-mentioned frameworks. If you are interested in educational resources for beginners, please visit this page.
Announcements
- TensorFlow and Keras examples (with and without Horovod) are now fully functional on JUWELS as well.
- Python 2 support has been removed from the tutorial for all frameworks except Caffe.
- Even though PyTorch is available as a system-wide module on the JSC supercomputers, all PyTorch examples have been removed from this tutorial. This is because the tutorial developers are not currently working with PyTorch, and are therefore not in a position to provide support for PyTorch-related issues.
Table of contents
- A word regarding the code samples
- Changes made to support loading of pre-downloaded datasets
- Applying for user accounts on supercomputers
- Logging on to the supercomputers
- Cloning the repository
- Running a sample
- Python 2 support
- Distributed training
- Credits
1. A word regarding the code samples
Samples for each framework are available in the correspondingly named directory. Each such
directory typically contains at least one code sample, which trains a simple artificial neural
network on the canonical MNIST hand-written digit classification task. Moreover, job submission
scripts are included for all the supercomputers on which this tutorial has been tested. The job
scripts will hopefully make it easier to figure out which modules to load. Finally, a README.md file contains further information about the contents of the directory.
Disclaimer: The samples are neither intended to serve as examples of optimized code, nor do they represent programming best practices.
2. Changes made to support loading of pre-downloaded datasets
It is worth mentioning that all the code samples were taken from the corresponding framework's official samples/tutorials repository, as practitioners are likely familiar with these (links to the original code samples are included in the directory-local README.md). However, the original examples are designed to automatically download the required dataset into a framework-defined directory. This is not a feasible option, as compute nodes on the supercomputers do not have access to the Internet. Therefore, the samples have been slightly modified to load data from the datasets directory included in this repository; specific code changes, at least for now, have been marked by comments prefixed with the [HPCNS] tag. For more information, see the README.md available in the datasets directory.
3. Applying for user accounts on supercomputers
In case you do not already have an account on your supercomputer of interest, please take a look at the instructions provided in the following sub-sections.
3.1 JURECA and JUWELS
For more information on getting accounts on JURECA and JUWELS, click here.
3.2 JURON
To get a user account on JURON, please follow the steps below:
- Write an email to Dirk Pleiter, in which please introduce yourself and mention why you need the account.
- Apply for the account via the JuDoor portal (more information about JuDoor is available here). If your work is related to the Human Brain Project (HBP), please join the PCP0 and CPCP0 projects. Otherwise, please join the PADC and CPADC projects.
4. Logging on to the supercomputers
Note: From here on, it is assumed that you already have an account on the supercomputer you intend to use.
4.1 JURECA and JUWELS
Following are the steps required to log in (more information: JURECA, JUWELS).

- Use SSH to log in. Use one of the following commands, depending on your target system:
  ssh <username>@jureca.fz-juelich.de
  or
  ssh <username>@juwels.fz-juelich.de
- Upon successful login, activate your project environment:
  jutil env activate -p <name of compute project> -A <name of budget>
  Note: To view a list of all project and budget names available to you, please use the following command: jutil user projects -o columns. Each name under the column titled "project" has a corresponding type under the column titled "project-type". All projects with "project-type" "C" are compute projects, and can be used in the <name of compute project> field for the command above. The <name of budget> field should then contain the corresponding name under the "budgets" column. Please click here for more information.
- Change to the project directory:
  cd $PROJECT
  You should be in your project directory at this point. As the project directory is shared with other project members, it is recommended to create a new directory named after your username and change to that directory. If you'd like to clone this repository elsewhere, please change to the directory of your choice.
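Putting the steps above together, a minimal login-and-setup sequence for JUWELS might look like the sketch below. The project and budget names are placeholders, and using $USER for the personal directory is just one possible convention.

```bash
# Log in (use jureca.fz-juelich.de instead for JURECA)
ssh <username>@juwels.fz-juelich.de

# Activate your compute project and budget
# (list the names available to you with: jutil user projects -o columns)
jutil env activate -p <name of compute project> -A <name of budget>

# Change to the shared project directory and create a personal sub-directory
cd $PROJECT
mkdir -p $USER
cd $USER
```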
4.2 JURON
Following are the steps required to log in.

- Use SSH to log in:
  ssh <username>@juron.fz-juelich.de
- Upon successful login, activate your project environment (more information here):
  jutil env activate -p <name of compute project>
  The <name of compute project> can be either CPCP0 or CPADC, depending on whether you are a member of CPCP0 or CPADC (to view a list of all project names available to you, please use the following command: jutil user projects -o columns). Note that, as opposed to the corresponding section for JURECA and JUWELS, the <name of budget> is not included. This is because the CPCP0 and CPADC projects do not support accounting.
- Change to the project directory:
  cd $PROJECT
  You should be in your project directory at this point. As the CPCP0 and CPADC project directories are shared amongst many users from different institutes and organizations, it is recommended to create a personal directory (named after your username) within the project directory. You can then use your personal directory for all your work, including cloning this tutorial.
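The equivalent sequence on JURON is sketched below; note the absence of the -A budget flag (again, naming the personal directory after $USER is just a suggestion).

```bash
# Log in to JURON
ssh <username>@juron.fz-juelich.de

# Activate the project environment; no budget is required for CPCP0/CPADC
jutil env activate -p CPADC    # or CPCP0, depending on your membership

# Create and switch to a personal directory inside the shared project directory
cd $PROJECT
mkdir -p $USER
cd $USER
```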
5. Cloning the repository
In order to store the datasets within the repository, we use Git LFS. This makes cloning the repository a little bit different. Please find below the instructions on how to clone on different systems. To learn more about Git LFS, click here.
5.1 JURECA and JUWELS
- Load the Git LFS module:
  module load git-lfs
- Initialize Git LFS:
  git lfs install
- Clone the repository, including the datasets:
  git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git
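For example, assuming you start from your personal directory under $PROJECT (see the previous section), the three steps combine as follows:

```bash
module load git-lfs    # make the git-lfs extension available
git lfs install        # one-time initialization for your user account
git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git
```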
5.2 JURON
The process is simpler on JURON. You can clone the repository along with the datasets using the following command:
git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git
6. Running a sample
Let us consider a scenario where you would like to run the mnist.py sample available in the keras directory. This sample trains a CNN on MNIST using Keras on a single GPU. The following sub-sections list the steps required for different supercomputers.
6.1 JURECA and JUWELS
- Change directory to the repository root:
  cd ml_dl_on_supercomputers
- Change to the keras sub-directory:
  cd keras
- Submit the job to run the sample:
  sbatch submit_job_jureca.sh
  or
  sbatch submit_job_juwels.sh

That's it; this is all you need for job submission. If you'd like to receive email notifications regarding the status of the job, add the following statement to the "SLURM job configuration" block in the submit_job_jureca.sh (or submit_job_juwels.sh) script (replace <your email address here> with your email address):

#SBATCH --mail-user=<your email address here>

Output from the job is available in the error and output files, as specified in the job configuration.
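For orientation, a single-GPU job script of this kind typically looks roughly like the sketch below. The resource, module, and file specifications here are illustrative placeholders; the submit_job_jureca.sh and submit_job_juwels.sh scripts shipped with the repository contain the exact, tested configuration.

```bash
#!/usr/bin/env bash
# SLURM job configuration (illustrative values only)
#SBATCH --job-name=mnist_keras
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1                    # request a single GPU
#SBATCH --time=00:10:00
#SBATCH --output=output.%j              # stdout file
#SBATCH --error=error.%j                # stderr file
#SBATCH --mail-user=<your email address here>
#SBATCH --mail-type=ALL

# Load the required modules (the exact names depend on the current software stage)
module load <required modules>

# Run the sample on the allocated GPU
srun python -u mnist.py
```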
6.2 JURON
- Change directory to the repository root:
  cd ml_dl_on_supercomputers
- Change to the keras sub-directory:
  cd keras
- Submit the job to run the sample:
  bsub < submit_job_juron_python3.sh
Please note that unlike JURECA and JUWELS, JURON uses LSF for job submission, which is why a different syntax is required for job configuration and submission. Moreover, email notifications are not supported on JURON. For more information on how to use LSF on JURON, use the following command:
man 7 juron-lsf
Output from the job is available in the error and output files, as specified in the job configuration.
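As a rough LSF counterpart to the SLURM sketch above, a JURON job script is structured along the following lines. All values are illustrative; GPU resources are requested via additional directives documented in man 7 juron-lsf, and the provided submit_job_juron_python3.sh script remains the authoritative reference.

```bash
#!/usr/bin/env bash
# LSF job configuration (illustrative values only)
#BSUB -J mnist_keras     # job name
#BSUB -o output.%J       # stdout file
#BSUB -e error.%J        # stderr file
#BSUB -n 1               # number of tasks
#BSUB -W 10              # wall-clock limit in minutes
# (GPU-related resource requests go here; see man 7 juron-lsf)

# Load the required modules (the exact names depend on the system configuration)
module load <required modules>

python -u mnist.py
```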
7. Python 2 support
As official support for Python 2 will be discontinued in 2020, we decided to encourage our users to make the switch to Python 3 now. This also enables us to provide better support for Python 3-based modules, as we no longer have to spend time maintaining Python 2 modules.
The only exception is Caffe, which is available with Python 2 only on JURECA. Please note, however, that apart from JURON, Caffe is only available in JURECA Stage 2018b, i.e., one of the previous stages. We do not intend to provide support for Caffe from Stage 2019a onward, as Caffe is no longer being actively developed.
8. Distributed training
Horovod provides a simple and efficient solution for training artificial neural networks on multiple GPUs across multiple nodes in a cluster. It can be used with TensorFlow and Keras (some other frameworks are supported as well, but not Caffe). In this repository, the horovod directory contains further sub-directories, one for each compatible framework that has been tested. E.g., there is a keras sub-directory that contains samples that utilize distributed training with Keras and Horovod (more information is available in the directory-local README.md).
Please note that Horovod currently only supports a distribution strategy where the entire model is replicated on all GPUs. It is the data that is distributed across the GPUs. If you are interested in model-parallel training, where the model itself can be split and distributed, a different solution is required. We hope to add a sample for model-parallel training at a later time.
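In practice, data-parallel training with Horovod is launched as one process (MPI rank) per GPU. On the SLURM-based systems this amounts to requesting several tasks and letting srun start the replicas, roughly as in the hypothetical excerpt below (the script name and GPU counts are placeholders; the job scripts in the horovod directory show the tested configuration):

```bash
# Request 2 nodes with 4 tasks (and 4 GPUs) each, i.e. 8 Horovod ranks in total
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

# srun starts one Python process per task; Horovod averages the
# gradients across all ranks during training
srun python -u mnist_keras_horovod.py
```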
Caffe does not support multi-node training. However, it has built-in support for multi-GPU training on a single node (only via the C/C++ interface). The mnist_cmd sample in the caffe directory contains the job script that can be used to train the model on multiple GPUs. Please see the directory-local README.md for further information.
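For reference, multi-GPU training via the Caffe command-line interface is enabled by the -gpu flag of caffe train, e.g. along these lines (the solver file name is illustrative; the mnist_cmd job script contains the actual invocation):

```bash
# Train on all GPUs visible on the node; a comma-separated list of IDs also works
caffe train -solver lenet_solver.prototxt -gpu all
```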
9. Credits
- Created by: Fahad Khalid (SLNS/HPCNS, JSC)
- Installation of modules on JURON: Andreas Herten (HPCNS, JSC)
- Installation of modules on JURECA: Damian Alvarez (JSC), Rajalekshmi Deepu (SLNS/HPCNS, JSC)
- Review/suggestions/testing: Kai Krajsek (SLNS/HPCNS, JSC), Tabea Kirchner (SLNS/HPCNS, JSC), Susanne Wenzel (INM-1)