Getting started with Deep Learning on Supercomputers

This repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers available at the Jülich Supercomputing Centre (JSC) for deep learning based projects. It is assumed that the reader is proficient in one or more of the frameworks covered in this tutorial, i.e., Tensorflow, Keras, and Caffe (with or without Horovod for distributed training).

Note: This tutorial is by no means intended as an introduction to deep learning, or to any of the above mentioned frameworks. If you are interested in educational resources for beginners, please visit this page.

Announcements

  1. Tensorflow and Keras examples (with and without Horovod) are now fully functional on JUWELS as well.
  2. Python 2 support has been removed from the tutorial for all frameworks except Caffe.
  3. Even though PyTorch is available as a system-wide module on the JSC supercomputers, all PyTorch examples have been removed from this tutorial. This is because the tutorial developers are not currently working with PyTorch, and are therefore not in a position to provide support for PyTorch-related issues.

Table of contents

  1. A word regarding the code samples
  2. Changes made to support loading of pre-downloaded datasets
  3. Applying for user accounts on supercomputers
  4. Logging on to the supercomputers
  5. Cloning the repository
  6. Running a sample
  7. Python 2 support
  8. Distributed training
  9. Credits

1. A word regarding the code samples

Samples for each framework are available in the correspondingly named directory. Each such directory typically contains at least one code sample, which trains a simple artificial neural network on the canonical MNIST hand-written digit classification task. Moreover, job submission scripts are included for all the supercomputers on which this tutorial has been tested. The job scripts will hopefully make it easier to figure out which modules to load. Finally, a README.md file contains further information about the contents of the directory.

Disclaimer: The samples are not intended to serve as examples of optimized code, nor do they represent programming best practices.

2. Changes made to support loading of pre-downloaded datasets

It is worth mentioning that all the code samples were taken from the corresponding framework's official samples/tutorials repository, as practitioners are likely familiar with these (links to the original code samples are included in the directory-local README.md). However, the original examples are designed to automatically download the required dataset into a framework-defined directory. This is not feasible, as compute nodes on the supercomputers do not have access to the Internet. Therefore, the samples have been slightly modified to load data from the datasets directory included in this repository; the specific code changes, at least for now, are marked by comments prefixed with the [HPCNS] tag. For more information see the README.md available in the datasets directory.
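
As an illustration of the kind of change involved, the sketch below shows a Keras-style MNIST example that reads the copy of the dataset shipped in the datasets directory instead of calling a download routine. This is a hypothetical sketch only; the path and file name are assumptions, and the actual modifications in the samples are the ones marked with [HPCNS] comments.

    import os
    import numpy as np

    # [HPCNS] Original behaviour (requires Internet access on the compute node):
    # (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # [HPCNS] Modified behaviour: load the pre-downloaded copy from this repository.
    # The path and file name below are hypothetical; see the README.md in the
    # datasets directory for the actual layout.
    data_file = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                             '..', 'datasets', 'mnist', 'mnist.npz')
    with np.load(data_file) as data:
        x_train, y_train = data['x_train'], data['y_train']
        x_test, y_test = data['x_test'], data['y_test']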

3. Applying for user accounts on supercomputers

In case you do not already have an account on your supercomputer of interest, please take a look at the instructions provided in the following sub-sections.

3.1 JURECA and JUWELS

For more information on getting accounts on JURECA and JUWELS, click here.

3.2 JURON

To get a user account on JURON, please follow the steps below:

  1. Write an email to Dirk Pleiter, introducing yourself and explaining why you need the account.
  2. Apply for the account via the JuDoor portal (more information about JuDoor is available here). If your work is related to the Human Brain Project (HBP), please join the PCP0 and CPCP0 projects. Otherwise please join the PADC and CPADC projects.

4. Logging on to the supercomputers

Note: From here on it is assumed that you already have an account on your required supercomputer.

4.1 JURECA and JUWELS

Following are the steps required to log in (more information: JURECA, JUWELS).

  1. Use SSH to log in, using one of the following commands depending on your target system:

    ssh <username>@jureca.fz-juelich.de or ssh <username>@juwels.fz-juelich.de

  2. Upon successful login, activate your project environment:

    jutil env activate -p <name of compute project> -A <name of budget>

    Note: To view a list of all project and budget names available to you, please use the following command: jutil user projects -o columns. Each name under the column titled "project" has a corresponding type under the column titled "project-type". All projects whose "project-type" is "C" are compute projects, and can be used in the <name of compute project> field for the command above. The <name of budget> field should then contain the corresponding name under the "budgets" column. Please click here for more information.

  3. Change to the project directory:

    cd $PROJECT

You should be in your project directory at this point. As the project directory is shared with other project members, it is recommended to create a new directory with your username, and change to that directory. If you'd like to clone this repository elsewhere, please change to the directory of your choice.

4.2 JURON

Following are the steps required to log in.

  1. Use SSH to log in:

    ssh <username>@juron.fz-juelich.de

  2. Upon successful login, activate your project environment (more information here).

    jutil env activate -p <name of compute project>

    The <name of compute project> can be either CPCP0 or CPADC, depending on which of these projects you are a member of (to view a list of all project names available to you, please use the following command: jutil user projects -o columns). Note that, unlike on JURECA and JUWELS, no <name of budget> is included; this is because the CPCP0 and CPADC projects do not support accounting.

  3. Change to the project directory:

    cd $PROJECT

You should be in your project directory at this point. As the CPCP0 and CPADC project directories are shared amongst many users from different institutes and organizations, it is recommended to create a personal directory (named after your username) within the project directory. You can then use your personal directory for all your work, including cloning this tutorial.

5. Cloning the repository

In order to store the datasets within the repository, we use Git LFS. This makes cloning the repository a little bit different. Please find below the instructions on how to clone on different systems. To learn more about Git LFS, click here.

5.1 JURECA and JUWELS

  1. Load the Git LFS module:

    module load git-lfs

  2. Initialize Git LFS:

    git lfs install

  3. Clone the repository, including the datasets:

    git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git

5.2 JURON

The process is simpler on JURON. You can simply clone the repository along with the datasets using the following command:

git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git

6. Running a sample

Let us consider a scenario where you would like to run the mnist.py sample available in the keras directory. This sample trains a CNN on MNIST using Keras on a single GPU. The following sub-sections list the steps required for different supercomputers.

6.1 JURECA and JUWELS

  1. Change directory to the repository root:

    cd ml_dl_on_supercomputers

  2. Change to the keras sub-directory:

    cd keras

  3. Submit the job to run the sample:

    sbatch submit_job_jureca.sh or sbatch submit_job_juwels.sh

That's it; this is all you need for job submission. If you'd like to receive email notifications regarding the status of the job, add the following statement to the "SLURM job configuration" block in the submit_job_jureca.sh (or submit_job_juwels.sh) script (replace <your email address here> with your email address).

#SBATCH --mail-user=<your email address here>

Output from the job is available in the error and output files, as specified in the job configuration.

6.2 JURON

  1. Change directory to the repository root:

    cd ml_dl_on_supercomputers

  2. Change to the keras sub-directory:

    cd keras

  3. Submit the job to run the sample:

    bsub < submit_job_juron_python3.sh

Please note that unlike JURECA and JUWELS, JURON uses LSF for job submission, which is why a different syntax is required for job configuration and submission. Moreover, email notifications are not supported on JURON. For more information on how to use LSF on JURON, use the following command:

man 7 juron-lsf

Output from the job is available in the error and output files, as specified in the job configuration.

7. Python 2 support

As official support for Python 2 will be discontinued in 2020, we decided to encourage our users to switch to Python 3 now. This also enables us to provide better support for Python 3 based modules, as we no longer have to spend time maintaining Python 2 modules.

The only exception is Caffe, which on JURECA is available with Python 2 only. Please note, however, that apart from JURON, Caffe is only available in JURECA Stage 2018b, i.e., one of the previous software stages. We do not intend to provide support for Caffe in Stage 2019a and onward, as Caffe is no longer being actively developed.

8. Distributed training

Horovod provides a simple and efficient solution for training artificial neural networks on multiple GPUs across multiple nodes in a cluster. It can be used with Tensorflow and Keras (some other frameworks are supported as well, but not Caffe). In this repository, the horovod directory contains further sub-directories; one for each compatible framework that has been tested. E.g., there is a keras sub-directory that contains samples that utilize distributed training with Keras and Horovod (more information is available in the directory-local README.md).

Please note that Horovod currently only supports a distribution strategy where the entire model is replicated on all GPUs. It is the data that is distributed across the GPUs. If you are interested in model-parallel training, where the model itself can be split and distributed, a different solution is required. We hope to add a sample for model-parallel training at a later time.
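
For orientation, below is a minimal sketch of this data-parallel pattern, written against the TF1-era horovod.keras API. It is not the tested sample from the horovod/keras directory; the random placeholder data merely stands in for the locally loaded MNIST, and the sketch only illustrates the typical steps: initialize Horovod, pin one GPU per process, scale the learning rate, wrap the optimizer, and broadcast the initial weights.

    import numpy as np
    import tensorflow as tf
    import keras
    import horovod.keras as hvd

    hvd.init()  # one process per GPU, launched e.g. via the job script

    # Pin each process to its own GPU
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    keras.backend.set_session(tf.Session(config=config))

    # Placeholder data; the repository's samples load MNIST from the local datasets directory
    x_train = np.random.rand(512, 784).astype('float32')
    y_train = keras.utils.to_categorical(np.random.randint(0, 10, 512), 10)

    model = keras.models.Sequential([
        keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        keras.layers.Dense(10, activation='softmax'),
    ])

    # Scale the learning rate by the number of workers, then wrap the optimizer
    # so that gradients are averaged across all GPUs at every step
    opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(1.0 * hvd.size()))
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

    callbacks = [
        # Ensure that every worker starts from the same initial weights
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]

    # Reduce log noise: only rank 0 prints training progress
    model.fit(x_train, y_train, batch_size=128, epochs=1,
              callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

In this setup every process runs the same script; the job script determines how many processes, and therefore how many GPUs, take part in training.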

Caffe does not support multi-node training. However, it has built-in support for multi-GPU training on a single node (only via the C/C++ interface). The mnist_cmd sample in the caffe directory contains the job script that can be used to train the model on multiple GPUs. Please see the directory-local README.md for further information.

9. Credits

  • Created by: Fahad Khalid (SLNS/HPCNS, JSC)
  • Installation of modules on JURON: Andreas Herten (HPCNS, JSC)
  • Installation of modules on JURECA: Damian Alvarez (JSC), Rajalekshmi Deepu (SLNS/HPCNS, JSC)
  • Review/suggestions/testing: Kai Krajsek (SLNS/HPCNS, JSC), Tabea Kirchner (SLNS/HPCNS, JSC), Susanne Wenzel (INM-1)