
    Getting started with ML/DL on Supercomputers

    This repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers available at the Jülich Supercomputing Centre (JSC) for ML/DL-related projects. It is assumed that the reader is proficient in one or more of the ML/DL frameworks for which code samples are included in this repository.

    Note: This tutorial is by no means intended as an introduction to ML/DL, or to any of these frameworks. If you are interested in educational resources for beginners, please visit this page.

    Note: This tutorial does not support JUWELS at the moment. We hope to include the steps for JUWELS soon.

    Table of contents

    1. A word regarding the code samples
    2. Changes made to support loading of pre-downloaded datasets
    3. Applying for user accounts on supercomputers
    4. Logging on to the supercomputers
    5. Cloning the repository
    6. Running a sample
    7. Python 2 support
    8. Distributed training
    9. Credits

    1. A word regarding the code samples

    Samples for each framework are available in the correspondingly named directory. Each such directory typically contains at least one code sample, which trains a simple artificial neural network on the canonical MNIST hand-written digit classification task. Moreover, job submission scripts are included for all the supercomputers on which this tutorial has been tested. The job scripts will hopefully make it easier to figure out which modules to load. Finally, a README.md file contains further information about the contents of the directory.

    Disclaimer: The samples are not intended to serve as examples of optimized code, nor do they represent programming best practices.

    2. Changes made to support loading of pre-downloaded datasets

    It is worth mentioning that all the code samples were taken from the corresponding framework's official samples/tutorials repository, as practitioners are likely familiar with these (links to the original code samples are included in the directory-local README.md). However, the original examples are designed to automatically download the required dataset in a framework-defined directory. This is not a feasible option as compute nodes on the supercomputers do not have access to the Internet. Therefore, the samples have been slightly modified to load data from the datasets directory included in this repository; specific code changes, at least for now, have been marked by comments prefixed with the [HPCNS] tag. For more information see the README.md available in the datasets directory.
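
    As a purely hypothetical illustration of this kind of change (the file name and path below are made up; the actual modifications in the samples are marked with [HPCNS]-prefixed comments), a download-based loader can be replaced with one that reads from the repository-local datasets directory:

      # [HPCNS] Hypothetical illustration only: instead of letting the framework
      # download MNIST (not possible on the compute nodes), read a pre-downloaded
      # copy from the 'datasets' directory shipped with this repository. The file
      # name and directory layout below are assumptions, not the actual files.
      import os
      import numpy as np

      # Original (requires Internet access):
      # from keras.datasets import mnist
      # (x_train, y_train), (x_test, y_test) = mnist.load_data()

      # [HPCNS] Modified: load the data from the local datasets directory
      data_dir = os.path.join(os.path.dirname(__file__), '..', 'datasets', 'mnist')
      with np.load(os.path.join(data_dir, 'mnist.npz')) as data:  # hypothetical file name
          x_train, y_train = data['x_train'], data['y_train']
          x_test, y_test = data['x_test'], data['y_test']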

    3. Applying for user accounts on supercomputers

    In case you do not already have an account on your supercomputer of interest, please take a look at the instructions provided in the following sub-sections.

    3.1 JURECA and JUWELS

    For more information on getting accounts on JURECA and JUWELS, click here.

    3.2 JURON

    To get a user account on JURON, please follow the steps below:

    1. Write an email to Dirk Pleiter, introducing yourself and mentioning why you need the account.
    2. Apply for the account via the JuDoor portal (more information about JuDoor is available here). If your work is related to the Human Brain Project (HBP), please join the PCP0 and CPCP0 projects. Otherwise please join the PADC and CPADC projects.

    4. Logging on to the supercomputers

    Note: From here on, it is assumed that you already have an account on the supercomputer you intend to use.

    4.1 JURECA

    Following are the steps required to log in (more information here).

    1. Use SSH to log in:

      ssh <username>@jureca.fz-juelich.de

    2. Upon successful login, activate your project environment:

      jutil env activate -p <name of compute project> -A <name of budget>

      Note: To view a list of all project and budget names available to you, please use the following command: jutil user projects -o columns. Under the column titled "project", all names that start with the prefix "c" are compute projects, and can be used in the <name of compute project> field for the command above. The <name of budget> field should then contain the corresponding name under the "budgets" column. Please click here for more information.

    3. Change to the project directory:

      cd $PROJECT

    You should be in your project directory at this point. As the project directory is shared with other project members, it is recommended to create a new directory with your username, and change to that directory. If you'd like to clone this repository elsewhere, please change to the directory of your choice.

    4.2 JURON

    Following are the steps required to log in.

    1. Use SSH to log in:

      ssh <username>@juron.fz-juelich.de

    2. Upon successful login, activate your project environment (more information here).

      jutil env activate -p <name of compute project>

      The <name of compute project> is either CPCP0 or CPADC, depending on which of the two projects you are a member of (to view a list of all project names available to you, please use the following command: jutil user projects -o columns). Note that, unlike on JURECA, no <name of budget> is specified; this is because the CPCP0 and CPADC projects do not support accounting.

    3. Change to the project directory:

      cd $PROJECT

    You should be in your project directory at this point. As the CPCP0 and CPADC project directories are shared amongst many users from different institutes and organizations, it is recommended to create a personal directory (named after your username) within the project directory. You can then use your personal directory for all your work, including cloning this tutorial.

    5. Cloning the repository

    In order to store the datasets within the repository, we use Git LFS. This makes cloning the repository a little bit different. Please find below the instructions on how to clone on different systems. To learn more about Git LFS, click here.

    5.1 JURECA

    1. Load the required module stage:

      module use /usr/local/software/jureca/OtherStages
      module load Stages/2018b

    2. Load the Git LFS module:

      module load git-lfs/2.6.1

    3. Initialize Git LFS:

      git lfs install

    4. Clone the repository, including the datasets:

      git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git

    5.2 JURON

    The process is simpler on JURON. You can clone the repository along with the datasets using the following command:

    git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git

    6. Running a sample

    Let us consider a scenario where you would like to run the mnist.py sample available in the keras directory. This sample trains a CNN on MNIST using Keras on a single GPU. The following sub-sections list the steps required for different supercomputers.

    6.1 JURECA

    1. Change directory to the repository root:

      cd ml_dl_on_supercomputers

    2. Change to the keras sub-directory:

      cd keras

    3. Submit the job to run the sample:

      sbatch submit_job_jureca_python3.sh

    That's it; this is all you need for job submission. If you'd like to receive email notifications regarding the status of the job, add the following statement to the "SLURM job configuration" block in the submit_job_jureca_python3.sh script (replace <your email address here> with your email address).

    #SBATCH --mail-user=<your email address here>

    Output from the job is available in the error and output files, as specified in the job configuration.

    6.2 JURON

    1. Change directory to the repository root:

      cd ml_dl_on_supercomputers

    2. Change to the keras sub-directory:

      cd keras

    3. Submit the job to run the sample:

      bsub < submit_job_juron_python3.sh

    Please note that unlike JURECA, JURON uses LSF for job submission, which is why a different syntax is required for job configuration and submission. Moreover, email notifications are not supported on JURON. For more information on how to use LSF on JURON, use the following command:

    man 7 juron-lsf

    Output from the job is available in the error and output files, as specified in the job configuration.

    7. Python 2 support

    All the code samples are compatible with both Python 2 and Python 3. However, not all frameworks on all machines are available for Python 2 (yet); in certain cases these are only available for Python 3. We have included separate job submission scripts for Python 2 and Python 3. In cases where Python 2 is not supported, only the job submission script for Python 3 is available. We will try our best to make all frameworks available with Python 2 as well, but this will not be a priority, as official support for Python 2 will be discontinued in 2020.

    8. Distributed training

    Horovod provides a simple and efficient solution for training artificial neural networks on multiple GPUs across multiple nodes in a cluster. It can be used with TensorFlow, Keras, and PyTorch (some other frameworks are supported as well, but not Caffe). In this repository, the horovod directory contains further sub-directories; one for each compatible framework that has been tested. For example, there is a keras sub-directory that contains samples that utilize distributed training with Keras and Horovod (more information is available in the directory-local README.md).

    Please note that Horovod currently only supports data-parallel training: the entire model is replicated on every GPU, and it is the training data that is split and distributed across the GPUs. If you are interested in model-parallel training, where the model itself can be split and distributed, a different solution is required. We hope to add a sample for model-parallel training at a later time.
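
    The core pattern is sketched below with Keras. This is a minimal, illustrative example of typical Horovod usage, not the repository's sample code (which, among other things, loads MNIST from the local datasets directory): initialize Horovod, pin each process to one GPU, scale the learning rate by the number of workers, wrap the optimizer in Horovod's distributed optimizer, and broadcast the initial weights from rank 0.

      # Minimal sketch of data-parallel training with Horovod and Keras
      # (illustrative only; see the horovod/keras samples in this repository).
      import keras
      import tensorflow as tf
      import horovod.keras as hvd
      from keras.datasets import mnist
      from keras.layers import Dense, Flatten
      from keras.models import Sequential

      hvd.init()  # one process per GPU, launched e.g. via srun or mpirun

      # Pin this process to a single GPU (TensorFlow 1.x style session configuration)
      config = tf.ConfigProto()
      config.gpu_options.visible_device_list = str(hvd.local_rank())
      keras.backend.set_session(tf.Session(config=config))

      # Note: this downloads MNIST, which the repository's samples deliberately avoid
      (x_train, y_train), _ = mnist.load_data()
      x_train = x_train / 255.0

      model = Sequential([
          Flatten(input_shape=(28, 28)),
          Dense(128, activation='relu'),
          Dense(10, activation='softmax'),
      ])

      # Scale the learning rate by the number of workers and wrap the optimizer,
      # so that gradients are averaged across all GPUs after every batch
      opt = keras.optimizers.Adam(lr=0.001 * hvd.size())
      opt = hvd.DistributedOptimizer(opt)
      model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
                    metrics=['accuracy'])

      callbacks = [
          # Start all workers from identical weights (broadcast from rank 0)
          hvd.callbacks.BroadcastGlobalVariablesCallback(0),
      ]

      # Only rank 0 prints training progress
      model.fit(x_train, y_train, batch_size=128, epochs=4,
                callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)

    A typical launch runs one such process per GPU (e.g., four tasks for four GPUs); the job submission scripts included with the Horovod samples show how they are launched on each machine.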

    Caffe does not support multi-node training. However, it has built-in support for multi-GPU training on a single node (only via the C/C++ interface). The mnist_cmd sample in the caffe directory contains the job script that can be used to train the model on multiple GPUs. Please see the directory-local README.md for further information.

    9. Credits

    • Created by: Fahad Khalid (SLNS/HPCNS, JSC)
    • Installation of modules on JURON: Andreas Herten (HPCNS, JSC)
    • Installation of modules on JURECA: Damian Alvarez (JSC), Rajalekshmi Deepu (SLNS/HPCNS, JSC)
    • Initial review/suggestions/testing: Kai Krajsek (SLNS/HPCNS, JSC), Tabea Kirchner (SLNS/HPCNS, JSC)