Commit 2646bba1 authored by Fahad Khalid
1) Combined updates from jureca_2019_a and juwels_2019a. 2) Removed python 2 job submission scripts. 3) Started making other required updates.
parent 9e7fb35d
Showing changes with 65 additions and 177 deletions
# Getting started with ML/DL on Supercomputers
# Getting started with Deep Learning on Supercomputers
This repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers
available at the JSC for ML/DL related projects. It is assumed that the reader is proficient in one or
available at the JSC for deep learning based projects. It is assumed that the reader is proficient in one or
more of the following frameworks:
* [Tensorflow](https://www.tensorflow.org/)
......@@ -10,11 +10,14 @@ more of the following frameworks:
* [Caffe](http://caffe.berkeleyvision.org/)
* [Horovod](https://github.com/horovod/horovod)
**Note:** This tutorial is by no means intended as an introduction to ML/DL, or to any of the
**Note:** This tutorial is by no means intended as an introduction to deep learning, or to any of the
above mentioned frameworks. If you are interested in educational resources for beginners, please
visit [this](https://gitlab.version.fz-juelich.de/MLDL_FZJ/MLDL_FZJ_Wiki/wikis/Education) page.
**Note:** This tutorial does not support JUWELS at the moment. We hope to include the steps for JUWELS soon.
## Announcements
1. Tensorflow and Keras examples (with and without Horovod) are now fully functional on JUWELS as well.
2. Python 2 support has been removed from the tutorial for all frameworks except Caffe.
# Table of contents
<!-- TOC -->
......@@ -88,7 +91,7 @@ Otherwise please join the `PADC` and `CPADC` projects.
**Note:** From here on it is assumed that you already have an account on your required supercomputer.
### 4.1 JURECA
### 4.1 JURECA and JUWELS
Following are the steps required to login (more information
[here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/UserInfo/QuickIntroduction.html?nn=1803700)).
......@@ -147,20 +150,13 @@ systems. To learn more about Git LFS, click [here](http://gitlab.pages.jsc.fz-ju
### 5.1 JURECA
1. Load the required module stage:
```
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
```
2. Load the Git LFS module:
1. Load the Git LFS module:
`module load git-lfs/2.6.1`
3. Initialize Git LFS:
`module load git-lfs`
2. Initialize Git LFS:
`git lfs install`
4. Clone the repository, including the datasets:
3. Clone the repository, including the datasets:
`git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git`
......
......@@ -13,11 +13,6 @@ slightly modified. Our changes are limited to,
statements that demonstrate the use of Horovod follow a comment beginning with
`[Horovod]` (as added by Horovod developers).
**Caution:** Where job submission scripts are available for both Python 2 and Python 3, please
do not submit both Python 2 and Python 3 jobs simultaneously, as one of the jobs might fail. If
you would like to try both, please run these in tandem such that the second job is started only
after the first is finished.
## Keras samples
The following Keras samples are included:
......@@ -29,6 +24,9 @@ few more advanced Horovod features are used.
## PyTorch samples
**Note:** PyTorch samples currently DO NOT work on JURECA and JUWELS. These
do however work on JURON.
The following PyTorch samples are included:
1. `mnist.py`: Demonstrates distributed training using Horovod with PyTorch. A
......@@ -36,7 +34,7 @@ simple convolutional neural network is trained on the MNIST dataset.
2. `synthetic_benchmark.py`: A benchmark that can be used to measure performance
of PyTorch with Horovod without using any external dataset.
**Note:** The job scripts for JURECA are prefixed with `.` for these samples, so that
**Note:** The job scripts for JURECA and JUWELS are prefixed with `.` for these samples, so that
these scripts do not appear in the directory listing. The reason for doing this is
that our testing revealed issues with multi-node training. As soon as the issue has
been resolved, we'll make the scripts available.
......
......@@ -35,7 +35,7 @@ batch_size = 128
num_classes = 10
# Horovod: adjust number of epochs based on number of GPUs.
epochs = int(math.ceil(12.0 / hvd.size()))
epochs = int(math.ceil(16.0 / hvd.size()))
# Input image dimensions
img_rows, img_cols = 28, 28
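
As a rough illustration of the epoch adjustment in this hunk, the sketch below reproduces the same arithmetic without importing Horovod. The helper name `epochs_per_worker` is hypothetical, and `world_size` stands in for the value that `hvd.size()` would return; neither name appears in the original scripts.

```python
import math


def epochs_per_worker(total_epochs: float, world_size: int) -> int:
    """Mirror int(math.ceil(total_epochs / hvd.size())) with an explicit size.

    `world_size` is an assumed stand-in for hvd.size(), i.e. the total
    number of Horovod processes taking part in the training run.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return int(math.ceil(total_epochs / world_size))


if __name__ == "__main__":
    # Updated setting: 16 epochs spread over 2 nodes x 4 tasks per node.
    print(epochs_per_worker(16.0, 8))
    # Previous setting: 12 epochs spread over 2 nodes x 2 tasks per node.
    print(epochs_per_worker(12.0, 4))
```

Because the result is rounded up, a worker count that does not divide the total evenly leads to slightly more total epochs than the nominal value, for example 6 epochs per worker for 16 epochs on 3 workers.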
......
......@@ -36,7 +36,7 @@ num_classes = 10
# Enough epochs to demonstrate learning rate warmup and the reduction of
# learning rate when training plateaues.
epochs = 12
epochs = 16
# Input image dimensions
img_rows, img_cols = 28, 28
......
......@@ -2,23 +2,21 @@
# Slurm job configuration
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=HOROVOD_KERAS_MNIST
#SBATCH --gres=gpu:2 --partition=develgpus
#SBATCH --gres=gpu:4 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
module load GCC/7.3.0
module load MVAPICH2/2.3-GDR
module load TensorFlow/1.12.0-GPU-Python-3.6.6
module load Keras/2.2.4-GPU-Python-3.6.6
module load Horovod/0.15.2-GPU-Python-3.6.6
module load GCC/8.3.0
module load MVAPICH2/2.3.1-GDR
module load TensorFlow/1.13.1-GPU-Python-3.6.8
module load Keras/2.2.4-GPU-Python-3.6.8
module load Horovod/0.16.2-GPU-Python-3.6.8
# Run the program
srun python -u mnist.py
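
The resource request in this job script is only consistent if `--ntasks` equals `--nodes` times `--ntasks-per-node`. The sketch below checks that, plus the assumption (not stated in the commit) that each Horovod task should have its own GPU. The function `check_slurm_layout` is hypothetical and not part of the repository; the values are copied by hand from the `#SBATCH` lines.

```python
def check_slurm_layout(nodes, ntasks, ntasks_per_node, gpus_per_node):
    """Return a list of problems found in a Slurm resource request.

    An empty list means the request is consistent under two rules:
    ntasks == nodes * ntasks_per_node, and (as an assumption made for
    this sketch) one task per GPU on every node.
    """
    problems = []
    if ntasks != nodes * ntasks_per_node:
        problems.append(
            "ntasks=%d does not match nodes*ntasks_per_node=%d"
            % (ntasks, nodes * ntasks_per_node)
        )
    if ntasks_per_node != gpus_per_node:
        problems.append(
            "ntasks_per_node=%d differs from gpus_per_node=%d"
            % (ntasks_per_node, gpus_per_node)
        )
    return problems


if __name__ == "__main__":
    # Values from the updated script: 2 nodes, 8 tasks, 4 per node, gpu:4.
    print(check_slurm_layout(nodes=2, ntasks=8, ntasks_per_node=4, gpus_per_node=4))
    # A mixed-up request: task count updated, but not the per-node settings.
    print(check_slurm_layout(nodes=2, ntasks=8, ntasks_per_node=2, gpus_per_node=2))
```

Both the previous request (4 tasks, 2 per node, `gpu:2`) and the updated one (8 tasks, 4 per node, `gpu:4`) pass this check; only a partially updated script would be flagged.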
#!/usr/bin/env bash
# Slurm job configuration
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=HOROVOD_KERAS_MNIST
#SBATCH --gres=gpu:2 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
module load GCC/7.3.0
module load MVAPICH2/2.3-GDR
module load TensorFlow/1.12.0-GPU-Python-2.7.15
module load Keras/2.2.4-GPU-Python-2.7.15
module load Horovod/0.15.2-GPU-Python-2.7.15
# Run the program
srun python -u mnist.py
#!/usr/bin/env bash
#BSUB -q normal
#BSUB -W 10
#BSUB -n 4
#BSUB -R "span[ptile=2]"
#BSUB -gpu "num=2"
#BSUB -e "error.%J.er"
#BSUB -o "output_%J.out"
#BSUB -J HOROVOD_KERAS_MNIST
# Load the required modules
module load python/2.7.14
module load tensorflow/1.12.0-gcc_5.4.0-cuda_10.0.130
module load horovod/0.15.2
module load keras/2.2.4
# Run the program
mpirun -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib python -u mnist.py
#!/usr/bin/env bash
# Slurm job configuration
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=HOROVOD_PYTORCH_MNIST
#SBATCH --gres=gpu:2 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module load GCC/7.3.0
module load MVAPICH2/2.3-GDR
module load PyTorch/1.0.0-GPU-Python-2.7.15
module load torchvision/0.2.1-GPU-Python-2.7.15
module load Horovod/0.15.2-GPU-Python-2.7.15
# Run the program
srun python -u mnist.py
......@@ -2,22 +2,21 @@
# Slurm job configuration
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=HOROVOD_TFLOW_MNIST
#SBATCH --gres=gpu:2 --partition=develgpus
#SBATCH --gres=gpu:4 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
module load GCC/7.3.0
module load MVAPICH2/2.3-GDR
module load TensorFlow/1.12.0-GPU-Python-3.6.6
module load Horovod/0.15.2-GPU-Python-3.6.6
module load GCC/8.3.0
module load MVAPICH2/2.3.1-GDR
module load TensorFlow/1.13.1-GPU-Python-3.6.8
module load Keras/2.2.4-GPU-Python-3.6.8
module load Horovod/0.16.2-GPU-Python-3.6.8
# Run the program
srun python -u mnist.py
#!/usr/bin/env bash
# Slurm job configuration
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=HOROVOD_TFLOW_MNIST
#SBATCH --gres=gpu:2 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
module load GCC/7.3.0
module load MVAPICH2/2.3-GDR
module load TensorFlow/1.12.0-GPU-Python-2.7.15
module load Horovod/0.15.2-GPU-Python-2.7.15
# Run the program
srun python -u mnist.py
#!/usr/bin/env bash
#BSUB -q normal
#BSUB -W 10
#BSUB -n 4
#BSUB -R "span[ptile=2]"
#BSUB -gpu "num=2"
#BSUB -e "error.%J.er"
#BSUB -o "output_%J.out"
#BSUB -J HOROVOD_TFLOW_MNIST
# Load the required modules
module load python/2.7.14
module load tensorflow/1.12.0-gcc_5.4.0-cuda_10.0.130
module load horovod/0.15.2
# Run the program
mpirun -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib python -u mnist.py
......@@ -12,11 +12,9 @@
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
module load GCC/7.3.0
module load TensorFlow/1.12.0-GPU-Python-3.6.6
module load Keras/2.2.4-GPU-Python-3.6.6
module load GCCcore/.8.3.0
module load TensorFlow/1.13.1-GPU-Python-3.6.8
module load Keras/2.2.4-GPU-Python-3.6.8
# Run the program
srun python -u mnist.py
#!/usr/bin/env bash
# Slurm job configuration
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --output=output_%j.out
#SBATCH --error=error_%j.er
#SBATCH --time=00:10:00
#SBATCH --job-name=KERAS_MNIST_CNN
#SBATCH --gres=gpu:1 --partition=develgpus
#SBATCH --mail-type=ALL
# Load the required modules
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
module load GCC/7.3.0
module load TensorFlow/1.12.0-GPU-Python-2.7.15
module load Keras/2.2.4-GPU-Python-2.7.15
# Run the program
srun python -u mnist.py
File moved
#!/usr/bin/env bash
#BSUB -q normal
#BSUB -W 10
#BSUB -n 1
#BSUB -R "span[ptile=1]"
#BSUB -gpu "num=1"
#BSUB -e "error.%J.er"
#BSUB -o "output_%J.out"
#BSUB -J KERAS_MNIST_CNN
# Load the required modules
module load python/2.7.14
module load tensorflow/1.12.0-gcc_5.4.0-cuda_10.0.130
module load keras/2.2.4
# Run the program
python -u mnist.py