1) Combined updates from jureca_2019_a and juwels_2019a. 2) Removed Python 2 job submission scripts. 3) Started making other required updates.

Commit 2646bba1 authored Aug 23, 2019 by Fahad Khalid
--- a/README.md
+++ b/README.md
-# Getting started with ML/DL on Supercomputers
+# Getting started with Deep Learning on Supercomputers

 This repository is intended to serve as a tutorial for anyone interested in utilizing the supercomputers 
-available at the JSC for ML/DL related projects. It is assumed that the reader is proficient in one or 
+available at the JSC for deep learning based projects. It is assumed that the reader is proficient in one or 
 more of the following frameworks:

 *    [Tensorflow](https://www.tensorflow.org/)
@@ -10,11 +10,14 @@ more of the following frameworks:
 *    [Caffe](http://caffe.berkeleyvision.org/)
 *    [Horovod](https://github.com/horovod/horovod)

-**Note:** This tutorial is by no means intended as an introduction to ML/DL, or to any of the
+**Note:** This tutorial is by no means intended as an introduction to deep learning, or to any of the
 above mentioned frameworks. If you are interested in educational resources for beginners, please
 visit [this](https://gitlab.version.fz-juelich.de/MLDL_FZJ/MLDL_FZJ_Wiki/wikis/Education) page.

-**Note:** This tutorial does not support JUWELS at the moment. We hope to include the steps for JUWELS soon.
+## Announcements
+
+1. Tensorflow and Keras examples (with and without Horovod) are now fully functional on JUWELS as well.
+2. Python 2 support has been removed from the tutorial for all frameworks except Caffe.

 # Table of contents
 <!-- TOC -->
@@ -88,7 +91,7 @@ Otherwise please join the `PADC` and `CPADC` projects.

 **Note:** From here on it is assumed that you already have an account on your required supercomputer.

-### 4.1 JURECA
+### 4.1 JURECA and JUWELS

 Following are the steps required to login (more information 
 [here](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/UserInfo/QuickIntroduction.html?nn=1803700)).
@@ -147,20 +150,13 @@ systems. To learn more about Git LFS, click [here](http://gitlab.pages.jsc.fz-ju

 ### 5.1 JURECA

-1.  Load the required module stage:
-
-    ```
-    module use /usr/local/software/jureca/OtherStages
-    module load Stages/2018b
-    ```
-
-2.  Load the Git LFS module:
+1.  Load the Git LFS module:

-    `module load git-lfs/2.6.1`
-3.  Initialize Git LFS:
+    `module load git-lfs`
+2.  Initialize Git LFS:

    `git lfs install`
-4.  Clone the repository, including the datasets:
+3.  Clone the repository, including the datasets:

    `git lfs clone https://gitlab.version.fz-juelich.de/khalid1/ml_dl_on_supercomputers.git`


--- a/horovod/README.md
+++ b/horovod/README.md
@@ -13,11 +13,6 @@ slightly modified. Our changes are limited to,
 statements that demonstrate the use of Horovod follow a comment beginning with 
 `[Horovod]` (as added by Horovod developers).

-**Caution:** Where job submission scripts are available for both Python 2 and Python 3, please 
-do not submit both Python 2 and Python 3 jobs simultaneously, as one of the jobs might fail. If 
-you would like to try both, please run these in tandem such that the second job is started only
-after the first is finished.
-
 ## Keras samples

 The following Keras samples are included:
@@ -29,6 +24,9 @@ few more advanced Horovod features are used.

 ## PyTorch samples

+**Note:** PyTorch samples currently DO NOT work on JURECA and JUWELS. These 
+do however work on JURON.
+
 The following PyTorch samples are included:

 1.  `mnist.py`: Demonstrates distributed training using Horovod with PyTorch. A 
@@ -36,7 +34,7 @@ simple convolutional neural network is trained on the MNIST dataset.
 2.  `synthetic_benchmark.py`: A benchmark that can be used to measure performance 
 of PyTorch with Horovod without using any external dataset.

-**Note:** The job scripts for JURECA are prefixed with `.` for these samples, so that 
+**Note:** The job scripts for JURECA and JUWELS are prefixed with `.` for these samples, so that 
 these scripts do not appear in the directory listing. The reason for doing this is
 that our testing revealed issues with multi-node training. As soon as the issue has 
 been resolved, we'll make the scripts available.

--- a/horovod/keras/mnist.py
+++ b/horovod/keras/mnist.py
@@ -35,7 +35,7 @@ batch_size = 128
 num_classes = 10

 # Horovod: adjust number of epochs based on number of GPUs.
-epochs = int(math.ceil(12.0 / hvd.size()))
+epochs = int(math.ceil(16.0 / hvd.size()))

 # Input image dimensions
 img_rows, img_cols = 28, 28
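
The hunk above raises the epoch budget from 12 to 16, which is then divided by the number of Horovod processes. As a rough illustration of that arithmetic, the sketch below mirrors the expression `int(math.ceil(16.0 / hvd.size()))` with a plain integer in place of `hvd.size()`, so it runs without Horovod installed. The helper name `epochs_per_run` and the worker counts in the loop are hypothetical and not part of the repository.

```python
import math


def epochs_per_run(total_epochs: float, num_workers: int) -> int:
    """Divide an epoch budget across workers, rounding up.

    Hypothetical helper: `num_workers` stands in for hvd.size().
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return int(math.ceil(total_epochs / num_workers))


if __name__ == "__main__":
    # Compare the old budget (12.0) with the new one (16.0).
    for workers in (1, 4, 8):
        old = epochs_per_run(12.0, workers)
        new = epochs_per_run(16.0, workers)
        print(f"workers={workers}: old={old} epochs, new={new} epochs")
```

Assuming one Horovod process per Slurm task, a job with 8 tasks would train for 2 epochs under the new budget, and a job with 4 tasks for 4 epochs.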

--- a/horovod/keras/mnist_advanced.py
+++ b/horovod/keras/mnist_advanced.py
@@ -36,7 +36,7 @@ num_classes = 10

 # Enough epochs to demonstrate learning rate warmup and the reduction of
 # learning rate when training plateaues.
-epochs = 12
+epochs = 16

 # Input image dimensions
 img_rows, img_cols = 28, 28

--- a/horovod/keras/submit_job_jureca_python3.sh
+++ b/horovod/keras/submit_job_jureca_python3.sh
@@ -2,23 +2,21 @@

 # Slurm job configuration
 #SBATCH --nodes=2
-#SBATCH --ntasks=4
-#SBATCH --ntasks-per-node=2
+#SBATCH --ntasks=8
+#SBATCH --ntasks-per-node=4
 #SBATCH --output=output_%j.out
 #SBATCH --error=error_%j.er
 #SBATCH --time=00:10:00
 #SBATCH --job-name=HOROVOD_KERAS_MNIST
-#SBATCH --gres=gpu:2 --partition=develgpus
+#SBATCH --gres=gpu:4 --partition=develgpus
 #SBATCH --mail-type=ALL

 # Load the required modules
-module use /usr/local/software/jureca/OtherStages
-module load Stages/2018b
-module load GCC/7.3.0
-module load MVAPICH2/2.3-GDR
-module load TensorFlow/1.12.0-GPU-Python-3.6.6
-module load Keras/2.2.4-GPU-Python-3.6.6
-module load Horovod/0.15.2-GPU-Python-3.6.6
+module load GCC/8.3.0
+module load MVAPICH2/2.3.1-GDR
+module load TensorFlow/1.13.1-GPU-Python-3.6.8
+module load Keras/2.2.4-GPU-Python-3.6.8
+module load Horovod/0.16.2-GPU-Python-3.6.8

 # Run the program
 srun python -u mnist.py
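
The updated Slurm header changes three related values together: `--ntasks`, `--ntasks-per-node`, and `--gres=gpu:N`. The sketch below is one possible way to check that such a header is internally consistent. It is not part of the repository. The function names are hypothetical, and the rule that each task gets exactly one GPU is an assumption about how these Horovod samples are meant to be launched, not something Slurm itself requires.

```python
import re


def parse_sbatch(script: str) -> dict:
    """Collect --key=value options from the #SBATCH lines of a job script."""
    options = {}
    for line in script.splitlines():
        if line.startswith("#SBATCH"):
            # A single #SBATCH line may carry several options.
            for key, value in re.findall(r"--([\w-]+)=(\S+)", line):
                options[key] = value
    return options


def layout_is_consistent(options: dict) -> bool:
    """Check ntasks == nodes * ntasks-per-node and, by assumption, one GPU per task."""
    nodes = int(options["nodes"])
    ntasks = int(options["ntasks"])
    per_node = int(options["ntasks-per-node"])
    gpus_per_node = int(options["gres"].split(":")[-1])
    return ntasks == nodes * per_node and per_node == gpus_per_node


# Header values copied from the updated job script in this commit.
NEW_HEADER = """\
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4 --partition=develgpus
"""

if __name__ == "__main__":
    print(layout_is_consistent(parse_sbatch(NEW_HEADER)))
```

A header that mixed old and new values, for example `--ntasks=8` with `--ntasks-per-node=2` on two nodes, would fail this check.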
--- a/horovod/keras/submit_job_jureca_python2.sh
+++ b/horovod/keras/submit_job_jureca_python2.sh
-#!/usr/bin/env bash
-
-# Slurm job configuration
-#SBATCH --nodes=2
-#SBATCH --ntasks=4
-#SBATCH --ntasks-per-node=2
-#SBATCH --output=output_%j.out
-#SBATCH --error=error_%j.er
-#SBATCH --time=00:10:00
-#SBATCH --job-name=HOROVOD_KERAS_MNIST
-#SBATCH --gres=gpu:2 --partition=develgpus
-#SBATCH --mail-type=ALL
-
-# Load the required modules
-module use /usr/local/software/jureca/OtherStages
-module load Stages/2018b
-module load GCC/7.3.0
-module load MVAPICH2/2.3-GDR
-module load TensorFlow/1.12.0-GPU-Python-2.7.15
-module load Keras/2.2.4-GPU-Python-2.7.15
-module load Horovod/0.15.2-GPU-Python-2.7.15
-
-# Run the program
-srun python -u mnist.py
--- a/horovod/keras/submit_job_juron_python3.sh
+++ b/horovod/keras/submit_job_juron_python3.sh
--- a/horovod/keras/submit_job_juron_python2.sh
+++ b/horovod/keras/submit_job_juron_python2.sh
-#!/usr/bin/env bash
-
-#BSUB -q normal
-#BSUB -W 10
-#BSUB -n 4
-#BSUB -R "span[ptile=2]"
-#BSUB -gpu "num=2"
-#BSUB -e "error.%J.er"
-#BSUB -o "output_%J.out"
-#BSUB -J HOROVOD_KERAS_MNIST
-
-# Load the required modules
-module load python/2.7.14
-module load tensorflow/1.12.0-gcc_5.4.0-cuda_10.0.130
-module load horovod/0.15.2
-module load keras/2.2.4
-
-# Run the program
-mpirun -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
-        -x PATH -mca pml ob1 -mca btl ^openib python -u mnist.py
--- a/horovod/pytorch/.submit_job_jureca_python3.sh
+++ b/horovod/pytorch/.submit_job_jureca_python3.sh
--- a/horovod/pytorch/.submit_job_jureca_python2.sh
+++ b/horovod/pytorch/.submit_job_jureca_python2.sh
-#!/usr/bin/env bash
-
-# Slurm job configuration
-#SBATCH --nodes=2
-#SBATCH --ntasks=4
-#SBATCH --ntasks-per-node=2
-#SBATCH --output=output_%j.out
-#SBATCH --error=error_%j.er
-#SBATCH --time=00:10:00
-#SBATCH --job-name=HOROVOD_PYTORCH_MNIST
-#SBATCH --gres=gpu:2 --partition=develgpus
-#SBATCH --mail-type=ALL
-
-# Load the required modules
-module load GCC/7.3.0
-module load MVAPICH2/2.3-GDR
-module load PyTorch/1.0.0-GPU-Python-2.7.15
-module load torchvision/0.2.1-GPU-Python-2.7.15
-module load Horovod/0.15.2-GPU-Python-2.7.15
-
-# Run the program
-srun python -u mnist.py
--- a/horovod/pytorch/submit_job_juron_python3.sh
+++ b/horovod/pytorch/submit_job_juron_python3.sh
--- a/horovod/tensorflow/submit_job_jureca_python3.sh
+++ b/horovod/tensorflow/submit_job_jureca_python3.sh
@@ -2,22 +2,21 @@

 # Slurm job configuration
 #SBATCH --nodes=2
-#SBATCH --ntasks=4
-#SBATCH --ntasks-per-node=2
+#SBATCH --ntasks=8
+#SBATCH --ntasks-per-node=4
 #SBATCH --output=output_%j.out
 #SBATCH --error=error_%j.er
 #SBATCH --time=00:10:00
 #SBATCH --job-name=HOROVOD_TFLOW_MNIST
-#SBATCH --gres=gpu:2 --partition=develgpus
+#SBATCH --gres=gpu:4 --partition=develgpus
 #SBATCH --mail-type=ALL

 # Load the required modules
-module use /usr/local/software/jureca/OtherStages
-module load Stages/2018b
-module load GCC/7.3.0
-module load MVAPICH2/2.3-GDR
-module load TensorFlow/1.12.0-GPU-Python-3.6.6
-module load Horovod/0.15.2-GPU-Python-3.6.6
+module load GCC/8.3.0
+module load MVAPICH2/2.3.1-GDR
+module load TensorFlow/1.13.1-GPU-Python-3.6.8
+module load Keras/2.2.4-GPU-Python-3.6.8
+module load Horovod/0.16.2-GPU-Python-3.6.8

 # Run the program
 srun python -u mnist.py
--- a/horovod/tensorflow/submit_job_jureca_python2.sh
+++ b/horovod/tensorflow/submit_job_jureca_python2.sh
-#!/usr/bin/env bash
-
-# Slurm job configuration
-#SBATCH --nodes=2
-#SBATCH --ntasks=4
-#SBATCH --ntasks-per-node=2
-#SBATCH --output=output_%j.out
-#SBATCH --error=error_%j.er
-#SBATCH --time=00:10:00
-#SBATCH --job-name=HOROVOD_TFLOW_MNIST
-#SBATCH --gres=gpu:2 --partition=develgpus
-#SBATCH --mail-type=ALL
-
-# Load the required modules
-module use /usr/local/software/jureca/OtherStages
-module load Stages/2018b
-module load GCC/7.3.0
-module load MVAPICH2/2.3-GDR
-module load TensorFlow/1.12.0-GPU-Python-2.7.15
-module load Horovod/0.15.2-GPU-Python-2.7.15
-
-# Run the program
-srun python -u mnist.py
--- a/horovod/tensorflow/submit_job_juron_python3.sh
+++ b/horovod/tensorflow/submit_job_juron_python3.sh
--- a/horovod/tensorflow/submit_job_juron_python2.sh
+++ b/horovod/tensorflow/submit_job_juron_python2.sh
-#!/usr/bin/env bash
-
-#BSUB -q normal
-#BSUB -W 10
-#BSUB -n 4
-#BSUB -R "span[ptile=2]"
-#BSUB -gpu "num=2"
-#BSUB -e "error.%J.er"
-#BSUB -o "output_%J.out"
-#BSUB -J HOROVOD_TFLOW_MNIST
-
-# Load the required modules
-module load python/2.7.14
-module load tensorflow/1.12.0-gcc_5.4.0-cuda_10.0.130
-module load horovod/0.15.2
-
-# Run the program
-mpirun -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
-        -x PATH -mca pml ob1 -mca btl ^openib python -u mnist.py
--- a/keras/submit_job_jureca_python3.sh
+++ b/keras/submit_job_jureca_python3.sh
@@ -12,11 +12,9 @@
 #SBATCH --mail-type=ALL

 # Load the required modules
-module use /usr/local/software/jureca/OtherStages
-module load Stages/2018b
-module load GCC/7.3.0
-module load TensorFlow/1.12.0-GPU-Python-3.6.6
-module load Keras/2.2.4-GPU-Python-3.6.6
+module load GCCcore/.8.3.0
+module load TensorFlow/1.13.1-GPU-Python-3.6.8
+module load Keras/2.2.4-GPU-Python-3.6.8

 # Run the program
 srun python -u mnist.py
--- a/keras/submit_job_jureca_python2.sh
+++ b/keras/submit_job_jureca_python2.sh
-#!/usr/bin/env bash
-
-# Slurm job configuration
-#SBATCH --nodes=1
-#SBATCH --ntasks=1
-#SBATCH --ntasks-per-node=1
-#SBATCH --output=output_%j.out
-#SBATCH --error=error_%j.er
-#SBATCH --time=00:10:00
-#SBATCH --job-name=KERAS_MNIST_CNN
-#SBATCH --gres=gpu:1 --partition=develgpus
-#SBATCH --mail-type=ALL
-
-# Load the required modules
-module use /usr/local/software/jureca/OtherStages
-module load Stages/2018b
-module load GCC/7.3.0
-module load TensorFlow/1.12.0-GPU-Python-2.7.15
-module load Keras/2.2.4-GPU-Python-2.7.15
-
-# Run the program
-srun python -u mnist.py
--- a/keras/submit_job_juron_python3.sh
+++ b/keras/submit_job_juron_python3.sh
--- a/keras/submit_job_juron_python2.sh
+++ b/keras/submit_job_juron_python2.sh
-#!/usr/bin/env bash
-
-#BSUB -q normal
-#BSUB -W 10
-#BSUB -n 1
-#BSUB -R "span[ptile=1]"
-#BSUB -gpu "num=1"
-#BSUB -e "error.%J.er"
-#BSUB -o "output_%J.out"
-#BSUB -J KERAS_MNIST_CNN
-
-# Load the required modules
-module load python/2.7.14
-module load tensorflow/1.12.0-gcc_5.4.0-cuda_10.0.130
-module load keras/2.2.4
-
-# Run the program
-python -u mnist.py
--- a/pytorch/.submit_job_juwels_python3.sh
+++ b/pytorch/.submit_job_juwels_python3.sh