Commit 1738a3cd authored by Fahad Khalid

Updated job submission scripts for all systems after testing. Added a setup script for JURON.

parent af1fefe5
@@ -4,21 +4,30 @@ Please see the main docstring in each program for details.
 # Notes
-The `mnist_data_distributed.py` program requires the [`hpc4ns.distribution`](
+On JURECA and JUWELS, the `mnist_data_distributed.py` program requires the [`hpc4ns.distribution`](
 https://gitlab.version.fz-juelich.de/hpc4ns/hpc4ns_utils#1-hpc4nsdistribution)
-module for distribution of training data filenames across multiple ranks.
-Please follow the steps below to install the required package.
+module for distribution of training data filenames across multiple ranks. On JURON, multiple additional
+package are required. Please follow the steps below to setup the environment before submitting the
+training job.
+Note that a maximum of eight ranks can be used to run `mnist_data_distributed.py`, as there
+are eight training files.
+## JURECA and JUWELS
 1. Change to the source directory for this sample, i.e., to `dl_on_supercomputers/horovod_data_distributed`
-2. Load the system-wide Python module.
-* On JURECA and JUWELS: `module load Python/3.6.8`
-* On JURON: `module load Python/3.6.1`
+2. Load the system-wide Python module: `module load Python/3.6.8`
 3. Install the `hpc4ns` package:
 `pip install --user git+https://gitlab.version.fz-juelich.de/hpc4ns/hpc4ns_utils.git`
-The job can be submitted once the `hpc4ns` package is installed.
-**Note:** A maximum of eight ranks can be used to run `mnist_data_distributed.py`, as there
-are eight training files.
\ No newline at end of file
+4. Submit the job
+## JURON
+1. Change to the source directory for this sample, i.e., to `dl_on_supercomputers/horovod_data_distributed`
+2. Setup a Python virtual environment with the required packages (may take upto 5 minutes): `./setup_juron.sh`
+3. Submit the job: `bsub < submit_job_juron.sh`
+**Note:** The setup is required only once. Unless you explicitly remove the virtual environment, the same
+setup can be used to run the example multiple times.
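The README's eight-rank limit follows from how the training files are shared out: with eight files, a ninth rank would have nothing to read. The sketch below illustrates this with a simple round-robin split. It is not the `hpc4ns.distribution` API; the function name `assign_files` and the file names are hypothetical, chosen only for illustration.

```python
def assign_files(filenames, num_ranks, rank):
    """Return the files a given rank would read under a round-robin split.

    Hypothetical helper for illustration; the actual hpc4ns.distribution
    module may use a different strategy and interface.
    """
    if num_ranks < 1 or not 0 <= rank < num_ranks:
        raise ValueError("rank must satisfy 0 <= rank < num_ranks")
    # Rank r takes files r, r + num_ranks, r + 2 * num_ranks, ...
    return filenames[rank::num_ranks]


# Eight placeholder names standing in for the eight training files.
training_files = [f"train_{i}.npz" for i in range(8)]

for num_ranks in (4, 8, 9):
    shares = [assign_files(training_files, num_ranks, r) for r in range(num_ranks)]
    idle_ranks = sum(1 for share in shares if not share)
    print(f"{num_ranks} ranks: {idle_ranks} rank(s) without data")
```

With four or eight ranks every rank receives at least one file; with nine ranks one rank is left without data, which is why the job should not request more than eight ranks.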
#!/usr/bin/env bash
# Load the Python module
module load python/3.6.1
# Create a virtual environment
python -m venv venv_dl_hpc4ns
# Activate the virtual environment
source venv_dl_hpc4ns/bin/activate
# Upgrade pip and setuptools
pip install -U pip setuptools
# Install mpi4py
env MPICC=/gpfs/software/opt/openmpi/3.1.2-gcc_5.4.0-cuda_10.0.130/bin/mpicc pip install mpi4py
# Install six
pip install six
# Install hpc4ns
pip install git+https://gitlab.version.fz-juelich.de/hpc4ns/hpc4ns_utils.git
printf "%s\n" "Setup complete."
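After the setup script has run, a quick check along the following lines could confirm that the packages it installs are visible from the activated virtual environment. This is a sketch, not part of the commit: the helper name `missing_packages` is hypothetical, and the import names `mpi4py`, `six`, and `hpc4ns` are assumed from the `pip install` commands above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed from the packages installed by the setup script.
    required = ["mpi4py", "six", "hpc4ns"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with the interpreter from `venv_dl_hpc4ns` after `source venv_dl_hpc4ns/bin/activate`; a check against the system Python would not reflect what the job sees.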
@@ -13,7 +13,8 @@
 # Load the required modules
 module load GCC/8.3.0
-module load MVAPICH2/2.3.1-GDR
+module load MVAPICH2/2.3.2-GDR
+module load mpi4py/3.0.1-Python-3.6.8
 module load TensorFlow/1.13.1-GPU-Python-3.6.8
 module load Horovod/0.16.2-GPU-Python-3.6.8
......
@@ -2,7 +2,7 @@
 #BSUB -q normal
 #BSUB -W 10
-#BSUB -n 8
+#BSUB -n 4
 #BSUB -R "span[ptile=4]"
 #BSUB -gpu "num=4"
 #BSUB -e "error.%J.er"
@@ -14,6 +14,9 @@ module load python/3.6.1
 module load tensorflow/1.12.0-gcc_5.4.0-cuda_10.0.130
 module load horovod/0.15.2
+# Activate the virtual environment
+source venv_dl_hpc4ns/bin/activate
 # Run the program
 mpirun -bind-to none \
 -map-by slot \
......
@@ -13,7 +13,8 @@
 # Load the required modules
 module load GCC/8.3.0
-module load MVAPICH2/2.3.1-GDR
+module load MVAPICH2/2.3.2-GDR
+module load mpi4py/3.0.1-Python-3.6.8
 module load TensorFlow/1.13.1-GPU-Python-3.6.8
 module load Horovod/0.16.2-GPU-Python-3.6.8
......