Introduction
Please see the main docstring in each program for details.
Notes
On JURECA and JUWELS, the mnist_data_distributed.py program requires the hpc4ns.distribution
module to distribute the training data filenames across multiple ranks. On JURON, several additional
packages are required. Please follow the steps below to set up the environment before submitting the
training job.
Note that a maximum of eight ranks can be used to run mnist_data_distributed.py, as there
are eight training files.
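To illustrate the limit, here is a minimal sketch of one plausible way to split filenames across ranks, a simple round-robin assignment. It is not the hpc4ns.distribution API: the function files_for_rank and the file names are hypothetical and only serve the illustration. With eight files, any rank beyond the eighth would receive no data.

```python
def files_for_rank(filenames, rank, num_ranks):
    """Return the files assigned to `rank` under a round-robin split.

    Hypothetical helper for illustration; not part of hpc4ns.
    """
    if not 0 <= rank < num_ranks:
        raise ValueError("rank must be in [0, num_ranks)")
    # Rank r takes files r, r + num_ranks, r + 2 * num_ranks, ...
    return filenames[rank::num_ranks]


# Hypothetical file names; the actual training files may be named differently.
training_files = [f"train_{i}.npz" for i in range(8)]

# With 4 ranks, every rank is assigned two files.
for rank in range(4):
    print(rank, files_for_rank(training_files, rank, 4))

# With 9 ranks, the ninth rank (index 8) is assigned no files at all.
print(8, files_for_rank(training_files, 8, 9))
```

In this sketch each file is assigned to exactly one rank, so using more ranks than files leaves the surplus ranks idle.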
JURECA and JUWELS
- Change to the source directory for this sample, i.e., to
dl_on_supercomputers/horovod_data_distributed
- Load the system-wide Python module:
module load Python/3.6.8
- Install the hpc4ns package:
pip install --user git+https://gitlab.version.fz-juelich.de/hpc4ns/hpc4ns_utils.git
- Submit the job
JURON
- Change to the source directory for this sample, i.e., to
dl_on_supercomputers/horovod_data_distributed
- Set up a Python virtual environment with the required packages (may take up to 5 minutes):
./setup_juron.sh
- Submit the job:
bsub < submit_job_juron.sh
Note: The setup is required only once. Unless you explicitly remove the virtual environment, the same setup can be used to run the example multiple times.