horovod_data_distributed

Last commit d6dcba6c, authored by Fahad Khalid

Introduction

Please see the main docstring in each program for details.

Notes

On JURECA and JUWELS, the mnist_data_distributed.py program requires the hpc4ns.distribution module, which distributes the training data filenames across multiple ranks. On JURON, several additional packages are required. Please follow the steps below to set up the environment before submitting the training job.

Note that a maximum of eight ranks can be used to run mnist_data_distributed.py, as there are eight training files.
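
The rank limit follows from splitting the list of files across ranks: with eight files, a ninth rank would receive nothing. The sketch below illustrates the idea with a simple round-robin split. It is not the hpc4ns.distribution implementation; the function shard_filenames and the file names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def shard_filenames(filenames: Sequence[str], rank: int, num_ranks: int) -> List[str]:
    """Return the files assigned to `rank` in a round-robin split.

    Hypothetical helper for illustration; not part of hpc4ns.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    if not 0 <= rank < num_ranks:
        raise ValueError("rank must satisfy 0 <= rank < num_ranks")
    if num_ranks > len(filenames):
        # More ranks than files would leave some ranks without data.
        raise ValueError("num_ranks must not exceed the number of files")
    # Rank r takes files r, r + num_ranks, r + 2 * num_ranks, ...
    return list(filenames[rank::num_ranks])


# Assumed file names; the real training files may be named differently.
train_files = [f"train_{i}.npz" for i in range(8)]

if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_filenames(train_files, rank, 4))
```

In a Horovod program, the rank and the number of ranks would typically come from hvd.rank() and hvd.size() after hvd.init().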

JURECA and JUWELS

  1. Change to the source directory for this sample, i.e., to dl_on_supercomputers/horovod_data_distributed

  2. Load the system-wide Python module: module load Python/3.6.8

  3. Install the hpc4ns package:

    pip install --user git+https://gitlab.version.fz-juelich.de/hpc4ns/hpc4ns_utils.git

  4. Submit the job

JURON

  1. Change to the source directory for this sample, i.e., to dl_on_supercomputers/horovod_data_distributed
  2. Set up a Python virtual environment with the required packages (may take up to 5 minutes): ./setup_juron.sh
  3. Submit the job: bsub < submit_job_juron.sh

Note: The setup is required only once. Unless you explicitly remove the virtual environment, the same setup can be used to run the example multiple times.