... | @@ -75,7 +75,7 @@ If everything runs fine, you can change the paths in the `sbatch` scripts accord |
... | @@ -75,7 +75,7 @@ If everything runs fine, you can change the paths in the `sbatch` scripts accord |
|
|
|
|
|
### Queue Training - Horovod
|
|
### Queue Training - Horovod
|
|
|
|
|
|
Example: Run on a 4xV100 dev instance (2 hour time limit) on JUWELS.
|
|
Example: Run on a 4xV100 dev instance (2 hour time limit) on JUWELS.
|
|
`--flops_profiler` will stop training after 200 steps.
|
|
`--flops_profiler` will stop training after 200 steps.
|
|
Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first.
|
|
Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first.
|
|
Change remaining parameters as needed.
|
|
Change remaining parameters as needed.
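
As a rough sketch of the "fill out first" step, a small guard can refuse to continue while any of the three values is still empty. This assumes `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` are ordinary shell variables in the `sbatch` script; `require_vars` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not part of the repository): returns non-zero
# if any of the named shell variables is unset or empty.
require_vars() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Placed near the top of the script, `require_vars HOME_PATH CHECKPOINT_NAME LOGS_PATH || exit 1` would abort the job early instead of failing partway through training.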
|
... | @@ -156,6 +156,11 @@ Depending on current load/priority, it may take a few minutes before your job is |
... | @@ -156,6 +156,11 @@ Depending on current load/priority, it may take a few minutes before your job is |
|
tmux-split-pane: `Ctrl-b "`
|
|
tmux-split-pane: `Ctrl-b "`
|
|
`watch squeue --user $USER # check queue status every 2 s`
|
|
`watch squeue --user $USER # check queue status every 2 s`
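
If the default `squeue` columns are too terse, a wrapper with an explicit format string may be easier to read. The following is a minimal sketch; `queue_status` is a hypothetical name, and the column widths are arbitrary choices.

```shell
# Hypothetical wrapper: list jobs for a user (default: $USER) showing
# job id, name, state, elapsed time, node count, and reason/nodelist.
queue_status() {
    squeue --user "${1:-$USER}" --format "%.10i %.20j %.8T %.10M %.6D %R"
}
```

Calling `queue_status` once prints a snapshot. To refresh it periodically as with `watch`, the same `squeue` invocation can be passed to `watch` directly.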
|
|
|
|
|
|
|
|
### Distributed Training - Horovod
|
|
|
|
To run distributed training across multiple nodes (each node having up to 4 GPUs), see the [script collection containing Horovod examples](https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/projects/large_scale_reproducing/dall-e/dalle-pytorch/-/tree/fzj/scripts).
|
|
|
|
|
|
|
|
On JUWELS Booster, for instance, adapt `juwelsbooster.sh` to the desired number of nodes. For an example run, use `juwelsbooster.sh run_cub.sh`, adapting `run_cub.sh` accordingly.
|
|
|
|
|
|
### Queue Training - DeepSpeed
|
|
### Queue Training - DeepSpeed
|
|
|
|
|
|
Run 200 steps across 4xV100 using DeepSpeed (zero-optimization disabled)
|
|
Run 200 steps across 4xV100 using DeepSpeed (zero-optimization disabled)
|
... | @@ -260,4 +265,4 @@ deepspeed_config = { |
... | @@ -260,4 +265,4 @@ deepspeed_config = { |
|
|
|
|
|
## Monitoring a Job
|
|
## Monitoring a Job
|
|
|
|
|
|
You can attach interactively to a running job via `srun --pty --jobid <job-id> bash`. Once attached, you can, for example, inspect GPU usage with `nvidia-smi`. |
|
You can attach interactively to a running job via `srun --pty --jobid <job-id> bash`. Once attached, you can, for example, inspect GPU usage with `nvidia-smi`. |
|
\ No newline at end of file |
|
|
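
For a one-off check without opening an interactive shell, `nvidia-smi` can also be run directly as a job step. This is a sketch under assumptions: `gpu_usage` is a hypothetical helper name, and depending on the Slurm version and site configuration, extra `srun` options may be needed to share resources with the running step.

```shell
# Hypothetical helper: print per-GPU utilization and memory use
# for a running job. $1 is the Slurm job id.
gpu_usage() {
    srun --jobid "$1" nvidia-smi \
        --query-gpu=index,utilization.gpu,memory.used,memory.total \
        --format=csv
}
```

For example, `gpu_usage <job-id>` prints one CSV row per GPU visible to the job step.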