## Advanced Configuration

### Queue Training – Horovod
Example: run on a 4xV100 dev instance (2-hour time limit) on JUWELS. The `--flops_profiler` flag will stop training after 200 steps. Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first, and change the remaining parameters as needed.
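The ready-made `sbatch` script lives in the locations mentioned further below; purely as orientation, a minimal sketch of such a dev-instance script might look like the following. The account, partition, module setup, and paths are placeholders or assumptions, and every training flag except `--flops_profiler` is illustrative (flag names follow the upstream `train_dalle.py` CLI; adjust if your version differs).

```
#!/usr/bin/env bash
# Sketch only -- values marked <like-this> are placeholders; account/partition are assumptions.
#SBATCH --account=<your-compute-budget>
#SBATCH --partition=develgpus      # JUWELS dev partition with 4x V100 per node (assumption)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4        # one rank per GPU
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00            # dev instances are capped at 2 hours
#SBATCH --output=dalle-%j.out

# Fill these out first
HOME_PATH=<your-home-or-scratch-path>
CHECKPOINT_NAME=<checkpoint-file-name>
LOGS_PATH=<directory-for-logs>

# Load/activate your Python + CUDA environment here.

# --flops_profiler stops training after 200 steps
srun python "${HOME_PATH}/DALLE-pytorch/train_dalle.py" \
    --image_text_folder <path-to-your-dataset> \
    --dalle_output_file_name "${CHECKPOINT_NAME}" \
    --flops_profiler \
    > "${LOGS_PATH}/train-${SLURM_JOB_ID}.log" 2>&1
```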
Depending on current load/priority, it may take a few minutes before your job is scheduled. To keep an eye on it:

1. Split your tmux window into two panes: `Ctrl-b "`
2. In the new pane, run `watch squeue --user $USER` to check the queue status every 2 s (see the snippet below).
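The same two steps as a copy-pasteable snippet, assuming you are already inside a tmux session:

```
# open a second pane below the current one (equivalent to Ctrl-b ")
tmux split-window -v

# in the new pane: refresh the queue view every 2 s (watch's default interval)
watch -n 2 squeue --user "$USER"
```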
### Distributed Training – Horovod
To conduct distributed training across many nodes (each with up to 4 GPUs), have a look at the script collection [containing examples using Horovod](https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/projects/large_scale_reproducing/dall-e/dalle-pytorch/-/tree/fzj/scripts).
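Those linked scripts are the authoritative reference; for orientation only, the resource-request part of a multi-node Horovod job might look roughly like this (node count, partition, account, and the launch line are assumptions):

```
#SBATCH --account=<your-compute-budget>
#SBATCH --partition=gpus           # regular JUWELS GPU partition (assumption)
#SBATCH --nodes=8                  # scale up or down as needed
#SBATCH --ntasks-per-node=4        # one Horovod rank per GPU
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00

# srun launches nodes x 4 ranks; each rank initializes Horovod.
# The exact training flags and environment setup are in the linked scripts.
srun python train_dalle.py --image_text_folder <path-to-your-dataset>
```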
Further up-to-date `sbatch` scripts (prefixed with `hv`) are in `/p/scratch/ccstdl/ebert1/dalle`. You can copy them into your local DALLE-pytorch clone:

```
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
### Queue Training – DeepSpeed
There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle` (`run_dalle.sbatch`, `run_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:
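For example, mirroring the Horovod copy step above (adjust the destination to wherever your clone lives):

```
cp /p/scratch/ccstdl/ebert1/dalle/run_dalle.sbatch \
   /p/scratch/ccstdl/ebert1/dalle/run_vae.sbatch \
   ~/$USER/DALLE-pytorch
```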