@@ -100,6 +100,8 @@ You can interactively attach to a running job via `srun --pty --jobid <job-id> b
### Queue Training – Horovod
HINT: This information is outdated. Do not use these scripts on JUWELS Booster or JURECA; `develgpus` is a partition that exists on the older JUWELS machine equipped with V100 GPUs.

Example: Run on a 4xV100 dev instance (2-hour time limit) on JUWELS. `--flops_profiler` will stop training after 200 steps. Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first, then change the remaining parameters as needed.

Save the following to a file named `horovod_dalle.sbatch`: