### Queue Training – DeepSpeed

HINT: this section is largely outdated (2021) and is kept here for archival purposes.

There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle` (`run_dalle.sbatch`, `run_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
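
Once copied, the scripts can be submitted to the queue with Slurm's `sbatch` command. A minimal sketch, assuming the scripts work unmodified from the clone directory:

```shell
cd ~/$USER/DALLE-pytorch
sbatch run_dalle.sbatch   # submit the DALL-E training job to the queue
squeue -u $USER           # check the job's status in the queue
```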

An example of a run (HINT: this info is outdated; do not use the script on JUWELS Booster or JURECA, since `develgpus` is a partition that exists only on the older JUWELS machine) – on two nodes across 8xV100 using DeepSpeed (ZeRO optimization disabled):

```shell
...
```
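
For orientation, a hypothetical sketch of what such a two-node 8xV100 DeepSpeed job script could look like; the partition, time limit, environment setup, and training flags below are assumptions for illustration, not the original script:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4       # JUWELS GPU nodes have 4x V100, 8 GPUs in total
#SBATCH --gres=gpu:4
#SBATCH --partition=develgpus     # partition that only exists on the older JUWELS
#SBATCH --time=02:00:00

# Hypothetical environment setup; the actual modules/venv will differ.
source venv/bin/activate

# DALLE-pytorch's train_dalle.py accepts --deepspeed to select the DeepSpeed
# backend; ZeRO behaviour is controlled through the DeepSpeed configuration.
# The dataset path is a placeholder.
srun python train_dalle.py --deepspeed \
    --image_text_folder /p/scratch/ccstdl/$USER/dataset
```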