There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle`. Copy them into your `DALLE-pytorch` directory:

```sh
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
## Train DALLE-pytorch

### Choose a Variational Autoencoder

The VAE is responsible for representing images efficiently: it is pretrained to compress each image into a short sequence of discrete tokens, which the DALL-E transformer then models. You can use the discrete VAE released by OpenAI, one of the VQGAN flavors from Heidelberg, or train your own discrete VAE from scratch.

For example, download the checkpoint and config of the Gumbel VQGAN (f=8, 8192-entry codebook):

```sh
wget --continue http://batbot.tv/ai/models/gumbel_f8_8192.ckpt -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.ckpt
wget --continue https://heibox.uni-heidelberg.de/seafhttp/files/a5b2f0d5-bccd-4421-a9a5-864df8659560/model.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.yaml
```
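To use the downloaded VQGAN, point the training script at both files. A minimal sketch, assuming your checkout of `train_dalle.py` exposes the `--taming` switch together with `--vqgan_model_path`/`--vqgan_config_path` arguments (verify the exact flag names with `python train_dalle.py --help`):

```sh
# Hypothetical invocation; flag names depend on your DALLE-pytorch version.
python train_dalle.py \
    --taming \
    --vqgan_model_path /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.ckpt \
    --vqgan_config_path /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.yaml \
    --image_text_folder /path/to/your/dataset
```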
## Running jobs

### Queue Training

Depending on the supercomputer you are on, you have to change the `--partition` setting in the `sbatch` script you want to use. The `sinfo` command lists all partitions of the machine you are on; look out for names like `develgpus` or `develbooster`.
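
For orientation, `--partition` lives in the `#SBATCH` header at the top of each script. A sketch with placeholder values (take the real resource requests from the scripts you copied):

```sh
#!/bin/bash
#SBATCH --partition=develbooster  # placeholder; pick a partition listed by `sinfo`
#SBATCH --nodes=1                 # placeholder resource requests
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00           # dev partitions enforce short time limits
```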
Once the partitions are configured correctly, you should be able to start a DALLE-pytorch training job using `sbatch <script.sbatch>`! These jobs use an example dataset that is also located at `/p/scratch/ccstdl/ebert1/dalle`.

If everything runs fine, you can change the paths in the `sbatch` scripts to point at your own data and output locations.
## Data Parallel

### Horovod

### Queue Training - Horovod
Example: run on a 4xV100 dev instance (2-hour time limit) on JUWELS. Passing `--flops_profiler` will stop training after 200 steps.

```sh
srun -A cstdl --cpu-bind=none \
    ...
```
### DeepSpeed

### Queue Training - DeepSpeed

Run 200 steps across 4xV100 using DeepSpeed (ZeRO optimization disabled):
```sh
deepspeed train_dalle.py \
    ...
```
#### Configure DeepSpeed ZeRO Offload/Infinity
DeepSpeed allows you to increase the total parameter count of your model beyond what would typically fit on a single GPU.

With its default configuration, `DeepSpeed` behaves much like `horovod`: plain data parallelism. By modifying the configuration, DeepSpeed enables various degrees of CPU offloading for network and optimizer state, as well as clever state partitioning across multiple GPUs and/or nodes. ZeRO stage 3 is recommended for the full benefits of training on the supercomputer. Using this configuration with 16-bit precision is presently as close as DALLE-pytorch gets to the training regime used by OpenAI for DALL-E.

> Stages 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.

In order to change the DeepSpeed stage, find the Python dict named `deepspeed_config` in `train_dalle.py` and modify its `zero_optimization` entry.
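
A sketch of what that entry might look like with ZeRO stage 3 and CPU offloading enabled; treat it as an assumption rather than the canonical config, since the exact contents of `deepspeed_config` depend on your checkout of `train_dalle.py` and your DeepSpeed version (see the DeepSpeed ZeRO documentation for the full schema):

```python
deepspeed_config = {
    # ... existing entries (batch size, fp16 settings, ...) stay as they are ...
    'zero_optimization': {
        # 0 = disabled; 1 = optimizer state partitioning;
        # 2 = + gradient partitioning; 3 = + parameter partitioning
        'stage': 3,
        # ZeRO-Offload/Infinity: push optimizer state to host RAM
        'offload_optimizer': {
            'device': 'cpu',
        },
    },
}
```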