To conduct distributed training across many nodes (each node having up to 4 GPUs), adapt the provided `sbatch` scripts.

For JUWELS Booster, for instance, adapt `juwelsbooster.sh` with the desired number of nodes. For an example run, you can use `juwelsbooster.sh run_cub.sh`, adapting `run_cub.sh` accordingly.
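As a hedged sketch, assuming `juwelsbooster.sh` is submitted through `sbatch` with the run script as its argument, submission might look like the following. The `submit` wrapper is hypothetical: it falls back to a dry-run echo so the snippet is safe to try off-cluster.

```sh
# Hypothetical submission wrapper: uses sbatch when available,
# otherwise echoes the command as a dry run (safe off-cluster).
submit() {
    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$@"
    else
        echo "would submit: $*"
    fi
}

# Submit the Booster wrapper with the example run script.
submit juwelsbooster.sh run_cub.sh
```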
Further up-to-date `sbatch` scripts are in `/p/scratch/ccstdl/ebert1/dalle` (`hvd_dalle.sbatch`, `hvd_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:

```sh
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
### Queue Training - DeepSpeed
There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle` (`run_dalle.sbatch`, `run_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:

```sh
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
An example of a run on two nodes across 8xV100 GPUs using DeepSpeed (with ZeRO optimization disabled):
```sh
#!/usr/bin/env bash

#SBATCH --nodes 2
#SBATCH --tasks-per-node 4
#SBATCH --gres gpu:4
#SBATCH -A cstdl
#SBATCH --partition develbooster

DATASET_PATH=YOUR_DATA_PATH/flickr30k_images/flickr30k_images
VAE_PATH=vae.pt

module purge
module use "$OTHERSTAGES"
module load Stages/Devel-2020
module load GCC/9.3.0 OpenMPI DeepSpeed

# source env/bin/activate

# ...define vars
export WANDB_MODE=disabled

srun --cpu-bind=none \
    python -u train_dalle.py --image_text_folder "$DATASET_PATH" --deepspeed --fp16
```
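For context, "zero-optimization disabled" corresponds to ZeRO stage 0 in DeepSpeed terms. DALLE-pytorch assembles its DeepSpeed configuration internally, so the standalone JSON config below (written via a heredoc) is only an illustrative sketch of the relevant knob; the file name and contents are assumptions, not the file the script above uses:

```sh
# Illustrative only: a minimal DeepSpeed config with ZeRO disabled (stage 0).
# File name and contents are assumptions, not taken from the repo.
cat > ds_config_example.json <<'EOF'
{
    "fp16": { "enabled": true },
    "zero_optimization": { "stage": 0 }
}
EOF
cat ds_config_example.json
```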
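The `#SBATCH` directives above determine the distributed world size: 2 nodes with 4 tasks per node gives 8 ranks, one per GPU. A small sketch of that arithmetic, using the standard Slurm environment variable names (hard-coded here, since outside a job allocation Slurm does not export them):

```sh
# Mirror the allocation from the #SBATCH directives (hard-coded for
# illustration; inside a job, Slurm exports these variables automatically).
SLURM_JOB_NUM_NODES=2
SLURM_NTASKS_PER_NODE=4

# One rank per GPU: total ranks = nodes * tasks-per-node.
WORLD_SIZE=$((SLURM_JOB_NUM_NODES * SLURM_NTASKS_PER_NODE))
echo "world size: $WORLD_SIZE"
```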