@@ -20,9 +20,19 @@ There are several options for supercomputer machines to choose from. Find a [lis
 We'll now set up your environment so you can start DALLE-pytorch training runs. The setup tutorial briefly covers the JSC module system near its end. We want to use the provided modules wherever possible, as they are compiled and optimized for each cluster. Because DeepSpeed is currently an experimental package, it is not included in the default meta module. We tell the module system to look at `$OTHERSTAGES` to get additional meta modules:

 ```sh
-module use $OTHERSTAGES
-module purge
-module load Stages/Devel-2020 GCC/9.3.0 OpenMPI DeepSpeed
+
+ml purge
+ml use $OTHERSTAGES
+ml Stages/2020
+ml GCC/9.3.0
+ml OpenMPI/4.1.0rc1
+ml CUDA/11.0
+ml cuDNN/8.0.2.39-CUDA-11.0
+ml NCCL/2.8.3-1-CUDA-11.0
+ml PyTorch/1.7.0-Python-3.8.5
+ml torchvision/0.8.2-Python-3.8.5
+ml Horovod/0.20.3-Python-3.8.5
+
 ```

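Before installing anything, it can help to confirm that the module loads above actually put the expected tools on `PATH`. The following is a minimal sketch and not part of the original tutorial: the helper name `missing_commands` is hypothetical, and the commands it checks (`python`, `srun`) are simply the ones used later in this section.

```sh
# Hypothetical helper (an assumption, not from the tutorial):
# print the names of the given commands that cannot be found.
missing_commands() {
    missing=""
    for cmd in "$@"; do
        # `command -v` succeeds for executables, builtins, and shell functions.
        if ! command -v "$cmd" >/dev/null 2>&1; then
            missing="$missing $cmd"
        fi
    done
    # Strip the single leading space before printing.
    echo "${missing# }"
}

# Example: list whichever of these is still unavailable after loading modules.
missing_commands python srun
```

An empty line of output means every listed command was found; otherwise the missing names are printed separated by spaces, which usually points to a module that failed to load.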
 Now let's set up DALLE-pytorch and its dependencies. We would like to use a Python `venv` here, but virtual environments currently cause trouble with DeepSpeed, so we have to install into our user directory:
@@ -94,8 +104,17 @@ HOME_PATH=
 CHECKPOINT_NAME=
 LOGS_PATH=

-module purge
-module load Stages/2020 GCC OpenMPI PyTorch torchvision Horovod
+ml purge
+ml use $OTHERSTAGES
+ml Stages/2020
+ml GCC/9.3.0
+ml OpenMPI/4.1.0rc1
+ml CUDA/11.0
+ml cuDNN/8.0.2.39-CUDA-11.0
+ml NCCL/2.8.3-1-CUDA-11.0
+ml PyTorch/1.7.0-Python-3.8.5
+ml torchvision/0.8.2-Python-3.8.5
+ml Horovod/0.20.3-Python-3.8.5

 DATASET_PATH=/p/scratch/ccstdl/${HOME_PATH}/LAION_SAMPLE/
 VQGAN_MODEL_PATH=/p/scratch/ccstdl/${HOME_PATH}/vqgan_models/imagenet_16384_slim.ckpt
@@ -116,7 +135,7 @@ TEXT_SEQ_LEN=128
 EPOCHS=1

 export CUDA_VISIBLE_DEVICES=0,1,2,3
-srun -A cstdl --cpu-bind=none \
+srun -A cstdl --cpu-bind=v \
 python -u train_dalle.py \
 --epochs="$EPOCHS" \
 --clip_grad_norm="$CLIP_GRAD_NORM" \
@@ -199,7 +218,7 @@ module load GCC/9.3.0 OpenMPI DeepSpeed
 # ...define vars
 export WANDB_MODE=disabled

-srun --cpu-bind=none \
+srun --cpu-bind=v \
 python -u train_dalle.py --image_text_folder "$DATASET_PATH" --deepspeed --fp16

 ```