1. Sign up at [JuDoor](https://judoor.fz-juelich.de/register).
2. Follow the [Jülich supercomputer setup tutorial](https://gitlab.version.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_dl_2021/course-material/-/blob/master/tutorials/day1/tutorial1/Tutorial1.ipynb) (it contains some unrelated material, as it was written as a course introduction) to be able to SSH into the machines and start compute jobs via Slurm.
3. Your home directory is very limited in space. We thus create a personal directory at another location (`mkdir -p /p/project/ccstdl/$USER`) and link it to the home directory (`ln -s /p/project/ccstdl/$USER ~`). You should now always use `~/$USER` as your "real" home directory. In the same vein, it's also a good idea to move and link `~/.cache`: `[ -d ~/.cache ] && mv ~/.cache /p/project/ccstdl/$USER/.cache; mkdir -p /p/project/ccstdl/$USER/.cache; ln -s /p/project/ccstdl/$USER/.cache ~`.
4. Use `~/$USER` for code and create `/p/scratch/ccstdl/$USER` (`mkdir -p /p/scratch/ccstdl/$USER`) for temporary (processed) data. Directories in `/p/scratch` are completely wiped every few months, so be careful about leaving important data there.
5. If you want to submit very large datasets, join the [datasets project](https://judoor.fz-juelich.de/projects/datasets/) and have a look at `/p/largedata/datasets`. Please store each dataset in `/p/largedata` as a single file, i.e. `tar` and/or compress a large number of files into one archive (see the sketch after this list).
* For model training, put datasets into `/p/scratch/ccstdl/$USER`, as compute nodes have no access to `/p/largedata` paths; those are meant only for transferring data to the machines and for storage.
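
A minimal sketch of bundling a dataset before moving it over (`my_dataset/` is a placeholder name):

```shell
# collect many small files into a single compressed archive
tar -czf my_dataset.tar.gz my_dataset/
cp my_dataset.tar.gz /p/largedata/datasets/
```
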
## Cluster List
There are several options for supercomputer machines to choose from.
We'll now set up your environment so you can start DALLE-pytorch training runs. The setup tutorial briefly mentions the module system at JSC at the bottom. We'll want to use the provided modules as often as possible, as they are compiled and optimized for each cluster. Since DeepSpeed is currently an experimental package, it is not included in the default meta module, so we tell the module system to look in `$OTHERSTAGES` for additional meta modules:

```shell
ml purge
ml use $OTHERSTAGES
ml Stages/2020
# ... (remaining module loads elided here) ...
ml Horovod/0.20.3-Python-3.8.5
```

Now let's set up DALLE-pytorch and its dependencies. We use a Python `venv` here to separate the project Python environment from our user Python environment. If this causes trouble, try again without the `venv`, appending `--user` to the install commands, thus installing to your user directory:

```shell
cd ~/$USER
git clone https://github.com/lucidrains/DALLE-pytorch
cd DALLE-pytorch
# sketch of the elided remainder: create and activate the venv, then install
python -m venv venv
source venv/bin/activate
pip install -e .
```

For simplicity, we disable WandB in the `sbatch` scripts (see below), as we can't reach the internet from the compute nodes.

There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle`. Copy these into your local DALLE-pytorch clone:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```

### Choose a Variational Autoencoder
The VAE is responsible for representing images efficiently via pretraining. You can use a VAE released by OpenAI, the various flavors of VQGAN from Heidelberg, or you can train your own discrete VAE from scratch.
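
Which VAE gets used is selected when launching training; here is a minimal sketch, assuming the `train_dalle.py` flags at the time of writing (`$DATASET_PATH` is a placeholder):

```shell
# OpenAI's pretrained discrete VAE (the default when no VAE flag is given)
python train_dalle.py --image_text_folder "$DATASET_PATH"
# a Heidelberg VQGAN via taming-transformers
python train_dalle.py --image_text_folder "$DATASET_PATH" --taming
# a discrete VAE you trained yourself
python train_dalle.py --image_text_folder "$DATASET_PATH" --vae_path ./vae.pt
```
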
We do not have internet access from the compute nodes, so we cannot download checkpoints while our script is running. You can find common material in `/p/scratch/ccstdl/ebert1/dalle`, including downloaded checkpoints. We link these so DALLE-pytorch can find them at the expected location:

```shell
mkdir -p ~/.cache/dalle
ln -s /p/scratch/ccstdl/ebert1/dalle/checkpoints/* ~/.cache/dalle
```

You can also use arbitrary checkpoints matching the VQGAN architecture if you have a `.yaml` and `.ckpt` in the format described in <https://github.com/CompVis/taming-transformers>. For instance:

```shell
HOME_PATH=lastname1   # your personal directory name
mkdir -p /p/scratch/ccstdl/${HOME_PATH}/vqgan_models
wget --continue http://batbot.tv/ai/models/imagenet_16384_slim.ckpt -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/imagenet_16384_slim.ckpt
wget --continue http://batbot.tv/ai/models/imagenet_16384.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/imagenet_16384.yaml
```


_Please see [https://github.com/CompVis/taming-transformers](https://github.com/CompVis/taming-transformers) if any mirrors fail._

The most recent addition from Heidelberg as of this writing is the GumbelVQGAN trained on Open Images:

```shell
wget --continue http://batbot.tv/ai/models/gumbel_f8_8192.ckpt -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.ckpt
wget --continue https://heibox.uni-heidelberg.de/seafhttp/files/a5b2f0d5-bccd-4421-a9a5-864df8659560/model.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.yaml
```

### Queue Training
Depending on the supercomputer you are on, you have to change the `--partition` in the `sbatch` script you want to use. The `sinfo` command lists all partitions for the machine you are on; look out for names like `develgpus` or `develbooster`. Once the partitions are configured correctly, you should be able to start a DALLE-pytorch training job using `sbatch <script.sbatch>`! These will use an example dataset also located at `/p/scratch/ccstdl/ebert1/dalle`.
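
For example, a quick sketch of checking the partitions (output columns vary per site):

```shell
sinfo -s    # one summary line per partition, with node counts and time limits
# then set the matching name in your script, e.g.:
#   #SBATCH --partition=develbooster
```
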
If everything runs fine, you can change the paths in the `sbatch` scripts according to your locations.
### Queue Training - Horovod

Example: run on a 4x V100 dev instance (2 hour time limit) on JUWELS. `--flops_profiler` will stop training after 200 steps. Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first, and change the remaining parameters as needed.

Save the following to a file `horovod_dalle.sbatch`. The middle of the script is elided here; the resource directives and variable definitions below are a sketch based on the description above, so adjust them to your setup:

```shell
#!/usr/bin/env bash
# horovod_dalle.sbatch
#SBATCH --nodes 1
#SBATCH --partition=develgpus   # sketch: pick your dev partition via `sinfo`
#SBATCH --gres=gpu:4            # sketch: 4x V100 dev node
#SBATCH --time=02:00:00         # sketch: the dev time limit is 2 hours

# fill these out first (the LOGFILE naming is a sketch)
HOME_PATH=
CHECKPOINT_NAME=
LOGS_PATH=
LOGFILE="${LOGS_PATH}/dalle-horovod-${SLURM_JOB_ID}.log"

# NOTE: the original script passes further training arguments here
srun -A cstdl --cpu-bind=v \
    python -u train_dalle.py \
    --truncate_captions \
    --flops_profiler \
    --distributed_backend="horovod" | tee "$LOGFILE"
```

Now you can queue the job:

```shell
tmux
sbatch horovod_dalle.sbatch
```


Depending on current load/priority, it may take a few minutes before your job starts. To check its current status, split the tmux pane with `Ctrl-b "` and run `watch squeue --user $USER` (this checks the queue status every 2 s).

### Distributed Training - Horovod
To conduct distributed training across many nodes (each node having up to 4 GPUs), have a look at the [script collection containing examples using Horovod](https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/projects/large_scale_reproducing/dall-e/dalle-pytorch/-/tree/fzj/scripts).

For JUWELS Booster, for instance, adapt `juwelsbooster.sh` with the desired number of nodes. For an example run, you can use `juwelsbooster.sh run_cub.sh`, adapting `run_cub.sh` accordingly.

Further up-to-date `sbatch` scripts are in `/p/scratch/ccstdl/ebert1/dalle` (`hvd_dalle.sbatch`, `hvd_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```

### Distributed Training - DeepSpeed

There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle` (`run_dalle.sbatch`, `run_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```


An example run on two nodes across 8x V100, using DeepSpeed with ZeRO optimization disabled. The elided resource directives and dataset path are sketched here:

```shell
#!/usr/bin/env bash
#SBATCH --nodes 2
#SBATCH --gres=gpu:4   # sketch: 4 GPUs per node, 8x V100 in total

DATASET_PATH=          # sketch: fill in your dataset location
export WANDB_MODE=disabled

srun --cpu-bind=v \
    python -u train_dalle.py --image_text_folder "$DATASET_PATH" --deepspeed --fp16
```

#### Configure DeepSpeed ZeRO Offload/Infinity

By default, `DeepSpeed` and `horovod` behave very similarly. By modifying the DeepSpeed configuration, however, you can enable the ZeRO optimizations:

> Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer + gradient state partitioning, and optimizer + gradient + parameter partitioning, respectively.
In order to change the DeepSpeed stage, find the Python dict named `deepspeed_config` in `train_dalle.py` and modify it as follows:
- Stage 1: Optimizer State Partitioning

```python
deepspeed_config = {
    "zero_optimization": {
        # sketch of the elided body: select ZeRO stage 1
        "stage": 1,
    },
}
```


- Stage 2 (aka ZeRO-Offload): Optimizer + Gradient State Partitioning. Offloads the optimizer (e.g. Adam) state to the CPU and partitions it across GPUs/nodes.

```python
deepspeed_config = {
    "zero_optimization": {
        # sketch of the elided body: stage 2 with CPU offload
        "stage": 2,
        "cpu_offload": True,
    },
}
```
- Stage 3: Optimizer + Gradient + Parameter Partitioning

```python
deepspeed_config = {
    "zero_optimization": {
        # sketch of the elided body: select ZeRO stage 3
        "stage": 3,
    },
}
```


- DeepSpeed ZeRO Infinity (requires an NVMe drive): Optimizer + Gradient + Parameter + Checkpoint Partitioning. Takes advantage of fast reads/writes on NVMe drives to offload the optimizer state and partition it across GPUs/nodes.

```python
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        # sketch of the elided body: NVMe offload targets;
        # `nvme_path` must point at a local NVMe mount
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local/nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local/nvme"},
    },
}
```

- For many more configuration options to tune, see <https://www.deepspeed.ai/docs/config-json>

## Monitoring a Job

You can interactively attach to a running job via `srun --pty --jobid <job-id> bash`. For example, you may now analyze GPU usage using `nvidia-smi`.
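
A quick sketch (the job id is hypothetical):

```shell
srun --pty --jobid 1234567 bash   # open a shell on the job's allocated node
watch -n 1 nvidia-smi             # live GPU utilization, refreshed every second
```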