1. Sign up at [JuDoor](https://judoor.fz-juelich.de/register).
2. Follow the [Jülich supercomputer setup tutorial](https://gitlab.version.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_dl_2021/course-material/-/blob/master/tutorials/day1/tutorial1/Tutorial1.ipynb) (it contains some unrelated statements, as it was written as a course introduction) so that you can SSH to the machines and start compute jobs via Slurm.

3. Your home directory is very limited in space. We therefore create a personal directory at another location (`mkdir -p /p/project/ccstdl/$USER`) and link it into the home directory (`ln -s /p/project/ccstdl/$USER ~`). You should now always use `~/$USER` as your "real" home directory. In the same vein, it is also a good idea to move and link `~/.cache`:

    ```shell
    [ -d ~/.cache ] && mv ~/.cache /p/project/ccstdl/$USER/.cache
    mkdir -p /p/project/ccstdl/$USER/.cache
    ln -s /p/project/ccstdl/$USER/.cache ~
    ```

4. Use `~/$USER` for code and create a scratch directory (`mkdir -p /p/scratch/ccstdl/$USER`) for temporary (processed) data. Directories in `/p/scratch` are completely wiped every few months, so be careful about leaving important data there.

5. If you want to submit very large datasets, join the [datasets project](https://judoor.fz-juelich.de/projects/datasets/) and have a look at `/p/largedata/datasets`. Please store each dataset in `/p/largedata` as a single file, i.e. `tar` and/or compress large collections of files into one archive.

    For training, copy the datasets into `/p/scratch/ccstdl/$USER`, as compute nodes do not have access to `/p/largedata`. `/p/scratch/ccstdl/$USER` is also a good location for smaller datasets. (A short sketch of this workflow follows right after this list.)
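
As a rough sketch of the workflow from steps 4 and 5 (the archive name and the exact target directory under `/p/largedata` are placeholders; check the datasets project for the correct location):

```shell
# Pack many small files into a single archive for /p/largedata (placeholder names).
tar -czf my_dataset.tar.gz my_dataset/
cp my_dataset.tar.gz /p/largedata/datasets/     # hypothetical target directory

# Stage the data to scratch for training, since compute nodes cannot read /p/largedata.
mkdir -p /p/scratch/ccstdl/$USER
tar -xzf /p/largedata/datasets/my_dataset.tar.gz -C /p/scratch/ccstdl/$USER
```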
## Cluster List
```shell
wget --continue http://batbot.tv/ai/models/imagenet_16384.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/imagenet_16384.yaml
```
_Please see [https://github.com/CompVis/taming-transformers](https://github.com/CompVis/taming-transformers) if any mirrors fail._

The most recent addition from Heidelberg as of this writing is the GumbelVQGAN trained on Open Images:
### Queue Training
Depending on the supercomputer you are on, you have to change the `--partition` in the `sbatch` script you want to use. The `sinfo` command lists all partitions for the machine you are on; look out for names like `develgpus` or `develbooster`.

Once the partitions are configured correctly, you should be able to start a DALLE-pytorch training job using `sbatch <script.sbatch>`! The scripts use an example dataset that is also located at `/p/scratch/ccstdl/ebert1/dalle`.
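
For orientation, here is a minimal sketch of the header of such an `sbatch` script; the partition, account, resource, and time values are placeholders and will differ from the actual scripts in `/p/scratch/ccstdl/ebert1/dalle`:

```shell
#!/bin/bash
# Minimal header sketch only; the real scripts contain the full training setup.
#SBATCH --partition=develbooster   # pick a partition listed by `sinfo`, e.g. develgpus or develbooster
#SBATCH --account=ccstdl           # placeholder: your compute project / budget
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00

# ... the actual DALLE-pytorch training command goes here ...
```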
If everything runs fine, you can change the paths in the `sbatch` scripts according to your locations.
### Monitoring a Job
You can interactively attach to a running job via `srun --pty --jobid <job-id> bash`. For example, you may now analyze GPU usage using `nvidia-smi`.
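
A typical check, putting these commands together (the job ID below is a placeholder):

```shell
squeue --user $USER                # look up the job ID of your running job
srun --pty --jobid 1234567 bash    # attach an interactive shell to that job (placeholder ID)
nvidia-smi                         # inside the job: check GPU utilization
```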
## Advanced Configuration
### Queue Training - Horovod
Example: Run on a 4xV100 dev instance (2-hour time limit) on JUWELS. `--flops_profiler` will stop training after 200 steps. Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first. Change remaining parameters as needed.

```shell
tmux
sbatch horovod_dalle.sbatch
```
Depending on current load/priority, it may take a few minutes before your job is queued. To check the current status of your job:

1. tmux-split-pane: `Ctrl-b "`
2. `watch squeue --user $USER # check queue status every 2 s`
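
Once the job starts running, you can also follow its output, e.g. in another tmux pane. The log file name below assumes the Slurm default (`slurm-<jobid>.out`); it will differ if the script sets `#SBATCH --output`:

```shell
squeue --user $USER          # note the job ID once the job is running
tail -f slurm-1234567.out    # follow the training log (placeholder job ID / default file name)
```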
### Distributed Training - Horovod
There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle`; copy them into your DALLE-pytorch directory:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
An example of a run – on two nodes across 8xV100 using DeepSpeed (ZeRO-optimization disabled):
- Stage 2 (aka ZeRO-Offload): Optimizer + Gradient State Partitioning

Offload the optimizer (e.g. Adam) state to the CPU and partition it across GPUs/nodes.

```python
deepspeed_config = {
    # ...
}
```
- DeepSpeed ZeRO Infinity (Requires NVMe drive): Optimizer + Gradient + Parameter + Checkpoint partitioning

Takes advantage of fast read/write to NVMe drives to offload the optimizer state to the CPU and partition it across GPUs/nodes.

```python
deepspeed_config = {
    # ...
}
```
- For a lot more configuration options to tune, see <https://www.deepspeed.ai/docs/config-json>.