1. Sign up at [JuDoor](https://judoor.fz-juelich.de/register).
2. Follow the [Jülich supercomputer setup tutorial](https://gitlab.version.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_dl_2021/course-material/-/blob/master/tutorials/day1/tutorial1/Tutorial1.ipynb) (it contains some unrelated material, as it was written as a course introduction) to be able to SSH into the machines and start compute jobs via Slurm.
3. Your home directory is very limited in space. We thus create a personal directory at another location (`mkdir -p /p/project/ccstdl/$USER`) and link it to the home directory (`ln -s /p/project/ccstdl/$USER ~`). You should now always use `~/$USER` as your "real" home directory. In the same vein, it's also a good idea to move and link `~/.cache`: `[ -d ~/.cache ] && mv ~/.cache /p/project/ccstdl/$USER/.cache; mkdir -p /p/project/ccstdl/$USER/.cache; ln -s /p/project/ccstdl/$USER/.cache ~`.
4. Use `~/$USER` for code and create `/p/scratch/ccstdl/$USER` (`mkdir -p /p/scratch/ccstdl/$USER`) for temporary (processed) data. Directories in `/p/scratch` are completely wiped every few months, so be careful about leaving important data there.
5. If you want to submit very large datasets, join the [datasets project](https://judoor.fz-juelich.de/projects/datasets/) and have a look at `/p/largedata/datasets`. Please store each dataset in `/p/largedata` as a single file, i.e. `tar` and/or compress a large number of files into one archive (see the sketch after this list).
* For model training, put datasets into `/p/scratch/ccstdl/$USER`, as compute nodes have no access to `/p/largedata` paths; those are meant only for transferring data to the machines and for storage.
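
A minimal sketch of bundling a dataset before moving it over (`my_dataset/` is a placeholder name):

```shell
# collect many small files into a single compressed archive
tar -czf my_dataset.tar.gz my_dataset/
cp my_dataset.tar.gz /p/largedata/datasets/
```
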
## Cluster List
There are several options for supercomputer machines to choose from.
We'll now set up your environment so you can start DALLE-pytorch training runs. The setup tutorial briefly mentions the module system at JSC at the bottom. We'll want to use the provided modules as often as possible, as they are compiled and optimized for each cluster. Since DeepSpeed is currently an experimental package, it is not included in the default meta module, so we tell the module system to look in `$OTHERSTAGES` for additional meta modules:

```shell
ml purge
ml use $OTHERSTAGES
ml Stages/2020
# ... (remaining module loads elided here) ...
ml Horovod/0.20.3-Python-3.8.5
```

Now let's set up DALLE-pytorch and its dependencies. We use a Python `venv` here to separate the project Python environment from our user Python environment. If this causes trouble, try again without the `venv`, appending `--user` to the install commands, thus installing to your user directory:

```shell
cd ~/$USER
git clone https://github.com/lucidrains/DALLE-pytorch
cd DALLE-pytorch
# sketch of the elided remainder: create and activate the venv, then install
python -m venv venv
source venv/bin/activate
pip install -e .
```

For simplicity, we disable WandB in the `sbatch` scripts (see below), as we can't reach the internet from the compute nodes.

There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle`. Copy these into your local DALLE-pytorch clone:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```

### Choose a Variational Autoencoder
The VAE is responsible for representing images efficiently via pretraining. You can use a VAE released by OpenAI, the various flavors of VQGAN from Heidelberg, or you can train your own discrete VAE from scratch.
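
Which VAE gets used is selected when launching training; here is a minimal sketch, assuming the `train_dalle.py` flags at the time of writing (`$DATASET_PATH` is a placeholder):

```shell
# OpenAI's pretrained discrete VAE (the default when no VAE flag is given)
python train_dalle.py --image_text_folder "$DATASET_PATH"
# a Heidelberg VQGAN via taming-transformers
python train_dalle.py --image_text_folder "$DATASET_PATH" --taming
# a discrete VAE you trained yourself
python train_dalle.py --image_text_folder "$DATASET_PATH" --vae_path ./vae.pt
```
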
We do not have internet access from the compute nodes, so we cannot download checkpoints while our script is running. You can find common material in `/p/scratch/ccstdl/ebert1/dalle`, including downloaded checkpoints. We link these so DALLE-pytorch can find them at the expected location:

```shell
mkdir -p ~/.cache/dalle
ln -s /p/scratch/ccstdl/ebert1/dalle/checkpoints/* ~/.cache/dalle
```

You can also use arbitrary checkpoints matching the VQGAN architecture if you have a `.yaml` and `.ckpt` in the format described in <https://github.com/CompVis/taming-transformers>. For instance:

```shell
HOME_PATH=lastname1   # your personal directory name
mkdir -p /p/scratch/ccstdl/${HOME_PATH}/vqgan_models
wget --continue http://batbot.tv/ai/models/imagenet_16384_slim.ckpt -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/imagenet_16384_slim.ckpt
wget --continue http://batbot.tv/ai/models/imagenet_16384.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/imagenet_16384.yaml
```


_Please see [https://github.com/CompVis/taming-transformers](https://github.com/CompVis/taming-transformers) if any mirrors fail._

The most recent addition from Heidelberg as of this writing is the GumbelVQGAN trained on Open Images:

```shell
wget --continue http://batbot.tv/ai/models/gumbel_f8_8192.ckpt -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.ckpt
wget --continue https://heibox.uni-heidelberg.de/seafhttp/files/a5b2f0d5-bccd-4421-a9a5-864df8659560/model.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.yaml
```

### Queue Training
Depending on the supercomputer you are on, you have to change the `--partition` in the `sbatch` script you want to use. The `sinfo` command lists all partitions for the machine you are on; look out for names like `develgpus` or `develbooster`. Once the partitions are configured correctly, you should be able to start a DALLE-pytorch training job using `sbatch <script.sbatch>`! These will use an example dataset also located at `/p/scratch/ccstdl/ebert1/dalle`.
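
For example, a quick sketch of checking the partitions (output columns vary per site):

```shell
sinfo -s    # one summary line per partition, with node counts and time limits
# then set the matching name in your script, e.g.:
#   #SBATCH --partition=develbooster
```
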
If everything runs fine, you can change the paths in the `sbatch` scripts according to your locations.
### Queue Training - Horovod

Example: run on a 4x V100 dev instance (2 hour time limit) on JUWELS. `--flops_profiler` will stop training after 200 steps. Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first, and change the remaining parameters as needed.

Save the following to a file `horovod_dalle.sbatch`. The middle of the script is elided here; the resource directives and variable definitions below are a sketch based on the description above, so adjust them to your setup:

```shell
#!/usr/bin/env bash
# horovod_dalle.sbatch
#SBATCH --nodes 1
#SBATCH --partition=develgpus   # sketch: pick your dev partition via `sinfo`
#SBATCH --gres=gpu:4            # sketch: 4x V100 dev node
#SBATCH --time=02:00:00         # sketch: the dev time limit is 2 hours

# fill these out first (the LOGFILE naming is a sketch)
HOME_PATH=
CHECKPOINT_NAME=
LOGS_PATH=
LOGFILE="${LOGS_PATH}/dalle-horovod-${SLURM_JOB_ID}.log"

# NOTE: the original script passes further training arguments here
srun -A cstdl --cpu-bind=v \
    python -u train_dalle.py \
    --truncate_captions \
    --flops_profiler \
    --distributed_backend="horovod" | tee "$LOGFILE"
```

Now you can queue the job:

```shell
tmux
sbatch horovod_dalle.sbatch
```


Depending on current load/priority, it may take a few minutes before your job starts. To check its current status, split the tmux pane with `Ctrl-b "` and run `watch squeue --user $USER` (this checks the queue status every 2 s).

### Distributed Training - Horovod
To conduct distributed training across many nodes (each node having up to 4 GPUs), have a look at the [script collection containing examples using Horovod](https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/projects/large_scale_reproducing/dall-e/dalle-pytorch/-/tree/fzj/scripts).

For JUWELS Booster, for instance, adapt `juwelsbooster.sh` with the desired number of nodes. For an example run, you can use `juwelsbooster.sh run_cub.sh`, adapting `run_cub.sh` accordingly.

Further up-to-date `sbatch` scripts are in `/p/scratch/ccstdl/ebert1/dalle` (`hvd_dalle.sbatch`, `hvd_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```

### Distributed Training - DeepSpeed

There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle` (`run_dalle.sbatch`, `run_vae.sbatch`). You can copy these into your local DALLE-pytorch clone:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```


An example run on two nodes across 8x V100, using DeepSpeed with ZeRO optimization disabled. The elided resource directives and dataset path are sketched here:

```shell
#!/usr/bin/env bash
#SBATCH --nodes 2
#SBATCH --gres=gpu:4   # sketch: 4 GPUs per node, 8x V100 in total

DATASET_PATH=          # sketch: fill in your dataset location
export WANDB_MODE=disabled

srun --cpu-bind=v \
    python -u train_dalle.py --image_text_folder "$DATASET_PATH" --deepspeed --fp16
```

#### Configure DeepSpeed ZeRO Offload/Infinity

By default, `DeepSpeed` and `horovod` behave very similarly. By modifying the DeepSpeed configuration, however, you can enable the ZeRO optimizations:

> Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer + gradient state partitioning, and optimizer + gradient + parameter partitioning, respectively.
In order to change the DeepSpeed stage, find the Python dict named `deepspeed_config` in `train_dalle.py` and modify it as follows:
- Stage 1: Optimizer State Partitioning

```python
deepspeed_config = {
    "zero_optimization": {
        # sketch of the elided body: select ZeRO stage 1
        "stage": 1,
    },
}
```


- Stage 2 (aka ZeRO-Offload): Optimizer + Gradient State Partitioning. Offloads the optimizer (e.g. Adam) state to the CPU and partitions it across GPUs/nodes.

```python
deepspeed_config = {
    "zero_optimization": {
        # sketch of the elided body: stage 2 with CPU offload
        "stage": 2,
        "cpu_offload": True,
    },
}
```
- Stage 3: Optimizer + Gradient + Parameter Partitioning

```python
deepspeed_config = {
    "zero_optimization": {
        # sketch of the elided body: select ZeRO stage 3
        "stage": 3,
    },
}
```


- DeepSpeed ZeRO Infinity (requires an NVMe drive): Optimizer + Gradient + Parameter + Checkpoint Partitioning. Takes advantage of fast reads/writes on NVMe drives to offload the optimizer state and partition it across GPUs/nodes.

```python
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        # sketch of the elided body: NVMe offload targets;
        # `nvme_path` must point at a local NVMe mount
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local/nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local/nvme"},
    },
}
```

- For many more configuration options to tune, see <https://www.deepspeed.ai/docs/config-json>

## Monitoring a Job

You can interactively attach to a running job via `srun --pty --jobid <job-id> bash`. For example, you may now analyze GPU usage using `nvidia-smi`.
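
A quick sketch (the job id is hypothetical):

```shell
srun --pty --jobid 1234567 bash   # open a shell on the job's allocated node
watch -n 1 nvidia-smi             # live GPU utilization, refreshed every second
```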