1. Sign up at [JuDoor](https://judoor.fz-juelich.de/register).
2. Follow the [Jülich supercomputer setup tutorial](https://gitlab.version.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_dl_2021/course-material/-/blob/master/tutorials/day1/tutorial1/Tutorial1.ipynb) (it contains some unrelated statements, as it was written as a course introduction) so that you can SSH to the machines and start compute jobs via Slurm.

3. Your home directory is very limited in space. We therefore create a personal directory at another location (`mkdir -p /p/project/ccstdl/$USER`) and link it into the home directory (`ln -s /p/project/ccstdl/$USER ~`). You should now always use `~/$USER` as your "real" home directory. In the same vein, it is also a good idea to move and link `~/.cache`:

    ```shell
    [ -d ~/.cache ] && mv ~/.cache /p/project/ccstdl/$USER/.cache
    mkdir -p /p/project/ccstdl/$USER/.cache
    ln -s /p/project/ccstdl/$USER/.cache ~
    ```

4. Use `~/$USER` for code and create a scratch directory (`mkdir -p /p/scratch/ccstdl/$USER`) for temporary (processed) data. Directories in `/p/scratch` are completely wiped every few months, so be careful about leaving important data there.

5. If you want to submit very large datasets, join the [datasets project](https://judoor.fz-juelich.de/projects/datasets/) and have a look at `/p/largedata/datasets`. Please store each dataset in `/p/largedata` as a single file, i.e. `tar` and/or compress large collections of files into one archive.

    For training, copy the datasets into `/p/scratch/ccstdl/$USER`, as compute nodes do not have access to `/p/largedata`. `/p/scratch/ccstdl/$USER` is also a good location for smaller datasets. (A short sketch of this workflow follows right after this list.)
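
As a rough sketch of the workflow from steps 4 and 5 (the archive name and the exact target directory under `/p/largedata` are placeholders; check the datasets project for the correct location):

```shell
# Pack many small files into a single archive for /p/largedata (placeholder names).
tar -czf my_dataset.tar.gz my_dataset/
cp my_dataset.tar.gz /p/largedata/datasets/     # hypothetical target directory

# Stage the data to scratch for training, since compute nodes cannot read /p/largedata.
mkdir -p /p/scratch/ccstdl/$USER
tar -xzf /p/largedata/datasets/my_dataset.tar.gz -C /p/scratch/ccstdl/$USER
```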
## Cluster List
```shell
wget --continue http://batbot.tv/ai/models/imagenet_16384.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/imagenet_16384.yaml
```
_Please see [https://github.com/CompVis/taming-transformers](https://github.com/CompVis/taming-transformers) if any mirrors fail._

The most recent addition from Heidelberg as of this writing is the GumbelVQGAN trained on Open Images:
### Queue Training
Depending on the supercomputer you are on, you have to change the `--partition` in the `sbatch` script you want to use. The `sinfo` command lists all partitions for the machine you are on; look out for names like `develgpus` or `develbooster`.

Once the partitions are configured correctly, you should be able to start a DALLE-pytorch training job using `sbatch <script.sbatch>`! The scripts use an example dataset that is also located at `/p/scratch/ccstdl/ebert1/dalle`.
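
For orientation, here is a minimal sketch of the header of such an `sbatch` script; the partition, account, resource, and time values are placeholders and will differ from the actual scripts in `/p/scratch/ccstdl/ebert1/dalle`:

```shell
#!/bin/bash
# Minimal header sketch only; the real scripts contain the full training setup.
#SBATCH --partition=develbooster   # pick a partition listed by `sinfo`, e.g. develgpus or develbooster
#SBATCH --account=ccstdl           # placeholder: your compute project / budget
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00

# ... the actual DALLE-pytorch training command goes here ...
```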
If everything runs fine, you can change the paths in the `sbatch` scripts according to your locations.
### Monitoring a Job
You can interactively attach to a running job via `srun --pty --jobid <job-id> bash`. For example, you may now analyze GPU usage using `nvidia-smi`.
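
A typical check, putting these commands together (the job ID below is a placeholder):

```shell
squeue --user $USER                # look up the job ID of your running job
srun --pty --jobid 1234567 bash    # attach an interactive shell to that job (placeholder ID)
nvidia-smi                         # inside the job: check GPU utilization
```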
## Advanced Configuration
### Queue Training - Horovod
Example: Run on a 4xV100 dev instance (2-hour time limit) on JUWELS. `--flops_profiler` will stop training after 200 steps. Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first. Change remaining parameters as needed.

```shell
tmux
sbatch horovod_dalle.sbatch
```
Depending on current load/priority, it may take a few minutes before your job is queued. To check the current status of your job:

1. tmux-split-pane: `Ctrl-b "`
2. `watch squeue --user $USER # check queue status every 2 s`
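
Once the job starts running, you can also follow its output, e.g. in another tmux pane. The log file name below assumes the Slurm default (`slurm-<jobid>.out`); it will differ if the script sets `#SBATCH --output`:

```shell
squeue --user $USER          # note the job ID once the job is running
tail -f slurm-1234567.out    # follow the training log (placeholder job ID / default file name)
```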
### Distributed Training - Horovod
There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle`; copy them into your DALLE-pytorch directory:

```shell
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
An example of a run – on two nodes across 8xV100 using DeepSpeed (ZeRO-optimization disabled):
- Stage 2 (aka ZeRO-Offload): Optimizer + Gradient State Partitioning

Offload the optimizer (e.g. Adam) state to the CPU and partition it across GPUs/nodes.

```python
deepspeed_config = {
    # ...
}
```
- DeepSpeed ZeRO Infinity (Requires NVMe drive): Optimizer + Gradient + Parameter + Checkpoint partitioning

Takes advantage of fast read/write to NVMe drives to offload the optimizer state to the CPU and partition it across GPUs/nodes.

```python
deepspeed_config = {
    # ...
}
```
- For a lot more configuration options to tune, see <https://www.deepspeed.ai/docs/config-json>.