There are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle`. Copy them into your `DALLE-pytorch` directory:

```sh
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
## Train DALLE-pytorch

### Choose a Variational Autoencoder

The VAE is responsible for representing images efficiently: it is pretrained to compress each image into a short sequence of discrete tokens, which the DALL-E transformer then models. You can use the discrete VAE released by OpenAI, one of the VQGAN flavors from Heidelberg, or train your own discrete VAE from scratch.

For example, download the checkpoint and config of the Gumbel VQGAN (f=8, 8192-entry codebook):

```sh
wget --continue http://batbot.tv/ai/models/gumbel_f8_8192.ckpt -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.ckpt
wget --continue https://heibox.uni-heidelberg.de/seafhttp/files/a5b2f0d5-bccd-4421-a9a5-864df8659560/model.yaml -O /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.yaml
```
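To use the downloaded VQGAN, point the training script at both files. A minimal sketch, assuming your checkout of `train_dalle.py` exposes the `--taming` switch together with `--vqgan_model_path`/`--vqgan_config_path` arguments (verify the exact flag names with `python train_dalle.py --help`):

```sh
# Hypothetical invocation; flag names depend on your DALLE-pytorch version.
python train_dalle.py \
    --taming \
    --vqgan_model_path /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.ckpt \
    --vqgan_config_path /p/scratch/ccstdl/${HOME_PATH}/vqgan_models/gumbel_f8_8192.yaml \
    --image_text_folder /path/to/your/dataset
```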
## Running jobs

### Queue Training

Depending on the supercomputer you are on, you have to change the `--partition` setting in the `sbatch` script you want to use. The `sinfo` command lists all partitions of the machine you are on; look out for names like `develgpus` or `develbooster`.
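
For orientation, `--partition` lives in the `#SBATCH` header at the top of each script. A sketch with placeholder values (take the real resource requests from the scripts you copied):

```sh
#!/bin/bash
#SBATCH --partition=develbooster  # placeholder; pick a partition listed by `sinfo`
#SBATCH --nodes=1                 # placeholder resource requests
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00           # dev partitions enforce short time limits
```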
Once the partitions are configured correctly, you should be able to start a DALLE-pytorch training job using `sbatch <script.sbatch>`! These jobs use an example dataset that is also located at `/p/scratch/ccstdl/ebert1/dalle`.

If everything runs fine, you can change the paths in the `sbatch` scripts to point at your own data and output locations.
## Data Parallel

### Horovod

### Queue Training - Horovod
Example: run on a 4xV100 dev instance (2-hour time limit) on JUWELS. Passing `--flops_profiler` will stop training after 200 steps.

```sh
srun -A cstdl --cpu-bind=none \
    ...
```
### DeepSpeed

### Queue Training - DeepSpeed

Run 200 steps across 4xV100 using DeepSpeed (ZeRO optimization disabled):
```sh
deepspeed train_dalle.py \
    ...
```
#### Configure DeepSpeed ZeRO Offload/Infinity
DeepSpeed allows you to increase the total parameter count of your model beyond what would typically fit on a single GPU.

With its default configuration, `DeepSpeed` behaves much like `horovod`: plain data parallelism. By modifying the configuration, DeepSpeed enables various degrees of CPU offloading for network and optimizer state, as well as clever state partitioning across multiple GPUs and/or nodes. ZeRO stage 3 is recommended for the full benefits of training on the supercomputer. Using this configuration with 16-bit precision is presently as close as DALLE-pytorch gets to the training regime used by OpenAI for DALL-E.

> Stages 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.

In order to change the DeepSpeed stage, find the Python dict named `deepspeed_config` in `train_dalle.py` and modify its `zero_optimization` entry.
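
A sketch of what that entry might look like with ZeRO stage 3 and CPU offloading enabled; treat it as an assumption rather than the canonical config, since the exact contents of `deepspeed_config` depend on your checkout of `train_dalle.py` and your DeepSpeed version (see the DeepSpeed ZeRO documentation for the full schema):

```python
deepspeed_config = {
    # ... existing entries (batch size, fp16 settings, ...) stay as they are ...
    'zero_optimization': {
        # 0 = disabled; 1 = optimizer state partitioning;
        # 2 = + gradient partitioning; 3 = + parameter partitioning
        'stage': 3,
        # ZeRO-Offload/Infinity: push optimizer state to host RAM
        'offload_optimizer': {
            'device': 'cpu',
        },
    },
}
```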