We'll now set up your environment so you can start DALLE-pytorch training runs.

```sh
module use $OTHERSTAGES
module purge
module load Stages/Devel-2020 GCC/9.3.0 OpenMPI DeepSpeed
```
Now let's set up DALLE-pytorch and its dependencies. We would like to use a Python `venv` here, but these currently cause trouble with DeepSpeed, so we install to our user directory instead:

```sh
python setup.py install --user
python -m pip install --user wandb
```

For simplicity, we disable WandB in the `sbatch` scripts (see below) as we can't upload its results from the compute nodes (you may look into its offline mode if you are interested in using it).
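
One way to disable WandB from an `sbatch` script is via its standard environment variable (a sketch based on WandB's documented behavior, not taken from the actual scripts):

```sh
# Disable WandB for this job. Setting WANDB_MODE=offline instead
# would record runs locally for a later `wandb sync`.
export WANDB_MODE=disabled
```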
We do not have internet access from the compute nodes, so we cannot download checkpoints while our script is running. You can find common material in `/p/scratch/ccstdl/ebert1/dalle`, including downloaded checkpoints. We link these so DALLE-pytorch can find them at the expected location:
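
For example, the linking could look like the following sketch; the cache directory `~/.cache/dalle` is an assumption, not necessarily the location DALLE-pytorch actually expects:

```sh
# Create the assumed cache directory and link the pre-downloaded
# checkpoints into it, so no internet access is needed at run time.
mkdir -p ~/.cache/dalle
ln -sf /p/scratch/ccstdl/ebert1/dalle/*.ckpt ~/.cache/dalle/
```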

Once the partitions are configured correctly, you should be able to start a DALLE-pytorch training run.
If everything runs fine, you can change the paths in the `sbatch` scripts to match your own locations.
Currently, there are issues with DeepSpeed and our software stack, so for the time being you will not be able to train on more than one node (additional nodes are simply ignored); please request only a single node for now. Once the issues are fixed, `sbatch` scripts should start the training script with the `srun` command instead of the `deepspeed` binary. Check the `sbatch` scripts at `/p/scratch/ccstdl/ebert1/dalle` for updates.
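
Once that works, the change in an `sbatch` script would look roughly like this sketch (script name, resource numbers, and `[args]` are placeholders, not taken from the actual scripts):

```sh
#!/bin/bash
#SBATCH --nodes=1            # keep at 1 until the DeepSpeed issues are fixed
#SBATCH --ntasks-per-node=4  # one task per GPU (placeholder values)
#SBATCH --gres=gpu:4

# Current single-node launch via the deepspeed binary:
# deepspeed train_dalle.py [args]

# Future multi-node launch via srun instead:
srun python train_dalle.py [args]
```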
## (Data Parallel) Training with Horovod
*ToDo*