@@ -20,7 +20,6 @@ There are several options for supercomputer machines to choose from. Find a [lis

We'll now set up your environment so you can start DALLE-pytorch training runs. The setup tutorial briefly mentions the JSC module system at the end. Use the provided modules wherever possible, since they are compiled and optimized for each cluster. Because DeepSpeed is currently an experimental package, it is not included in the default meta module, so we tell the module system to look in `$OTHERSTAGES` for additional meta modules:
```sh
ml purge
ml use $OTHERSTAGES
ml Stages/2020
@@ -32,17 +31,18 @@ ml NCCL/2.8.3-1-CUDA-11.0
ml PyTorch/1.7.0-Python-3.8.5
ml torchvision/0.8.2-Python-3.8.5
ml Horovod/0.20.3-Python-3.8.5
```
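
Once the modules are loaded, it can help to confirm that the Python packages they provide are importable before installing anything on top of them. The following is a minimal sketch and not part of the original tutorial: `check_py_module` is a hypothetical helper, and the import names `torch`, `torchvision`, and `horovod` are assumed to correspond to the modules loaded above.

```sh
# Hypothetical helper (not from the tutorial): report whether a Python
# module can be imported by the interpreter currently on PATH.
check_py_module() {
    # Prefer `python` (as used in this tutorial), fall back to `python3`.
    py=$(command -v python || command -v python3) || {
        echo "no python interpreter found"
        return 1
    }
    if "$py" -c "import $1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

# Assumed import names for the modules loaded above.
for mod in torch torchvision horovod; do
    check_py_module "$mod" || break
done
```

A `missing:` line for any of these suggests that the corresponding `ml` command did not take effect in the current shell.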
-Now let's set up DALLE-pytorch and its dependencies. We would like to use a Python `venv` here but currently these cause trouble with DeepSpeed, so we have to install to our user directory:
+Now let's set up DALLE-pytorch and its dependencies. We use a Python `venv` here to separate the project Python environment from our user Python environment. If this causes trouble, try again without the `venv`, appending `--user` to the install commands, thus installing to your user directory:
```sh
cd ~/$USER
git clone https://github.com/lucidrains/DALLE-pytorch
cd DALLE-pytorch
-python setup.py install --user
-python -m pip install --user wandb
+python -m venv --system-site-packages env
+source env/bin/activate
+python setup.py install
+python -m pip install wandb
```
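
Because the install target depends on whether the `venv` is active, a quick check before running the install commands can avoid installing into the wrong place. This is a minimal sketch, assuming the standard behaviour that the `activate` script exports `VIRTUAL_ENV`; `install_target` is a hypothetical helper name, not something the tutorial provides.

```sh
# Hypothetical helper (not from the tutorial): report whether a venv
# is active in the current shell.
install_target() {
    # The venv `activate` script exports VIRTUAL_ENV; it is unset otherwise.
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        echo "venv: $VIRTUAL_ENV"
    else
        echo "no venv active: append --user to the install commands"
    fi
}

install_target
```

Run it after `source env/bin/activate`. If it reports no active venv, either activate the environment again or fall back to the `--user` installs.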
For simplicity, we disable WandB in the `sbatch` scripts (see below), since its results can't be uploaded from the compute nodes. If you want to use it anyway, look into its offline mode.
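
If you do want WandB logging, its offline mode records runs locally so they can be uploaded later from a machine with internet access. The following is a minimal sketch, assuming the `WANDB_MODE` environment variable and the `wandb sync` command behave as described in the WandB documentation. The run directory in the comment is a placeholder, and whether a login node can reach the WandB servers is an assumption to verify on your system.

```sh
# In the sbatch script, before starting training:
# record runs locally instead of contacting the WandB servers.
export WANDB_MODE=offline

# Later, from a node with internet access (possibly a login node),
# upload a recorded run. <run-directory> is a placeholder:
#   wandb sync <run-directory>
```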