# Getting started at JSC

## Setup
|
1. Sign up at [JuDoor](https://judoor.fz-juelich.de/register).

2. Follow the [Jülich supercomputer setup tutorial](https://gitlab.version.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/intro_scalable_dl_2021/course-material/-/blob/master/tutorials/day1/tutorial1/Tutorial1.ipynb) (it contains some course-specific remarks, as it was originally written as a course introduction) to learn how to SSH into the machines and start compute jobs via Slurm.
|
3. Your home directory is very limited in space. We therefore create a personal directory at another location (`mkdir -p /p/project/ccstdl/$USER`) and link it into the home directory (`ln -s /p/project/ccstdl/$USER ~`). From now on, always use `~/$USER` as your "real" home directory. In the same vein, it is also a good idea to move and link `~/.cache`:

   `[ -d ~/.cache ] && mv ~/.cache /p/project/ccstdl/$USER/.cache; mkdir -p /p/project/ccstdl/$USER/.cache; ln -s /p/project/ccstdl/$USER/.cache ~`
|
4. Use `~/$USER` for code and create `/p/scratch/ccstdl/$USER` (`mkdir -p /p/scratch/ccstdl/$USER`) for temporary (processed) data. Directories in `/p/scratch` are completely wiped every few months, so be careful about leaving important data there.
|
5. If you want to submit very large datasets, join the [datasets project](https://judoor.fz-juelich.de/projects/datasets/) and have a look at `/p/largedata/datasets`. Please store each dataset in `/p/largedata` as a single file, i.e. `tar` and/or compress large collections of files into one archive.
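That last step can be sketched as follows (`my_dataset` is a hypothetical directory name; the actual target path under `/p/largedata` depends on your project):

```shell
# Pack a dataset directory into a single compressed archive
# before copying it to /p/largedata ("my_dataset" is hypothetical).
mkdir -p my_dataset
printf 'sample\n' > my_dataset/part0.txt
tar -czf my_dataset.tar.gz my_dataset
# Inspect the archive contents to check everything made it in
tar -tzf my_dataset.tar.gz
```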
|
## Cluster List

There are several supercomputers to choose from; find a [list of machines here](https://fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/supercomputers_node.html). Select a machine in the list on the right, then open the "Configuration" entry that appears under the machine's name. There you can see the hardware setup of the cluster's individual modules.
|
# Training DALL-E

## Setting up DALLE-pytorch

We'll now set up your environment so you can start DALLE-pytorch training runs. The setup tutorial briefly mentions JSC's module system at the bottom. We want to use the provided modules wherever possible, as they are compiled and optimized for each cluster. Since DeepSpeed is currently an experimental package, it is not included in the default meta-module; we tell the module system to look at `$OTHERSTAGES` for additional meta-modules:
|
```sh
module use $OTHERSTAGES
module purge
module load Stages/Devel-2020 GCC OpenMPI DeepSpeed
```
|
Now let's set up DALLE-pytorch and its dependencies. We would like to use a Python `venv` here, but these currently cause trouble with DeepSpeed, so we have to install into our user site directory instead:
|
```sh
cd ~/$USER
git clone https://github.com/lucidrains/DALLE-pytorch
cd DALLE-pytorch
python setup.py install --user
python -m pip install --user wandb
```
|
For simplicity, we disable WandB as we can't upload its results from the compute nodes (you may look into its offline mode if you are interested in using it):
|
```sh
# Append "mode = 'disabled'," after every line containing "wandb.init(":
sed -i '/wandb\.init(/a \
mode = '"'disabled'," train_*.py
```
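To see what the command does, here it is applied to a throwaway stand-in for a training script (the file name and contents are hypothetical):

```shell
# Create a minimal stand-in for one of the train_*.py scripts
cat > train_demo.py <<'EOF'
import wandb
wandb.init(
    project='dalle_train',
)
EOF
# Append "mode = 'disabled'," after the line matching wandb.init(
sed -i '/wandb\.init(/a \
mode = '"'disabled'," train_demo.py
cat train_demo.py
```

The appended line is not indented, but Python does not care about indentation inside the open parentheses, so the call stays valid.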
|
We do not have internet access from the compute nodes, so we cannot download checkpoints while our script is running. You can find common material in `/p/scratch/ccstdl/ebert1/dalle`, including downloaded checkpoints. We link these so DALLE-pytorch can find them at the expected location:
|
```sh
mkdir -p ~/.cache/dalle
ln -s /p/scratch/ccstdl/ebert1/dalle/checkpoints/* ~/.cache/dalle
```
|
Additionally, there are up-to-date `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle`. Copy these into your local DALLE-pytorch clone:
|
```sh
cp /p/scratch/ccstdl/ebert1/dalle/*.sbatch ~/$USER/DALLE-pytorch
```
|
## Starting a Training Job
|
Depending on the supercomputer you are on, you have to change the `--partition` setting in the `sbatch` script you want to use. The `sinfo` command lists all partitions on the machine you are on; look for names like `develgpus` or `develbooster`.
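For reference, the partition is set in the `#SBATCH` header at the top of the script. A minimal sketch (the partition and resource values here are examples, not the ones from the provided scripts):

```shell
#!/bin/bash
#SBATCH --partition=develbooster  # pick a partition listed by `sinfo`
#SBATCH --nodes=1                 # single node; see the DeepSpeed note below
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00
```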
|
Once the partition is configured correctly, you should be able to start a DALLE-pytorch training job using `sbatch <script.sbatch>`! The provided scripts use an example dataset, also located at `/p/scratch/ccstdl/ebert1/dalle`.
|
If everything runs fine, you can change the paths in the `sbatch` scripts to point to your own code and data locations.
|
Currently, there are issues between DeepSpeed and our software stack, so for the time being you will not be able to train on more than one node (additional nodes are silently ignored); please request only a single node for now. Once the issues are fixed, the `sbatch` scripts should start training via the `srun` command rather than the `deepspeed` binary. Check the `sbatch` scripts in `/p/scratch/ccstdl/ebert1/dalle` for updates.
|
## Monitoring a Job
|
You can interactively attach to a running job via `srun --pty --jobid <job-id> bash`. For example, you can then inspect GPU usage with `nvidia-smi`.
|
|
\ No newline at end of file |