@@ -79,9 +79,11 @@ Example: Run on a 4xV100 dev instance (2 hour time limit) on JUWELS.

`--flops_profiler` stops training after 200 steps.

Fill out `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH` first, then change the remaining parameters as needed.

Save the following to a file named `horovod_dalle.sbatch`:
```sh
#!/usr/bin/env bash

# horovod_dalle.sbatch
#SBATCH --nodes 1
#SBATCH --tasks-per-node 4
#SBATCH --gres gpu
@@ -144,6 +146,16 @@ srun -A cstdl --cpu-bind=none \
```
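
Because the script depends on `HOME_PATH`, `CHECKPOINT_NAME`, and `LOGS_PATH`, it can help to fail fast when one of them was left empty. The following is a minimal sketch, assuming these are plain shell variables assigned inside the script; `require_vars` is a hypothetical helper for illustration and is not part of this repository.

```sh
# Hypothetical helper: return non-zero if any named variable is unset or empty.
# Written in POSIX sh so it works in both sh and bash; note that it uses the
# global scratch variables `name`, `value`, and `missing`.
require_vars() {
    missing=0
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        # Only pass trusted, literal variable names here, since this uses eval.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage inside horovod_dalle.sbatch, after the variables are assigned:
# require_vars HOME_PATH CHECKPOINT_NAME LOGS_PATH || exit 1
```

With a guard like this, a forgotten placeholder should show up as an error at the top of the job log rather than as a failure partway through startup.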

Now you can queue the job:

```sh
tmux
sbatch horovod_dalle.sbatch
```

Depending on the current load and your priority, it may take a few minutes before your job starts running. To check the current status of your job:

Split the tmux pane with `Ctrl-b "`, then run:

`watch squeue --user $USER # check queue status every 2 s`
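
If you have several jobs in the queue, it may be easier to watch only the one you just submitted. The following is a minimal sketch, assuming `sbatch` prints its usual confirmation line of the form `Submitted batch job <id>`; `job_id_from_sbatch` is a hypothetical helper for illustration.

```sh
# Hypothetical helper: extract the numeric job ID from sbatch's confirmation
# line, which normally looks like "Submitted batch job 123456".
# Prints nothing if the line does not match (e.g. an error message).
job_id_from_sbatch() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Possible usage on a Slurm login node (commented out so the sketch runs anywhere):
# job_id=$(job_id_from_sbatch "$(sbatch horovod_dalle.sbatch)")
# squeue -j "$job_id"          # show only this job
# watch squeue -j "$job_id"    # refresh every 2 s
```

An empty result from the helper means the submission message was not recognised, so it is worth checking the `sbatch` output before polling.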

### Queue Training - DeepSpeed

Run 200 steps across 4xV100 using DeepSpeed (zero-optimization disabled)