... | ... | @@ -64,6 +64,12 @@ If everything runs fine, you can change the paths in the `sbatch` scripts accord |
DeepSpeed currently has issues with our software stack, so for the time being you cannot train on more than one node: any additional nodes are ignored. Please request only one node for now. Once the issues are fixed, `sbatch` scripts should launch the training script with the `srun` command instead of the `deepspeed` binary. Check the `sbatch` scripts at `/p/scratch/ccstdl/ebert1/dalle` for updates.
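
Since extra nodes are silently ignored, it can help to fail early when a job accidentally requests more than one. The following is only a sketch of such a guard, not part of the provided `sbatch` scripts; the function name `check_single_node` is made up for illustration, and it assumes Slurm exports the node count as `SLURM_JOB_NUM_NODES` inside the job.

```shell
# Hypothetical guard: succeed only if the job uses a single node.
# Takes the node count as an optional argument; otherwise falls back to
# SLURM_JOB_NUM_NODES (set by Slurm inside a job), and to 1 if that is unset.
check_single_node() {
    nodes="${1:-${SLURM_JOB_NUM_NODES:-1}}"
    if [ "$nodes" -gt 1 ]; then
        echo "error: ${nodes} nodes requested, but only one node is usable for now" >&2
        return 1
    fi
    return 0
}

# Possible usage near the top of an sbatch script, before training starts:
#   check_single_node || exit 1
```

Placing the check before the training command means a misconfigured job stops immediately instead of holding nodes that would stay idle.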
## (Data Parallel) Training with Horovod
*ToDo*
## Training with DeepSpeed
*ToDo*
## Monitoring a Job
You can attach interactively to a running job via `srun --pty --jobid <job-id> bash`. From the attached shell you can then, for example, inspect GPU usage with `nvidia-smi`.
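
As a rough sketch of that workflow, the helper below only builds and prints the `srun` command for a given job ID, so you can review it before running it. The name `attach_cmd` and the job ID `1234567` are placeholders for illustration. Depending on the Slurm version, an additional `--overlap` flag may be needed to share resources with the running job; whether that applies here is an assumption.

```shell
# Hypothetical helper: print the srun command for attaching to a running job.
# Usage: attach_cmd <job-id> [command...]
attach_cmd() {
    job_id="$1"
    shift
    # Default to an interactive shell when no command is given.
    if [ "$#" -eq 0 ]; then
        set -- bash
    fi
    echo "srun --pty --jobid ${job_id} $*"
}

# Find the ID of your running job first (squeue is part of Slurm):
#   squeue -u "$USER"
# Then print the command to run, e.g. for a placeholder job ID:
#   attach_cmd 1234567              # interactive shell
#   attach_cmd 1234567 nvidia-smi   # one-off GPU usage snapshot
```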
\ No newline at end of file |