|
|
|
|
|
There are several supercomputers to choose from. Find a [list of machines here](https://fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/supercomputers_node.html). Select a machine from the list on the right, then open the "Configuration" node that appears under the machine's name. There you can see the hardware setup of the cluster's individual modules.
|
|
|
|
|
|
# Training OpenCLIP
|
|
|
|
|
|
For a detailed description of OpenCLIP, please follow the [OpenCLIP Guide](OpenCLIP_Guide.md).
|
|
|
|
|
|
# Training DALL-E
|
|
|
|
|
|
## Setting up DALLE-pytorch
|
```bash
sbatch horovod_dalle.sbatch
```
|
|
|
|
|
Depending on the current load and your job's priority, it may take a few minutes (or longer) before your queued job starts running. To check the current status of your job:
|
|
|
|
|
|
1. tmux-split-pane: `Ctrl-b "`
|
|
|
|
|
|
2. `watch squeue --user $USER # check queue status every 2 s`
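The two steps above can also be run directly from a shell; a minimal sketch, assuming a SLURM cluster (the job ID below is hypothetical):

```shell
# In a second tmux pane (Ctrl-b "), refresh the queue view every 2 seconds:
watch -n 2 squeue --user "$USER"

# One-off status check without watch:
squeue --user "$USER"

# Cancel a job that is no longer needed (job ID is hypothetical):
scancel 1234567
```

`squeue` shows the job state in the `ST` column: `PD` means pending in the queue, `R` means running.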
|
|
|
|
|
|
### Distributed Training – Horovod
|