guide and set up your environment on the JUWELS supercomputer, if you have not done so yet. Then:
- Clone this repository
- Make the required location changes in `variables.bash`
- Execute:
```
nice bash setup.bash
```

Make the required changes in `jobscript.sh`, such as adjusting the `#SBATCH` variables ...
```
sbatch jobscript.sh
```
**WARNING**: PyTorch >= 1.11 will throw warnings about client socket initializations and `(errno: 97 - Address family not supported by protocol)`. So far, this has **not** hindered the code from scaling to the total number of GPUs assigned.

To work interactively, activate the environment using the following command:
```
source activate.bash
```

... environment, and set the variables specified in `variables.bash`.

- JUWELS Cluster
- JUWELS Booster

Supported means tested, and that the correct CUDA compute architecture will be selected. Other machines can easily be supported by adjusting `activate.bash` and setting the correct CUDA architecture.
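
Supporting another machine largely comes down to choosing the CUDA compute architecture that matches its GPUs. The following is only a minimal sketch of that idea, not the logic actually used in `activate.bash`: the function name `cuda_arch_for_machine` and the machine names passed to it are hypothetical, and exporting `TORCH_CUDA_ARCH_LIST` (the variable PyTorch reads when compiling CUDA extensions) is an assumption about how the value might be consumed. The compute capabilities shown are 7.0 for the V100 GPUs of JUWELS Cluster and 8.0 for the A100 GPUs of JUWELS Booster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a machine name to a CUDA compute capability.
# The machine names are illustrative and not taken from `activate.bash`.
cuda_arch_for_machine() {
    case "$1" in
        juwelsbooster) echo "8.0" ;;  # NVIDIA A100
        juwels)        echo "7.0" ;;  # NVIDIA V100
        *)
            # Unknown machine: report on stderr and signal failure.
            echo "unknown machine: $1" >&2
            return 1
            ;;
    esac
}

# Example: select the architecture for the Booster and export it.
# Assumption: the build picks the value up from TORCH_CUDA_ARCH_LIST.
if arch="$(cuda_arch_for_machine juwelsbooster)"; then
    export TORCH_CUDA_ARCH_LIST="$arch"
fi
```

Under these assumptions, adding a machine would mean adding one more `case` branch with the compute capability of its GPUs.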
## Tested Models
Test runs of 15-30 minutes were performed on the following models, training from scratch on the [OSCAR](https://huggingface.co/bigscience/misc-test-data/tree/main/stas) dataset.