## Getting Started
### Set up
Please see the ["Getting started at JSC"](https://gitlab.jsc.fz-juelich.de/opengptx/infos-public/-/blob/main/documentation/getting_started_at_JSC.md) guide and set up your environment on the JUWELS supercomputer if you have not done so already. Then (a rough sketch of these steps follows the list):
- Clone this repository
- make the required location changes in `variables.bash`
- execute
```
nice bash setup.bash
```
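As a rough sketch of the steps above (the repository URL and directory name are placeholders, not taken from this README):
```
# Illustrative walk-through only: clone, adjust paths, then build the environment.
git clone <this-repository-url> opt-training
cd opt-training
nano variables.bash       # or any editor: point the location variables at your own project/scratch paths
nice bash setup.bash      # "nice" keeps the build from monopolizing the shared login node
```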
Make the required changes in `jobscript.sh`, such as adjusting the `#SBATCH` variables, and submit it with:
```
sbatch jobscript.sh
```
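For example, the `#SBATCH` header might be adjusted along these lines; the account, partition, and resource values below are placeholders rather than defaults shipped with this repository:
```
#!/bin/bash
#SBATCH --account=<your-compute-budget>
#SBATCH --partition=booster     # or a JUWELS Cluster GPU partition
#SBATCH --nodes=2
#SBATCH --gres=gpu:4            # JUWELS Booster nodes each provide 4 GPUs
#SBATCH --time=00:30:00
```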
**WARNING**: PyTorch >= 1.11 will emit warnings about client socket initialization and `(errno: 97 - Address family not supported by protocol)`. So far this has **not** hindered the code from scaling to the total number of GPUs assigned.
### Launch TensorBoard for the run
```
tensorboard serve --logdir="INSERT_TENSORBOARD_LOGDIR" --bind_all
```
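One possible way to reach the dashboard from a local machine (not covered in this README; the login-node host is a placeholder) is to forward TensorBoard's default port 6006 over SSH and then open `http://localhost:6006` locally:
```
ssh -L 6006:localhost:6006 <user>@<juwels-login-node>
```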
## Interactive Usage
To work interactively, activate the environment using the following command:
```
source activate.bash
```
This activates the environment and sets the variables specified in `variables.bash`.
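As a quick, hypothetical sanity check after activation (not part of the original instructions):
```
source activate.bash
python -c "import torch; print(torch.__version__)"    # confirm the PyTorch environment is loaded
```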
- JUWELS Cluster
- JUWELS Booster
Other machines can easily be supported by adjusting `activate.bash` and setting the correct CUDA compute architecture.
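A minimal sketch of what such an adjustment inside `activate.bash` could look like; the hostname pattern and compute capabilities are assumptions about typical JSC hardware, not the repository's actual code:
```
# Assumed mapping: JUWELS Booster uses NVIDIA A100 (8.0), JUWELS Cluster GPU nodes use V100 (7.0).
case "$(hostname)" in
  jwb*) export TORCH_CUDA_ARCH_LIST="8.0" ;;   # JUWELS Booster
  *)    export TORCH_CUDA_ARCH_LIST="7.0" ;;   # JUWELS Cluster
esac
```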
## Tested Models
Test runs of 15-30 minutes were performed on the following models, training from scratch on the [OSCAR](https://huggingface.co/bigscience/misc-test-data/tree/main/stas) dataset:
- 125m model
- 30b model