diff --git a/README.md b/README.md
index 9ac6be7c0a1c6decc1878e5c9d287a3e329eeae9..6033866ff68dc255318a0ea511899a00e603efd3 100644
--- a/README.md
+++ b/README.md
@@ -6,13 +6,12 @@ the forked [Meta OPT codebase](https://github.com/chelseajohn/metaseq.git).
 ## Getting Started
 
 ### Set up
-Assuming you have already set up your environment on the
-supercomputer. If you have not, please see the ["Getting started at
+Please see the ["Getting started at
 JSC"](https://gitlab.jsc.fz-juelich.de/opengptx/infos-public/-/blob/main/documentation/getting_started_at_JSC.md)
-guide. Then
+guide and set up your environment on the JUWELS supercomputer if you have not done so already. Then
 
 - Clone this repository
-- make required changes in `variables.bash`
+- make the required location changes in `variables.bash`
 - execute
 ```
 nice bash setup.bash
@@ -24,7 +23,7 @@ Make required changes in the `jobscript.sh` like adjusting the `#SBATCH` variabl
 ```
 sbatch jobscript.sh
 ```
-**WARNING** : PyTorch >= 1.11 will complain about not being able to handle some address families and tell you that sockets are invalid. This does **not** hinder the code from scaling according to the number of total GPUs.
+**WARNING** : PyTorch >= 1.11 will emit warnings about client socket initialization, e.g. `(errno: 97 - Address family not supported by protocol)`. So far this has **not** hindered the code from scaling to the total number of GPUs assigned.
 
 ### Launch tensorboard for the run
 
@@ -44,7 +43,7 @@ tensorboard serve --logdir="INSERT_TENSORBOARD_LOGDIR" --bind_all
 
 ## Interactive Usage
 
-To work interactively, please activate the environment like this:
+To work interactively, please activate the environment using the following command:
 
 ```
 source activate.bash
@@ -59,7 +58,10 @@ environment, and set the variables specified in `variables.bash`.
 
 - JUWELS Cluster
 - JUWELS Booster
 
-Supported means tested and the correct CUDA compute architecture will
-be selected. Other machines can easily be supported by adjusting
-`activate.bash`.
+Other machines can easily be supported by adjusting `activate.bash` and setting the correct CUDA compute architecture.
+
+## Tested Models
+Test runs of 15-30 minutes were performed on the following models, training from scratch on the [OSCAR](https://huggingface.co/bigscience/misc-test-data/tree/main/stas) dataset.
+- 125m model
+- 30b model
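
For the "make the required location changes in `variables.bash`" step, the sketch below shows the kind of path settings such a file typically holds. All variable names and paths here are hypothetical placeholders; the real names are defined in the repository's `variables.bash`.

```bash
# Hypothetical sketch of location settings -- variable names and paths are placeholders,
# not the repository's actual definitions. Check variables.bash for the real names.
PROJECT_DIR=/p/project/<your-project>            # project storage on JUWELS (placeholder)
DATA_DIR="$PROJECT_DIR/data/oscar"               # where the training data is kept
CHECKPOINT_DIR="$PROJECT_DIR/checkpoints"        # where checkpoints will be written
TENSORBOARD_LOGDIR="$PROJECT_DIR/tensorboard"    # log directory later passed to tensorboard serve
```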
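
Regarding adjusting the `#SBATCH` variables in `jobscript.sh`: the header below is a minimal sketch using standard SLURM directives with placeholder values for a JUWELS Booster allocation; the actual directives and defaults in the repository's `jobscript.sh` may differ.

```bash
#!/bin/bash
# Illustrative SLURM header only -- placeholder values, not the repository's defaults.
#SBATCH --account=<your-budget>     # compute-time budget account at JSC (placeholder)
#SBATCH --partition=booster         # GPU partition, e.g. on JUWELS Booster
#SBATCH --nodes=2                   # number of nodes to scale across
#SBATCH --gres=gpu:4                # GPUs per node (JUWELS Booster nodes have 4 A100s)
#SBATCH --time=00:30:00             # wall-clock limit, e.g. for a short test run
```

Once the header matches your allocation, submit with `sbatch jobscript.sh` as described above.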
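
On setting the correct CUDA compute architecture when adapting `activate.bash` for another machine: one common way to pin this for PyTorch builds is the standard `TORCH_CUDA_ARCH_LIST` environment variable; whether `activate.bash` uses this exact mechanism is an assumption, so treat this as a sketch.

```bash
# Assumed approach: pin the CUDA compute capability via PyTorch's standard build hint.
# 8.0 corresponds to NVIDIA A100 GPUs (as on JUWELS Booster); adjust for other hardware.
export TORCH_CUDA_ARCH_LIST="8.0"
```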