# OpenGPT-X with BigScience
This repository contains set-up code for working on JSC machines with the BigScience codebase.
## Aims
We aim to support a set-up for your own specialized working environment. This means we do not pin versions or commits; that would have to be done on an individual or subgroup level.
## Getting Started

### Setting Up
We assume you have already set up your environment on the supercomputer. If you have not, please see the "Getting started at JSC" guide.
First off, clone this repository to a destination of your choice.
Adjust the variables in `./run_scripts/variables.bash` and execute:

```bash
cd run_scripts
nice bash set_up.bash
```
For reasons we have not pinned down, Bash sometimes parses the script incorrectly and it breaks. Please just try again in that case.
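As for `variables.bash`: what you have to set there depends on your project. Purely as a hypothetical illustration (these names and values are made up; the file itself documents the variables it actually defines):

```bash
# Hypothetical example values -- variables.bash documents the real variables.
PROJECT_ACCOUNT=myproject                      # compute budget to charge jobs to
ROOT_OUTPUT_DIR=/p/scratch/myproject/opengptx  # where environments and runs live
```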
### Starting Training
In `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, adjust the `#SBATCH` variables at the top as desired (the most interesting one is the number of `--nodes`) and execute:

```bash
cd run_scripts
sbatch tr11-176B-ml_juwels_pipe.sbatch
```
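For orientation, the `#SBATCH` header of such a script typically looks like the following; the values here are placeholders, not the ones shipped in the script:

```bash
#SBATCH --job-name=tr11-176B-ml   # placeholder values -- adjust to your needs
#SBATCH --nodes=16                # number of nodes: the main knob to turn
#SBATCH --ntasks-per-node=1       # one launcher task per node
#SBATCH --gres=gpu:4              # GPUs requested per node
#SBATCH --time=23:59:59           # wall-clock limit
```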
Please always run the scripts from the `run_scripts` directory. We have not yet made them execution-location-independent.
Care needs to be taken when changing the number of GPUs per node. You also need to change the `GPUS_PER_NODE` variable accordingly, as we do not yet bother with parsing the `SLURM_GRES` value.
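In other words, these two settings have to be kept in sync by hand (the value 4 is just an example):

```bash
#SBATCH --gres=gpu:4   # GPUs requested per node from Slurm
GPUS_PER_NODE=4        # must match the --gres count above
```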
The script we currently work with, `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, is the most recent training sbatch script from the BigScience documentation repository. We patched it to match the data structure we currently use for testing; the original version from BigScience has different data structure requirements due to the many different corpora BigScience is training on.
PyTorch >= 1.11 will complain about not being able to handle some address families and tell you that sockets are invalid. This does not prevent the code from scaling with the total number of GPUs.
## Interactive Usage
We use a certain environment setup to handle our software stack. To work interactively, please activate the environment like this:

```bash
cd run_scripts
source activate.bash
```

This will load the modules we use, activate the Python virtual environment, and set the variables you specified in `variables.bash`.
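Once the environment is active, a quick sanity check (generic PyTorch, not something this repository provides) is to verify that the GPUs are visible:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```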
## Supported Machines
- JUWELS Cluster
- JUWELS Booster
Supported means tested and that the correct CUDA compute architecture will be selected. Other machines can easily be supported by adjusting `./run_scripts/activate.bash`.
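Such an adjustment could look roughly like the sketch below, which dispatches on JSC's `SYSTEMNAME` environment variable (hypothetical; the real selection logic lives in `activate.bash` and may differ). JUWELS Booster has A100 GPUs (compute capability 8.0), while the JUWELS Cluster GPU partition has V100s (7.0):

```bash
# Hypothetical sketch -- see activate.bash for the actual logic.
case "$SYSTEMNAME" in
    juwelsbooster) export TORCH_CUDA_ARCH_LIST="8.0" ;;  # A100
    juwels)        export TORCH_CUDA_ARCH_LIST="7.0" ;;  # V100
    *) echo "Unsupported machine: $SYSTEMNAME" >&2 ;;
esac
```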
## Training with checkpoints (longer than 24 hours)
You may make use of the executable `StartLongRun.bash` script in `./run_scripts` to submit multiple jobs to the batch system; each job starts only after the previous job has finished and picks up the checkpoints that job left behind.
Have a look at `StartLongRun.bash`, see what it is doing, and change the variables accordingly; do not use it blindly! A simplified sketch of the chaining pattern follows the variable list below.
Variables that need to be set by you:

- `RUNTIME_PER_JOB`: How long each job should run. This value should be at most 23:59:59 when using the `booster` partition, because 24 hours is the maximum runtime there.
- `NUM_JOBS`: How many jobs should be submitted.
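The chaining itself typically relies on Slurm job dependencies. As a minimal sketch of the pattern (not the actual contents of `StartLongRun.bash`):

```bash
#!/usr/bin/env bash
# Sketch of sbatch dependency chaining; StartLongRun.bash sets more
# variables than shown here -- read it before use.
NUM_JOBS=4
RUNTIME_PER_JOB=23:59:59

# Submit the first job and remember its ID.
job_id=$(sbatch --parsable --time="$RUNTIME_PER_JOB" tr11-176B-ml_juwels_pipe.sbatch)

# Every further job starts only after the previous one has ended
# (--dependency=afterany) and resumes from its checkpoints.
for ((i = 1; i < NUM_JOBS; i++)); do
    job_id=$(sbatch --parsable --dependency="afterany:$job_id" \
                    --time="$RUNTIME_PER_JOB" tr11-176B-ml_juwels_pipe.sbatch)
done
```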
You can do even more runs with the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
Checkpointing happens every `SAVE_INTERVAL` iterations; this variable is set in `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If you would like this feature, let us know.)
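In the sbatch script this is just a shell variable that is passed on to the training arguments; roughly like the following (the concrete value is up to you, and we assume here that it feeds Megatron-DeepSpeed's `--save-interval` flag):

```bash
SAVE_INTERVAL=250   # write a checkpoint every 250 training iterations
# ...later passed into the training arguments, e.g.:
#   --save-interval $SAVE_INTERVAL
```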