OpenGPT-X with BigScience

    This repository contains set-up code for working on JSC machines with the BigScience codebase.

    Aims

    We aim to support setting up your own, specialized working environment. This means we do not pin versions or commits; that has to be done at the individual or subgroup level.

    Getting Started

    Setting Up

    We assume you have already set up your environment on the supercomputer. If you have not, please see the "Getting started at JSC" guide.

    First off, clone this repository to a destination of your choice. Adjust variables in ./run_scripts/variables.bash and execute:

    cd run_scripts
    nice bash set_up.bash

    For unclear reasons, Bash sometimes mis-parses the script and it breaks. If that happens, simply run it again.
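    If you want to automate the "just try again" advice, a small retry helper can wrap the set-up call. This is a sketch of our own, not part of the repository; the `retry` function name is an illustrative choice:

```shell
# Retry helper: run a command up to N times, returning as soon as it succeeds.
retry() {
    local max_attempts=$1; shift
    local attempt
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        "$@" && return 0
        echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    done
    return 1
}

# Usage during set-up (run from run_scripts/):
# retry 3 nice bash set_up.bash
```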

    Starting Training

    In ./run_scripts/tr11-176B-ml_juwels_pipe.sbatch, adjust the #SBATCH variables on top as desired (most interesting is the number of --nodes) and execute:

    cd run_scripts
    sbatch tr11-176B-ml_juwels_pipe.sbatch
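    For orientation, the `#SBATCH` header of such a script typically looks like the following. All values here are illustrative placeholders, not the settings shipped in `tr11-176B-ml_juwels_pipe.sbatch`; check the actual file before submitting:

```shell
#!/bin/bash
# Illustrative header only -- these values are placeholders, not the
# settings shipped in tr11-176B-ml_juwels_pipe.sbatch.
#SBATCH --job-name=tr11-176B-ml
#SBATCH --nodes=16            # usually the most interesting knob
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4          # GPUs per node
#SBATCH --time=23:59:59
#SBATCH --partition=booster   # placeholder partition name
```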

    Please always run the scripts from the run_scripts directory. We have not yet made them execution-location-independent.

    Take care when changing the number of GPUs per node: you also need to change the GPUS_PER_NODE variable accordingly, as we do not yet parse the SLURM_GRES value.
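    If you do want to derive GPUS_PER_NODE automatically, a minimal sketch could look like the following. It assumes SLURM_GRES holds a string of the common `gpu:N` or `gpu:type:N` form; the parsing function is our own illustration, not code from the repository:

```shell
# Sketch: derive GPUS_PER_NODE from SLURM_GRES instead of setting it by hand.
# Assumes SLURM_GRES looks like "gpu:4" or "gpu:a100:4".
parse_gpus_per_node() {
    local gres=$1
    # Take everything after the last colon, e.g. "gpu:a100:4" -> "4".
    echo "${gres##*:}"
}

# GPUS_PER_NODE=$(parse_gpus_per_node "$SLURM_GRES")
```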

    The script we currently work with, ./run_scripts/tr11-176B-ml_juwels_pipe.sbatch, is the most recent training sbatch script from the BigScience documentation repository. We patched it to match the data structure we currently use for testing; the original BigScience version has different data structure requirements due to the many different corpora BigScience trains on.

    PyTorch >= 1.11 will complain about not being able to handle some address families and report invalid sockets. This does not prevent the code from scaling with the total number of GPUs.

    Interactive Usage

    We use a certain environment setup to handle our software stack. To work interactively, please activate the environment like this:

    cd run_scripts
    source activate.bash

    This will load the modules we use, activate the Python virtual environment, and set the variables you specified in variables.bash.

    Supported Machines

    • JUWELS Cluster
    • JUWELS Booster

    Supported means the machine is tested and the correct CUDA compute architecture is selected automatically. Other machines can easily be supported by adjusting ./run_scripts/activate.bash.
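    As a sketch, per-machine architecture selection in an activation script can be a simple case statement on the hostname. The patterns and function name below are illustrative, and the compute capabilities (V100 on the JUWELS Cluster GPU nodes, A100 on the JUWELS Booster) are stated to the best of our knowledge; check activate.bash for what is actually done:

```shell
# Sketch: pick the CUDA compute capability based on the machine name.
select_cuda_arch() {
    case $1 in
        *booster*) echo "8.0" ;;  # JUWELS Booster: A100
        *juwels*)  echo "7.0" ;;  # JUWELS Cluster GPU nodes: V100
        *) echo "unsupported machine: $1" >&2; return 1 ;;
    esac
}

# export TORCH_CUDA_ARCH_LIST=$(select_cuda_arch "$(hostname)")
```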

    Training with checkpoints (longer than 24 hours)

    You may use the executable StartLongRun.bash script in ./run_scripts to submit multiple jobs to the batch system; each job starts only after the previous one has finished and resumes from the checkpoints that job left behind. Have a look at StartLongRun.bash, understand what it does, and change the variables accordingly; do not use it blindly!

    Variables that need to be set by you:

    • RUNTIME_PER_JOB: How long each job should run. This value must be at most 23:59:59 when using the booster partition, because 24 hours is the maximum runtime there.
    • NUM_JOBS: How many jobs to submit.
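    Chaining jobs like this relies on Slurm job dependencies: each submission passes the previous job's ID via --dependency=afterany. The function below is a minimal sketch of that idea, not StartLongRun.bash itself; see the actual script for what is really run:

```shell
# Sketch: submit NUM_JOBS copies of a script, each waiting for the previous
# one to finish (in any state) via --dependency=afterany.
submit_chain() {
    local num_jobs=$1 script=$2
    local dep="" job_id i
    for (( i = 1; i <= num_jobs; i++ )); do
        # $dep is intentionally unquoted: empty on the first submission.
        job_id=$(sbatch --parsable $dep "$script") || return 1
        dep="--dependency=afterany:${job_id}"
    done
}

# submit_chain "$NUM_JOBS" tr11-176B-ml_juwels_pipe.sbatch
```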

    You can do even more runs with the saved checkpoints by editing DATA_OUTPUT_PATH in StartLongRun.bash and running the script again.

    Checkpointing happens every SAVE_INTERVAL iterations; SAVE_INTERVAL is a variable set in ./run_scripts/tr11-176B-ml_juwels_pipe.sbatch. Checkpointing does not happen automatically at the end of the job runtime. (If you would like this feature, let us know.)
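    For reference, SAVE_INTERVAL is a plain shell variable inside the sbatch script; the value below is purely illustrative, not the repository's actual setting:

```shell
# Save a checkpoint every 300 training iterations (value illustrative).
SAVE_INTERVAL=300
```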