diff --git a/README.md b/README.md
index 69ec6983c32e1deb9e1627e8a746931b0be6ef08..0f997e40007fba9522f2e3928cb8aff910aeebd0 100644
--- a/README.md
+++ b/README.md
@@ -83,3 +83,15 @@ environment, and set the variables you specified in `variables.bash`.
 Supported means tested and the correct CUDA compute architecture will be
 selected. Other machines can easily be supported by adjusting
 `./run_scripts/activate.bash`.
+
+## Training with checkpoints (longer than 24 hours)
+You can use the executable `StartLongRun.bash` script in `./run_scripts` to submit multiple jobs to the batch system. Each job starts only after the previous job has finished and resumes from the checkpoints that job left behind.
+Have a look at `StartLongRun.bash`, see what it does, and change the variables accordingly; do not use it blindly!
+
+Variables that need to be set by you:
+- `RUNTIME_PER_JOB`: How long each job should run. This value should be at most 23:59:59 when using the `booster` partition, because 24 hours is the maximum runtime there.
+- `NUM_JOBS`: How many jobs to submit.
+
+You can continue with further runs from the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
+
+Checkpoints are saved every `SAVE_INTERVAL` iterations; this variable is set in `run_scripts/tr1-13B-round1_juwels_pipe.sbatch`. No checkpoint is saved automatically at the end of the job runtime. (If you would like this feature, let us know.)
\ No newline at end of file
diff --git a/run_scripts/tr1-13B-round1_juwels_pipe.sbatch b/run_scripts/tr1-13B-round1_juwels_pipe.sbatch
index ee048e6811c1828ec461e80ed3e15609122550fd..09192640f06bfc82f01eb2b09de48f2599b55b26 100644
--- a/run_scripts/tr1-13B-round1_juwels_pipe.sbatch
+++ b/run_scripts/tr1-13B-round1_juwels_pipe.sbatch
@@ -6,7 +6,6 @@
 #SBATCH --hint=nomultithread         # we get physical cores not logical
 #SBATCH --gres=gpu:4                 # number of gpus
 #SBATCH --time=00:10:00              # maximum execution time (HH:MM:SS)
-##SBATCH --time=${RUNTIME_PER_JOB:"-00:10:00"}              # maximum execution time (HH:MM:SS)
 #SBATCH --output=%x-%j.out           # output file name
 #SBATCH --account=opengptx-elm
 # Use `develbooster` for debugging, `booster` for "normal" jobs, and