environment, and set the variables you specified in `variables.bash`.
Supported means that the machine has been tested and that the correct CUDA compute
architecture will be selected. Other machines can easily be supported by adjusting
`./run_scripts/activate.bash`.
## Training with checkpoints (longer than 24 hours)
You can use the executable `StartLongRun.bash` script in `./run_scripts` to submit multiple jobs to the batch system; each job starts only after the previous job has finished and resumes from the checkpoints that job left behind.
Have a look at `StartLongRun.bash`, understand what it does, and change the variables accordingly; do not use it blindly!
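To make the chaining concrete: the mechanism is a standard SLURM job dependency, where each `sbatch` call references the job id of the previous submission. The sketch below is illustrative only and not an excerpt from `StartLongRun.bash`; the choice of `afterany` and the use of `tr1-13B-round1_juwels_pipe.sbatch` as the chained script are assumptions, so check the actual script for details.

```bash
# Submit the first job and capture its numeric job id.
FIRST_JOB_ID=$(sbatch --parsable run_scripts/tr1-13B-round1_juwels_pipe.sbatch)

# Submit a second job that SLURM holds until the first one has ended;
# when it starts, the sbatch script resumes from the checkpoints on disk.
sbatch --dependency=afterany:"$FIRST_JOB_ID" run_scripts/tr1-13B-round1_juwels_pipe.sbatch
```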
Variables that need to be set by you (illustrative example values follow this list):
- `RUNTIME_PER_JOB`: how long each job should run. On the `booster` partition this should be at most `23:59:59`, since 24 hours is the maximum runtime there.
- `NUM_JOBS`: how many jobs should be submitted.
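For example (illustrative values, not defaults from the repository):

```bash
# Example settings: five chained jobs of just under 24 hours each,
# i.e. roughly five days of wall-clock training on the booster partition.
RUNTIME_PER_JOB=23:59:59
NUM_JOBS=5
```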
You can queue even more runs that continue from the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
Checkpointing happens every `SAVE_INTERVAL` iterations; this variable is set in `run_scripts/tr1-13B-round1_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of a job's runtime. (If you would like this feature, let us know.)
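For orientation, `SAVE_INTERVAL` is a shell variable in the sbatch script that ends up on the training command line, typically via Megatron's `--save-interval` flag. The snippet below is a sketch with a placeholder value, not a verbatim excerpt from `tr1-13B-round1_juwels_pipe.sbatch`.

```bash
# Placeholder value: a checkpoint is written every SAVE_INTERVAL training
# iterations, typically by passing "--save-interval $SAVE_INTERVAL" to the
# training command inside the sbatch script.
SAVE_INTERVAL=300
```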