Commit 470184c0 authored by Carolin Penke

Extended README for using StartLongRun.bash

Deleted an unused, confusing commented-out line
parent 7128f3ca
1 merge request: !1 changed paths to opengptx-elm and added StartLongRun.bash to start multiple...
@@ -83,3 +83,15 @@ environment, and set the variables you specified in `variables.bash`.
Supported means tested and the correct CUDA compute architecture will
be selected. Other machines can easily be supported by adjusting
`./run_scripts/activate.bash`.
## Training with checkpoints (longer than 24 hours)
You can use the executable script `StartLongRun.bash` in `./run_scripts` to submit multiple jobs to the batch system. Each job starts only after the previous one has finished and resumes from the checkpoints that job left behind.
Read `StartLongRun.bash` to see what it does and adjust the variables accordingly. Do not use it blindly!
Variables you need to set:
- `RUNTIME_PER_JOB`: How long each job should run. On the `booster` partition this must be at most 23:59:59, because the maximum runtime there is 24 hours.
- `NUM_JOBS`: How many jobs to submit.
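The chaining described above could be sketched roughly as follows. This is not the contents of `StartLongRun.bash`, only a minimal illustration under assumptions: it assumes Slurm's `sbatch` with its `--parsable`, `--time`, and `--dependency` options, uses `afterany` so that the next job starts even when the previous one was stopped at its time limit, and introduces the hypothetical names `DRY_RUN`, `SBATCH_SCRIPT`, `submit`, and `submit_chain`. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only; the real StartLongRun.bash may work differently.
RUNTIME_PER_JOB="${RUNTIME_PER_JOB:-23:59:59}"  # at most 23:59:59 on `booster`
NUM_JOBS="${NUM_JOBS:-3}"
# Assumption: this is the job script that gets submitted repeatedly.
SBATCH_SCRIPT="${SBATCH_SCRIPT:-run_scripts/tr1-13B-round1_juwels_pipe.sbatch}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print commands only

LAST_JOB_ID=""
FAKE_ID=1000              # placeholder job IDs used in dry-run mode

# Submit one job and store its ID in LAST_JOB_ID.
submit() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "sbatch --parsable $*"
        FAKE_ID=$((FAKE_ID + 1))
        LAST_JOB_ID="$FAKE_ID"
    else
        LAST_JOB_ID="$(sbatch --parsable "$@")" || return 1
        LAST_JOB_ID="${LAST_JOB_ID%%;*}"   # drop an optional ";cluster" suffix
    fi
}

# Submit NUM_JOBS jobs; every job after the first waits for its predecessor.
submit_chain() {
    local i dep_args
    LAST_JOB_ID=""
    for ((i = 1; i <= NUM_JOBS; i++)); do
        dep_args=()
        if [ -n "$LAST_JOB_ID" ]; then
            dep_args=(--dependency="afterany:${LAST_JOB_ID}")
        fi
        submit --time="$RUNTIME_PER_JOB" "${dep_args[@]}" "$SBATCH_SCRIPT" || return 1
    done
}

submit_chain
```

Passing `--time` on the command line takes precedence over the `#SBATCH --time` directive inside the job script, so the runtime can be controlled from one place in a wrapper like this.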
You can do even more runs with the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
Checkpoints are saved every `SAVE_INTERVAL` iterations; this variable is set in `run_scripts/tr1-13B-round1_juwels_pipe.sbatch`. No checkpoint is saved automatically when a job reaches the end of its runtime. (If you would like this feature, let us know.)
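Because no checkpoint is written when a job ends, the iterations since the last multiple of `SAVE_INTERVAL` are repeated by the next job. The small sketch below illustrates that arithmetic. The helper names `last_checkpoint_iter` and `lost_iters` and all numbers are made up for illustration and do not come from the repository.

```shell
#!/usr/bin/env bash
# Illustrative value only; the real one is set in the sbatch script.
SAVE_INTERVAL=300

# $1 = number of iterations completed when the job was stopped.
# Prints the iteration of the most recent checkpoint.
last_checkpoint_iter() {
    echo $(( ($1 / SAVE_INTERVAL) * SAVE_INTERVAL ))
}

# Prints how many iterations the next job has to redo.
lost_iters() {
    echo $(( $1 % SAVE_INTERVAL ))
}

last_checkpoint_iter 1000   # prints 900
lost_iters 1000             # prints 100
```

When choosing `RUNTIME_PER_JOB` and `SAVE_INTERVAL`, it may help to pick an interval small enough that the redone work per job stays acceptable.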
\ No newline at end of file
@@ -6,7 +6,6 @@
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --gres=gpu:4 # number of gpus
#SBATCH --time=00:10:00 # maximum execution time (HH:MM:SS)
##SBATCH --time=${RUNTIME_PER_JOB:"-00:10:00"} # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --account=opengptx-elm
# Use `develbooster` for debugging, `booster` for "normal" jobs, and