environment, and set the variables you specified in `variables.bash`.
Supported means that the machine has been tested and that the correct CUDA compute
architecture will be selected. Other machines can easily be supported by adjusting
`./run_scripts/activate.bash`.
## Training with checkpoints (longer than 24 hours)
You can use the executable `StartLongRun.bash` script in `./run_scripts` to submit multiple jobs to the batch system; each job starts only after the previous job has finished and resumes from the checkpoints that job left behind.
Have a look at `StartLongRun.bash`, understand what it does, and change the variables accordingly; do not use it blindly!
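To make the chaining concrete: the mechanism is a standard SLURM job dependency, where each `sbatch` call references the job id of the previous submission. The sketch below is illustrative only and not an excerpt from `StartLongRun.bash`; the choice of `afterany` and the use of `tr1-13B-round1_juwels_pipe.sbatch` as the chained script are assumptions, so check the actual script for details.

```bash
# Submit the first job and capture its numeric job id.
FIRST_JOB_ID=$(sbatch --parsable run_scripts/tr1-13B-round1_juwels_pipe.sbatch)

# Submit a second job that SLURM holds until the first one has ended;
# when it starts, the sbatch script resumes from the checkpoints on disk.
sbatch --dependency=afterany:"$FIRST_JOB_ID" run_scripts/tr1-13B-round1_juwels_pipe.sbatch
```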
Variables that need to be set by you (illustrative example values follow this list):
- `RUNTIME_PER_JOB`: how long each job should run. On the `booster` partition this should be at most `23:59:59`, since 24 hours is the maximum runtime there.
- `NUM_JOBS`: how many jobs should be submitted.
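For example (illustrative values, not defaults from the repository):

```bash
# Example settings: five chained jobs of just under 24 hours each,
# i.e. roughly five days of wall-clock training on the booster partition.
RUNTIME_PER_JOB=23:59:59
NUM_JOBS=5
```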
You can queue even more runs that continue from the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
Checkpointing happens every `SAVE_INTERVAL` iterations; this variable is set in `run_scripts/tr1-13B-round1_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of a job's runtime. (If you would like this feature, let us know.)
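For orientation, `SAVE_INTERVAL` is a shell variable in the sbatch script that ends up on the training command line, typically via Megatron's `--save-interval` flag. The snippet below is a sketch with a placeholder value, not a verbatim excerpt from `tr1-13B-round1_juwels_pipe.sbatch`.

```bash
# Placeholder value: a checkpoint is written every SAVE_INTERVAL training
# iterations, typically by passing "--save-interval $SAVE_INTERVAL" to the
# training command inside the sbatch script.
SAVE_INTERVAL=300
```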