Commit a1a617c2 authored by Jan Ebert's avatar Jan Ebert

Update README for most recent submission script

parent 167e0078
@@ -32,13 +32,13 @@ breaks. Please just try again in that case.
### Starting Training
In `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, adjust the
`#SBATCH` variables on top as desired (most interesting is the number
of `--nodes`) and execute:
```shell
cd run_scripts
sbatch tr11-176B-ml_juwels_pipe.sbatch
```
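For orientation, the `#SBATCH` header to adjust might look roughly like the sketch below. Everything here is an illustrative placeholder (the actual option values and names in the script may differ); `--nodes` is the one the text above calls out as most interesting:

```shell
#!/bin/bash
# Hypothetical excerpt of an sbatch header; values are placeholders.
#SBATCH --nodes=8            # the most interesting knob: number of nodes
#SBATCH --gres=gpu:4         # if changed, keep GPUS_PER_NODE in sync (see below)
#SBATCH --time=01:00:00      # job time limit
#SBATCH --partition=booster  # assumed partition name, adjust for your system
```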
Please always run the scripts from the `run_scripts` directory. We
@@ -49,11 +49,11 @@ also need to change the `GPUS_PER_NODE` variable accordingly, as we do
not yet bother with parsing the `SLURM_GRES` value.
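Because the script does not parse `SLURM_GRES`, `GPUS_PER_NODE` must be kept in sync by hand whenever the `--gres` value changes. As a hypothetical sketch of what such parsing could look like (the `gpu:4` example value is an assumption; Slurm gres strings have the count as their last colon-separated field):

```shell
# Assumed example value; in a real job this would come from Slurm.
SLURM_GRES="gpu:4"

# Strip everything up to the last colon to get the per-node GPU count.
# This also handles typed specs such as "gpu:a100:4".
GPUS_PER_NODE="${SLURM_GRES##*:}"
echo "$GPUS_PER_NODE"
```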
The script we currently work with,
`./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, is the most recent
training sbatch script from the [BigScience documentation
repository](https://github.com/bigscience-workshop/bigscience). We
patched this to match the current data structure we use for testing;
the original version from BigScience has different data structure
requirements due to the many different corpora BigScience is training
on.
@@ -94,4 +94,4 @@ Variables that need to be set by you:
You can do even more runs with the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
Checkpointing happens every `SAVE_INTERVAL` iterations, a variable set in `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If you would like this feature, let us know.)
\ No newline at end of file
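To estimate how many checkpoints a run will leave behind, divide the total iteration count by `SAVE_INTERVAL`. The values below are purely illustrative, and `TRAIN_ITERS` is a stand-in name for whatever total the script configures:

```shell
# Hypothetical values; SAVE_INTERVAL is set in the sbatch script,
# TRAIN_ITERS stands in for the configured total iteration count.
SAVE_INTERVAL=1500
TRAIN_ITERS=6000

# Integer division: one checkpoint per SAVE_INTERVAL iterations.
NUM_CHECKPOINTS=$(( TRAIN_ITERS / SAVE_INTERVAL ))
echo "$NUM_CHECKPOINTS"
```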