Commit a1a617c2 authored by Jan Ebert

Update README for most recent submission script

parent 167e0078
@@ -32,13 +32,13 @@ breaks. Please just try again in that case.
### Starting Training
-In `./run_scripts/tr1-13B-round1_juwels_pipe.sbatch`, adjust the
+In `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, adjust the
`#SBATCH` variables on top as desired (most interesting is the number
of `--nodes`) and execute:
```shell
cd run_scripts
-sbatch tr1-13B-round1_juwels_pipe.sbatch
+sbatch tr11-176B-ml_juwels_pipe.sbatch
```
Please always run the scripts from the `run_scripts` directory. We
@@ -49,11 +49,11 @@ also need to change the `GPUS_PER_NODE` variable accordingly, as we do
not yet bother with parsing the `SLURM_GRES` value.
The script we currently work with,
-`./run_scripts/tr1-13B-round1_juwels_pipe.sbatch`, is the oldest
+`./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, is the most recent
training sbatch script from the [BigScience documentation
-repository](https://github.com/bigscience-workshop/bigscience). This
-matches the current data structure we use for testing; a newer version
-that assumes later PyTorch versions has different data structure
+repository](https://github.com/bigscience-workshop/bigscience). We
+patched this to match the current data structure we use for testing;
+the original version from BigScience has different data structure
requirements due to the many different corpora BigScience is training
on.
@@ -94,4 +94,4 @@ Variables that need to be set by you:
You can do even more runs with the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
-Checkpointing happens every `SAVE_INTERVAL` iterations, which is a variable set in `run_scripts/tr1-13B-round1_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If this is a feature you request, let us know)
\ No newline at end of file
+Checkpointing happens every `SAVE_INTERVAL` iterations, which is a variable set in `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If this is a feature you request, let us know)
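
Note on the `GPUS_PER_NODE` context in the second hunk: the README says the variable has to be changed by hand because the `SLURM_GRES` value is not parsed yet. A minimal sketch of what such parsing could look like follows. It is not part of this commit; the helper name `gpus_from_gres` is hypothetical, and it assumes the value has the usual GRES form `gpu:N` or `gpu:TYPE:N`, possibly inside a comma-separated list. The fallback `gpu:4` is only a placeholder for when the variable is unset.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the repository): derive a GPU count from a
# GRES string such as "gpu:4" or "gpu:a100:4".
gpus_from_gres() {
    local gres="$1" entry count
    local IFS=','
    # Split on commas and look for the first GPU entry.
    for entry in $gres; do
        case "$entry" in
            gpu|gpu:*)
                # The count, if present, is the last colon-separated field.
                count="${entry##*:}"
                # "gpu" or "gpu:TYPE" carry no number; assume a count of 1.
                case "$count" in
                    ''|*[!0-9]*) count=1 ;;
                esac
                echo "$count"
                return 0
                ;;
        esac
    done
    # No GPU entry found.
    return 1
}

# Assumption: SLURM_GRES is exported in the job environment; "gpu:4" is a
# placeholder default, not a value taken from the sbatch script.
GPUS_PER_NODE="$(gpus_from_gres "${SLURM_GRES:-gpu:4}")"
echo "GPUS_PER_NODE=$GPUS_PER_NODE"
```

If something like this were adopted, the manual `GPUS_PER_NODE` edit would only be needed when the GRES string does not list a GPU entry, in which case the helper returns a non-zero status and prints nothing.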