@@ -32,13 +32,13 @@ breaks. Please just try again in that case.
### Starting Training
In `./run_scripts/tr1-13B-round1_juwels_pipe.sbatch`, adjust the
In `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, adjust the
`#SBATCH` variables on top as desired (most interesting is the number
of `--nodes`) and execute:
```shell
cd run_scripts
sbatch tr1-13B-round1_juwels_pipe.sbatch
sbatch tr11-176B-ml_juwels_pipe.sbatch
```
Please always run the scripts from the `run_scripts` directory. We
...
...
@@ -49,11 +49,11 @@ also need to change the `GPUS_PER_NODE` variable accordingly, as we do
not yet bother with parsing the `SLURM_GRES` value.
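
If such parsing is added later, a minimal sketch could look like the following. It assumes the GRES value has a form such as `gpu:4` or `gpu:a100:4`, with the GPU count as the last colon-separated field. The helper name `gpus_from_gres` and the fallback value are hypothetical and not part of the run scripts.

```shell
# Hypothetical helper, not part of the run scripts.
# Assumes a GRES string such as "gpu:4" or "gpu:a100:4",
# where the GPU count is the last colon-separated field.
gpus_from_gres() {
    gres="$1"
    count="${gres##*:}"    # strip everything up to the last colon
    case "$count" in
        ''|*[!0-9]*) return 1 ;;          # not a plain number: refuse to guess
        *) printf '%s\n' "$count" ;;
    esac
}

# Fall back to a manually chosen value (placeholder: 4) if parsing fails.
GPUS_PER_NODE="$(gpus_from_gres "${SLURM_GRES:-}")" || GPUS_PER_NODE=4
```

Refusing to guess on an unexpected format keeps the manual setting as the single source of truth whenever the assumption about the GRES string does not hold.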
The script we currently work with,
`./run_scripts/tr1-13B-round1_juwels_pipe.sbatch`, is the oldest
`./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, is the most recent
training sbatch script from the [BigScience documentation
repository](https://github.com/bigscience-workshop/bigscience). This
matches the current data structure we use for testing; a newer version
that assumes later PyTorch versions has different data structure
repository](https://github.com/bigscience-workshop/bigscience). We
patched this to match the current data structure we use for testing;
the original version from BigScience has different data structure
requirements due to the many different corpora BigScience is training
on.
...
...
@@ -94,4 +94,4 @@ Variables that need to be set by you:
You can do even more runs with the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
Checkpointing happens every `SAVE_INTERVAL` iterations, which is a variable set in `run_scripts/tr1-13B-round1_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If this is a feature you request, let us know)
\ No newline at end of file
Checkpointing happens every `SAVE_INTERVAL` iterations, which is a variable set in `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If you would like this feature, let us know.)
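
As a rough illustration of what this means for a job that hits its time limit, the sketch below computes the last iteration that would have been checkpointed. It assumes checkpoints are written exactly at multiples of `SAVE_INTERVAL`; the helper name `last_checkpoint_iter` and the numbers are illustrative only.

```shell
# Hypothetical helper: last iteration saved before a job stopped.
# Assumes checkpoints are written exactly at multiples of SAVE_INTERVAL.
last_checkpoint_iter() {
    reached="$1"         # iteration the job had reached when it ended
    save_interval="$2"   # value of SAVE_INTERVAL
    echo $(( reached / save_interval * save_interval ))
}

# Illustrative numbers: stopping at iteration 1450 with SAVE_INTERVAL=300
# leaves iteration 1200 as the newest checkpoint, so 250 iterations are redone.
last_checkpoint_iter 1450 300
```

Under this assumption, picking a `SAVE_INTERVAL` that divides evenly into the number of iterations a job is expected to complete keeps the amount of repeated work small.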