...

### Starting Training

In `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, adjust the
`#SBATCH` variables at the top as desired (the most interesting one is
the number of `--nodes`) and execute:

```shell
cd run_scripts
sbatch tr11-176B-ml_juwels_pipe.sbatch
```
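
For orientation, the `#SBATCH` header of such a script typically looks
something like the sketch below; all values are illustrative
placeholders rather than the ones shipped in the repository:

```shell
#!/bin/bash
#SBATCH --job-name=tr11-176B-ml   # all values here are placeholders
#SBATCH --nodes=32                # --nodes is what you will most likely change
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4              # JUWELS Booster nodes have 4 GPUs each
#SBATCH --time=24:00:00
#SBATCH --partition=booster
```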

Please always run the scripts from the `run_scripts` directory. We
...
also need to change the `GPUS_PER_NODE` variable accordingly, as we do
not yet bother with parsing the `SLURM_GRES` value.
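
The sketch below shows what such parsing could look like, assuming the
GRES specification reaches the script as a string like `gpu:4` or
`gpu:a100:4`; the `GRES_SPEC` helper variable is purely illustrative:

```shell
# Sketch only: take the GPU count from the last colon-separated field of a
# GRES string such as "gpu:4" or "gpu:a100:4". Whether SLURM exposes such a
# string to the job depends on the cluster configuration.
GRES_SPEC="${SLURM_GRES:-gpu:4}"  # fall back to 4 GPUs per node if unset
GPUS_PER_NODE="${GRES_SPEC##*:}"  # strip everything up to the last ':'
echo "Using ${GPUS_PER_NODE} GPUs per node"
```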

The script we currently work with,
`./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, is the most recent
training sbatch script from the [BigScience documentation
repository](https://github.com/bigscience-workshop/bigscience). We
patched it to match the data structure we currently use for testing;
the original version from BigScience has different data structure
requirements due to the many different corpora BigScience is training
on.

...

You can do even more runs with the saved checkpoints by editing `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.

Checkpointing happens every `SAVE_INTERVAL` iterations; this variable is set in `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If you would like this feature, let us know.)
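
As a concrete (hypothetical) illustration, continuing from a finished run could look like this; the variable names come from the scripts mentioned above, while the values are placeholders:

```shell
# Placeholder values; adjust to your project. DATA_OUTPUT_PATH lives in
# StartLongRun.bash, SAVE_INTERVAL in the sbatch script.
DATA_OUTPUT_PATH=/p/scratch/yourproject/tr11-176B-ml/run1  # previous run's checkpoints
SAVE_INTERVAL=300  # write a checkpoint every 300 iterations
```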