diff --git a/README.md b/README.md
index 0f997e40007fba9522f2e3928cb8aff910aeebd0..e5728e9f593db1a8c392b605f9edd5057694dc96 100644
--- a/README.md
+++ b/README.md
@@ -32,13 +32,13 @@ breaks. Please just try again in that case.
 
 ### Starting Training
 
-In `./run_scripts/tr1-13B-round1_juwels_pipe.sbatch`, adjust the
+In `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, adjust the
 `#SBATCH` variables on top as desired (most interesting is the number
 of `--nodes`) and execute:
 
 ```shell
 cd run_scripts
-sbatch tr1-13B-round1_juwels_pipe.sbatch
+sbatch tr11-176B-ml_juwels_pipe.sbatch
 ```
 
 Please always run the scripts from the `run_scripts` directory. We
@@ -49,11 +49,11 @@ also need to change the `GPUS_PER_NODE` variable accordingly, as we do
 not yet bother with parsing the `SLURM_GRES` value.
 
 The script we currently work with,
-`./run_scripts/tr1-13B-round1_juwels_pipe.sbatch`, is the oldest
+`./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`, is the most recent
 training sbatch script from the [BigScience documentation
-repository](https://github.com/bigscience-workshop/bigscience). This
-matches the current data structure we use for testing; a newer version
-that assumes later PyTorch versions has different data structure
+repository](https://github.com/bigscience-workshop/bigscience). We
+patched this to match the current data structure we use for testing;
+the original version from BigScience has different data structure
 requirements due to the many different corpora BigScience is training
 on.
 
@@ -94,4 +94,4 @@ Variables that need to be set by you:
 
 You can do even more runs with the saved checkpoints by editing
 `DATA_OUTPUT_PATH` in `StartLongRun.bash` and running the script again.
-Checkpointing happens every `SAVE_INTERVAL` iterations, which is a variable set in `run_scripts/tr1-13B-round1_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If this is a feature you request, let us know)
\ No newline at end of file
+Checkpointing happens every `SAVE_INTERVAL` iterations, a variable set in `./run_scripts/tr11-176B-ml_juwels_pipe.sbatch`. Checkpointing does not happen automatically at the end of the job runtime. (If you would like this feature, let us know)
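The checkpointing behaviour described in the last hunk means checkpoints land only at exact multiples of `SAVE_INTERVAL`, never at job teardown. A minimal sketch of that schedule (the value `300` for `SAVE_INTERVAL` is a made-up illustration, not the value from the sbatch script):

```shell
# Print the first few iterations at which a checkpoint would be written,
# given a hypothetical SAVE_INTERVAL of 300. Note there is no entry for
# the final iteration of the job unless it happens to be a multiple.
SAVE_INTERVAL=300
for i in 1 2 3; do
    echo "checkpoint at iteration $((i * SAVE_INTERVAL))"
done
```

If a job is killed at, say, iteration 850 under this schedule, the latest usable checkpoint is the one from iteration 600.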