OpenGPT-X / BigScience Code
Repository graph
Branches: add_automatic_checkpoint_and_restart, main (default, protected)
Commits, newest first (graph date axis: 21 Jul to 13 Apr):
Update README for most recent submission script
main
Add 176B training script
Handle patch application fail better
Update preprocess_data.sbatch, ntasks to 1
Remove DeepSpeed commit hash specification
Fix error propagations
Fix path to patch file
Fix quitting upon success
Improve error handling
Do not `exit` from `source`d scripts
Fix DeepSpeed setup by specifying commit hash
Update remaining paths with new project name
Use patch file from repository
extended README for using StartLongJobs,bash,
changed paths to opengptx-elm and added StartLongRun.bash to start multiple runs, changes in tr1-13...sbatch, s.t. paths are only set if not already set
Merge branch 'add_automatic_checkpoint_and_restart' of https://gitlab.jsc.fz-juelich.de/opengptx/bigscience-code into add_automatic_checkpoint_and_restart
add_automatic_checkpoint_and_restart
extended README for using StartLongJobs,bash,
Delete test
lower runtime set in default
forgot to uncomment
forgot to uncomment after testing
added specific login node in tensorboard port forwarding suggestion
bugfixes
actually calling sbatch
changed paths to opengptx-elm and added StartLongRun.bash to start multiple runs, changes in tr1-13...sbatch, s.t. paths are only set if not already set
change project name
Prefer SYSTEMNAME variable to /etc/FZJ/systemname
Quit upon execution-location errors
Explain activating working environment
Ignore error from empty Git stash
Use dynamic temp directory for building DeepSpeed
Uninstall DeepSpeed without asking
Fix return value
Explain partitions
Quit when any variable is not set
Link to Megatron-DeepSpeed repository
Initial commit