Alexandre Strube / 2023-nov-intro-to-supercompting-jsc / Commits / 730f918c

Commit 730f918c, authored 1 year ago by Alexandre Strube

Final fixes to the code

Parent: a2aff205
Pipeline #165855 passed (stages: test, deploy)
Showing 3 changed files with 230 additions and 146 deletions:

- 01-deep-learning-on-supercomputers.md: 57 additions, 11 deletions
- public/01-deep-learning-on-supercomputers.html: 172 additions, 134 deletions
- src/serial.slurm: 1 addition, 1 deletion
01-deep-learning-on-supercomputers.md (+57 −11)

@@ -394,10 +394,16 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompti

---

## If you haven't done so, please access the slides to clone the repository:

```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompting-jsc.git
```

---

## DEMO TIME!
@@ -434,13 +440,13 @@ from fastai.vision.models.xresnet import *

---

## Bringing your data in

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320)
```

@@ -457,7 +463,6 @@ path = untar_data(URLs.IMAGEWOOF_320)

---

## Loading your data
@@ -515,17 +520,21 @@ learn.fine_tune(6)

- Only add new requirements
- [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)

```bash
cd $HOME/2023-nov-intro-to-supercompting-jsc/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```

- Add this to sc_venv_template/requirements.txt:

```
fastai
deepspeed
accelerate
```

- Run the setup and activate the environment:

```bash
sc_venv_template/setup.sh
source sc_venv_template/activate.sh
```

- Done! You installed everything you need

---
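Not part of the course material, but a quick sanity check along these lines can confirm the environment actually provides the new requirements (it assumes the venv was already activated with `source sc_venv_template/activate.sh`):

```shell
# Hypothetical sanity check: try importing each freshly installed package.
# Prints "ok" or "MISSING" per package instead of failing outright.
for pkg in fastai deepspeed accelerate; do
  python -c "import $pkg" 2>/dev/null && echo "$pkg ok" || echo "$pkg MISSING"
done
```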
@@ -533,7 +542,7 @@ accelerate

## Submission Script

```bash
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
...
```
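The hunk above shows only the top of the script. A sketch of how such a serial job script plausibly continues; the task/GPU counts, time limit, and output file names are assumptions (the `out-serial`/`err-serial` patterns follow the files mentioned later in these slides, and the final line appears in the diff context):

```shell
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1        # assumption: a single task for the serial run
#SBATCH --gres=gpu:1               # assumption: one GPU is enough here
#SBATCH --output=out-serial-%j     # assumption, matching the out-serial-XXXXXX files
#SBATCH --error=err-serial-%j      # assumption, matching the err-serial-XXXXXX files
#SBATCH --time=00:20:00            # assumption

cd "$HOME"/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
time srun python serial.py
```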
@@ -556,6 +565,33 @@ time srun python serial.py

---

## Download dataset

- Compute nodes have no internet access
- We need to download the dataset beforehand

---

## Download dataset

```bash
cd $HOME/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py
```

```
(Some warnings)
epoch  train_loss  valid_loss  accuracy  top_k_accuracy  time
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
```

- It started training on the login node's CPUs (WRONG!!!)
- That means we have the data!
- We just cancel it with Ctrl+C
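As an alternative to starting a run and cancelling it, one can look for the cached dataset directly. This sketch assumes fastai's default cache location of `~/.fastai/data` (an assumption about the default config; the folder name is a guess based on the IMAGEWOOF_320 URL):

```python
from pathlib import Path

# Assumption: fastai caches untar_data downloads under ~/.fastai/data by default.
data_dir = Path.home() / ".fastai" / "data"
cached = sorted(p.name for p in data_dir.glob("imagewoof*")) if data_dir.exists() else []
if cached:
    print("dataset cached:", ", ".join(cached))
else:
    print("dataset not found - run the download on a login node first")
```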
---

## Running it

@@ -564,12 +600,13 @@ sbatch serial.slurm

```bash
sbatch serial.slurm
```

- On JUWELS Booster, this should take about 5 minutes
- On a CPU system, this would take half a day
- Check the out-serial-XXXXXX and err-serial-XXXXXX files
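After `sbatch`, the job can be watched while it waits and runs. A small sketch using standard Slurm and coreutils commands (the output file pattern is the one from the bullet above), written defensively so it also runs on machines without Slurm:

```shell
# Show my jobs in the queue, if Slurm is available on this machine.
command -v squeue >/dev/null && squeue --me || echo "squeue not available here"

# Once the job is running, its output can be followed with e.g.:
#   tail -f out-serial-*
# (left commented out: it only makes sense on the cluster itself)
```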
---

## Going data parallel

- Almost the same code as before; let's show the differences
---

@@ -630,6 +667,13 @@ with learn.distrib_ctx():

- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm)
- Main differences:

```bash
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
```

---

## Let's check the outputs!
@@ -670,6 +714,7 @@ real 1m19.979s

- The distributed run suffered a bit on accuracy 🎯 and loss 😩
- In exchange for speed 🏎️
- Train a bit longer and you're good!
- It's more than 4x faster because the library is multi-threaded (and now we use 48 threads)
- I/O is automatically parallelized / sharded by the Fast.AI library
- Data parallel is a simple and effective way to distribute DL workloads 💪
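The bullets above can be made concrete with a toy illustration of what data parallelism does (a pure-Python sketch of the idea, not fastai's actual implementation): each worker computes the gradient on its own shard of the global batch, the gradients are averaged (an all-reduce), and every worker applies the same update, which is mathematically equivalent to one large-batch step:

```python
# Fit y = w*x with mean-squared error: one SGD step across two "GPUs".
def worker_gradient(shard, w):
    # d/dw of mean((w*x - y)^2) over this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # stand-in for the collective communication step
    return sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]                     # global batch split across workers
w = 0.0
grads = [worker_gradient(s, w) for s in shards]   # computed in parallel
g = all_reduce_mean(grads)                        # averaged across workers
w -= 0.1 * g                                      # identical update everywhere

# Same result as a single worker seeing the whole batch (equal shard sizes):
assert abs(g - worker_gradient(data, 0.0)) < 1e-12
print(w)  # -> 3.0
```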
@@ -708,6 +753,7 @@ real 1m15.651s

- Accuracy and loss suffered
- This is a very simple model, so that's not surprising
- It fits into 4 GB; we "stretched" it onto a 320 GB system
- It's not a good fit for this system
- You need bigger models to really exercise the GPU and scaling
- There's a lot more to this, but for now let's focus on medium/big-sized models
- For gigantic and humongous-sized models, there's a DL scaling course at JSC!
public/01-deep-learning-on-supercomputers.html (+172 −134)

(diff collapsed)
src/serial.slurm (+1 −1)

```diff
-#!/bin/bash -x
+#!/bin/bash
 #SBATCH --account=training2334
 #SBATCH --nodes=1
 #SBATCH --job-name=ai-serial
```