author: Alexandre Strube
title: Deep Learning on Supercomputers
# subtitle: A primer in supercomputers
date: November 13, 2024
---
## Resources

---
## Team
---
## Goal for this talk
- Show how DL workloads are distributed
- On a multi-GPU, multi-node system
- Like a supercomputer
- Important: This is an overview, NOT a basic AI course!
- We have introductory courses on AI on supercomputers
---
## Slides on your own computer
Please access them now, so you can follow along:

https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc

---
## Git clone this repository
- All slides and source code
- Connect to the supercomputer and do this:

```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
```
---
## Deep learning is...

---
## Matrix Multiplication Recap

---
## Parallelism in a GPU
- Operations on each element of the matrix are independent
- Ergo, each element can be computed in parallel
- Extremely good for applying the same operation to a lot of data
- Parallelization is done "for free" by the ML toolkits
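
For illustration, a minimal PyTorch sketch (hypothetical sizes, assuming a CUDA GPU is available): the framework applies each elementwise operation to every entry of the matrices in parallel, with no extra code from us.

```python
import torch

# Two matrices living on the GPU (sizes are arbitrary, just for illustration)
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

c = a * b + 1.0    # elementwise multiply-add: every entry computed in parallel
d = torch.relu(c)  # same for activations
e = a @ b          # matrix multiplication is also massively parallel on the GPU
```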
---
## Are we there yet?

---
## Not quite
- Compute in one GPU is parallel, yes
---
## But what about many GPUs?
- It's when things get interesting
---
## Data Parallel

---
## Data Parallel

---
## Data Parallel - Averaging

---
## Data Parallel
### There are other approaches too, e.g.
- For the sake of completeness:
- Asynchronous Stochastic Gradient Descent
- Don't average the parameters; instead, send the updates (gradients after learning rate and momentum are applied) asynchronously
- Advantageous for slow networks
- Problem: stale gradients (the parameters may have changed while a gradient was being computed)
- The more nodes, the worse it gets
- We won't talk about it anymore
---
## Data Parallel
### There are other approaches too!
- Decentralized Asynchronous Stochastic Gradient Descent
- Updates are peer-to-peer
- The updates are heavily compressed and quantized
- Disadvantage: extra computation per minibatch, more memory needed
- WE DON'T NEED THOSE
---
## That's it for data parallel!
- Use different data for each GPU
- Everything else is the same
- Average the gradients across GPUs at each step (see the sketch below)
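
As a minimal sketch of what happens under the hood (hypothetical training loop in plain PyTorch, assuming `torch.distributed` is already initialized; fastai and PyTorch DDP do this for you, and more efficiently, by overlapping communication with the backward pass):

```python
import torch.distributed as dist

def training_step(model, optimizer, loss_fn, inputs, targets):
    # Each rank (GPU) gets a *different* minibatch; the model is identical everywhere
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Average the gradients across all GPUs before updating the weights
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()  # every rank applies the same update and stays in sync
```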
---
## Well, almost...
---
## There are more levels!

---
## Data Parallel - Multi Node

---
## Data Parallel - Multi Node

---
## Before we go further...
- Data parallel is usually good enough 👌
- If you need more than this, you should be giving this course, not me 🤷♂️
---
## Are we there yet?

---
## Model Parallel
- The model *itself* is too big to fit on a single GPU 🐋
- Each GPU holds a slice of the model 🍕
- Data moves from one GPU to the next
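
A minimal sketch of this idea (hypothetical two-GPU split in plain PyTorch, not the code used later in the demo): the first half of the network lives on one GPU, the second half on another, and the activations are moved between them.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: first half on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))  # runs on GPU 0
        x = x.to("cuda:1")              # activations move from GPU 0 to GPU 1
        return self.part2(x)            # runs on GPU 1
```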
---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## What's the problem here? 🧐
---
## Model Parallel
- Waste of resources
- While one GPU is working, the others are idle, waiting for the whole process to end
- [Source: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## This is an oversimplification!
- Actually, you split the input minibatch into multiple microbatches
- There's still idle time - an unavoidable "bubble" 🫧
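
An illustrative sketch of the micro-batching idea (hypothetical two-stage split over two GPUs, plain PyTorch; real libraries such as `torch.distributed.pipelining` or GPipe also schedule the backward pass and the optimizer step for you):

```python
import torch

def pipelined_forward(stage0, stage1, batch, n_micro=4):
    """Forward pass of a two-stage pipeline: stage0 on cuda:0, stage1 on cuda:1."""
    micro = batch.chunk(n_micro)  # split the minibatch into micro-batches
    hidden, outputs = [None] * n_micro, [None] * n_micro
    for i in range(n_micro + 1):
        if i < n_micro:
            # GPU 0 processes micro-batch i and ships its activations to GPU 1
            hidden[i] = stage0(micro[i].to("cuda:0")).to("cuda:1", non_blocking=True)
        if i > 0:
            # meanwhile, GPU 1 processes the activations of micro-batch i-1
            outputs[i - 1] = stage1(hidden[i - 1])
    # The first and last iterations keep only one GPU busy: that's the bubble.
    return torch.cat(outputs)
```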
---
## Are we there yet?

---
## Model Parallel - Multi Node
- In this case, each node does the same as the others.
- At each step, they all synchronize their weights.
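
A hedged sketch of one way this combination looks in PyTorch (assumptions: one process per node, `torch.distributed` already initialized by the launcher, and the hypothetical `TwoGPUModel` from the earlier model-parallel sketch): the model is split over the GPUs inside each node, and `DistributedDataParallel` keeps the copies in sync across nodes at every step.

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumption: the process group is already initialized (one process per node),
# and TwoGPUModel (earlier sketch) holds its layers on cuda:0 and cuda:1.
model = TwoGPUModel()
ddp_model = DDP(model)  # no device_ids for a multi-device module: DDP then
                        # synchronizes the whole model's gradients across nodes
```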
---
## Model Parallel - Multi Node

---
## Model Parallel - going bigger
- You can also have layers spread over multiple GPUs
- One can even pipeline across nodes...
---
## Recap
- Data parallelism:
  - Split the data over multiple GPUs
  - Each GPU runs the whole model
  - The gradients are averaged at each step
- Data parallelism, multi-node:
  - Same, but gradients are averaged across nodes
- Model parallelism:
  - Split the model over multiple GPUs
  - Each GPU does the forward/backward pass
  - The gradients are averaged at the end
- Model parallelism, multi-node:
  - Same, but gradients are averaged across nodes
---
## Are we there yet?

---
## If you haven't done so, please access the slides and clone the repository:

```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
```

---
## DEMO TIME!
- Let's take a simple model
- Run it "serially" (single-GPU)
- We make it data parallel among multiple GPUs in one node
- Then we make it multi-node!
---
## Expected imports

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
```
---
## Bringing your data in*

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320)
```
---
## Loading your data

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)
```
---
## Single-GPU code

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()

learn.fine_tune(6)
```
---
## Venv_template
- EXTREMELY useful thing to keep your software in order
- Make a venv with the correct supercomputer modules
- Only add new requirements
- Link to gitlab repo

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```

- Add this to sc_venv_template/requirements.txt:

```python
# Add here the pip packages you would like to install on this virtual environment / kernel
pip
fastai==2.7.15
scipy==1.11.1
matplotlib==3.7.2
scikit-learn==1.3.1
pandas==2.0.3
torch==2.1.2
accelerate
```

```bash
sc_venv_template/setup.sh
source sc_venv_template/activate.sh
```

- Done! You installed everything you need
---
## Submission Script
```bash
#!/bin/bash
#SBATCH --account=training2436
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00
#SBATCH --partition=dc-gpu
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
# This loads modules and python packages
source sc_venv_template/activate.sh
# Run the demo
time srun python serial.py
```
---
## Download dataset
- Compute nodes have no internet access
- We need to download the dataset on the login node

---
## Download dataset

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py
```

(Some warnings)

```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
```

- It started training, on the login node's CPUs (WRONG!!!)
- That means we have the data!
- We just cancel it with Ctrl+C
---
## Running it

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
sbatch serial.slurm
```

- On JUWELS Booster, it should take about 5 minutes
- On a CPU system this would take half a day
- Check the out-serial.XXXXXX and err-serial.XXXXXX files
---
## Going data parallel
- Almost the same code as before; let's look at the differences
---
## Data parallel
```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()

with learn.distrib_ctx():
    learn.fine_tune(6)
```
---
## Data Parallel
### What changed?

It was:
```python
path = untar_data(URLs.IMAGEWOOF_320)
```

Became:
```python
path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
```

It was:
```python
learn.fine_tune(6)
```

Became:
```python
with learn.distrib_ctx():
    learn.fine_tune(6)
```
---
## Submission script: data parallel
- Please check the course repository: src/distrib.slurm
- Main differences:

```bash
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
```
---
## Let's check the outputs!
#### Single gpu:
```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.249933 2.152813 0.225757 0.750573 01:11
epoch train_loss valid_loss accuracy top_k_accuracy time
0 1.882008 1.895813 0.324510 0.832018 00:44
1 1.837312 1.916380 0.374141 0.845253 00:44
2 1.717144 1.739026 0.378722 0.869941 00:43
3 1.594981 1.637526 0.417664 0.891575 00:44
4 1.460454 1.410519 0.507254 0.920336 00:44
5 1.389946 1.304924 0.538814 0.935862 00:43
real 5m44.972s
```

#### Multi gpu:
```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.201540 2.799354 0.202950 0.662513 00:09
epoch train_loss valid_loss accuracy top_k_accuracy time
0 1.951004 2.059517 0.294761 0.781282 00:08
1 1.929561 1.999069 0.309512 0.792981 00:08
2 1.854629 1.962271 0.314344 0.840285 00:08
3 1.754019 1.687136 0.404883 0.872330 00:08
4 1.643759 1.499526 0.482706 0.906409 00:08
5 1.554356 1.450976 0.502798 0.914547 00:08
real 1m19.979s
```
---
## Some insights
- The distributed run suffered a bit on accuracy and loss, in exchange for speed
- Train a bit longer and you're good!
- It's more than 4x faster because the library is multi-threaded (and now we use 48 threads)
- I/O is automatically parallelized / sharded by the Fast.AI library
- Data parallel is a simple and effective way to distribute DL workloads
- This is really just a primer - there's much more to it
- I/O plays a HUGE role on supercomputers, for example
---
## Multi-node
- Simply change

```bash
#SBATCH --nodes=2
```

in the submission file - THAT'S IT!

---
## Multi-node

```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.242036 2.192690 0.201728 0.681148 00:10
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.035004 2.084082 0.246189 0.748984 00:05
1 1.981432 2.054528 0.247205 0.764482 00:05
2 1.942930 1.918441 0.316057 0.821138 00:05
3 1.898426 1.832725 0.370173 0.839431 00:05
4 1.859066 1.781805 0.375508 0.858740 00:05
5 1.820968 1.743448 0.394055 0.864583 00:05
real 1m15.651s
```
---
## Some insights
- It's faster per epoch, but not by much (5 seconds vs 8 seconds)
- Accuracy and loss suffered
- This is a very simple model, so that's not surprising
- It fits into 4 GB, and we "stretched" it onto a 320 GB system
- It's not a good fit for this system
- You need bigger models to really exercise the GPUs and the scaling
- There's a lot more to it, but for now let's focus on medium/big-sized models
- For Gigantic and Humongous-sized models, there's a DL scaling course at JSC!
---
## That's all folks!
- Thanks for listening!
- Questions?
---
## References
- [Pytorch Model Parallelism and Pipelining](https://pytorch.org/docs/stable/distributed.pipelining.html)
- [Intro to Distributed Deep Learning](https://xiandong79.github.io/Intro-Distributed-Deep-Learning)
- [Model Parallelism - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html)