author: Alexandre Strube
title: Deep Learning on Supercomputers
# subtitle: A primer in supercomputers
date: November 13, 2024
---
## Resources

---
## Team
---
## Goal for this talk
- Show how DL workloads are distributed
- On a multi-GPU, multi-node system
- Like a supercomputer
- Important: This is an overview, NOT a basic AI course!
- We have introductory courses on AI on supercomputers
---
## Slides on your own computer
Please access them now, so you can follow along:

https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc

---
## Git clone this repository
- All slides and source code
- Connect to the supercomputer and do this:

```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
```
---
## Deep learning is...

---
## Matrix Multiplication Recap

---
## Parallelism in a GPU
- Operations on each element of the matrix are independent
- Ergo, each element can be computed in parallel
- Extremely good for applying the same operation to a lot of data
- Parallelization is done "for free" by the ML toolkits
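
For illustration, a minimal PyTorch sketch (hypothetical sizes, assuming a CUDA GPU is available): the framework applies each elementwise operation to every entry of the matrices in parallel, with no extra code from us.

```python
import torch

# Two matrices living on the GPU (sizes are arbitrary, just for illustration)
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

c = a * b + 1.0    # elementwise multiply-add: every entry computed in parallel
d = torch.relu(c)  # same for activations
e = a @ b          # matrix multiplication is also massively parallel on the GPU
```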
---
## Are we there yet?

---
## Not quite
- Compute in one GPU is parallel, yes
---
## But what about many GPUs?
- It's when things get interesting
---
## Data Parallel

---
## Data Parallel

---
## Data Parallel - Averaging

---
## Data Parallel
### There are other approaches too, e.g.
- For the sake of completeness:
- Asynchronous Stochastic Gradient Descent
- Don't average the parameters; instead, send the updates (gradients after learning rate and momentum are applied) asynchronously
- Advantageous for slow networks
- Problem: stale gradients (the parameters may have changed while a gradient was being computed)
- The more nodes, the worse it gets
- We won't talk about it anymore
---
## Data Parallel
### There are other approaches too!
- Decentralized Asynchronous Stochastic Gradient Descent
- Updates are peer-to-peer
- The updates are heavily compressed and quantized
- Disadvantage: extra computation per minibatch, more memory needed
- WE DON'T NEED THOSE
---
## That's it for data parallel!
- Use different data for each GPU
- Everything else is the same
- Average the gradients across GPUs at each step (see the sketch below)
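
As a minimal sketch of what happens under the hood (hypothetical training loop in plain PyTorch, assuming `torch.distributed` is already initialized; fastai and PyTorch DDP do this for you, and more efficiently, by overlapping communication with the backward pass):

```python
import torch.distributed as dist

def training_step(model, optimizer, loss_fn, inputs, targets):
    # Each rank (GPU) gets a *different* minibatch; the model is identical everywhere
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Average the gradients across all GPUs before updating the weights
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()  # every rank applies the same update and stays in sync
```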
---
## Well, almost...
---
## There are more levels!

---
## Data Parallel - Multi Node

---
## Data Parallel - Multi Node

---
## Before we go further...
- Data parallel is usually good enough 👌
- If you need more than this, you should be giving this course, not me 🤷♂️
---
## Are we there yet?

---
## Model Parallel
- The model *itself* is too big to fit on a single GPU 🐋
- Each GPU holds a slice of the model 🍕
- Data moves from one GPU to the next
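
A minimal sketch of this idea (hypothetical two-GPU split in plain PyTorch, not the code used later in the demo): the first half of the network lives on one GPU, the second half on another, and the activations are moved between them.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: first half on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))  # runs on GPU 0
        x = x.to("cuda:1")              # activations move from GPU 0 to GPU 1
        return self.part2(x)            # runs on GPU 1
```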
---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## Model Parallel

---
## What's the problem here? 🧐
---
## Model Parallel
- Waste of resources
- While one GPU is working, the others are idle, waiting for the whole process to end
- [Source: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## Model Parallel - Pipelining

---
## This is an oversimplification!
- Actually, you split the input minibatch into multiple microbatches
- There's still idle time - an unavoidable "bubble" 🫧
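
An illustrative sketch of the micro-batching idea (hypothetical two-stage split over two GPUs, plain PyTorch; real libraries such as `torch.distributed.pipelining` or GPipe also schedule the backward pass and the optimizer step for you):

```python
import torch

def pipelined_forward(stage0, stage1, batch, n_micro=4):
    """Forward pass of a two-stage pipeline: stage0 on cuda:0, stage1 on cuda:1."""
    micro = batch.chunk(n_micro)  # split the minibatch into micro-batches
    hidden, outputs = [None] * n_micro, [None] * n_micro
    for i in range(n_micro + 1):
        if i < n_micro:
            # GPU 0 processes micro-batch i and ships its activations to GPU 1
            hidden[i] = stage0(micro[i].to("cuda:0")).to("cuda:1", non_blocking=True)
        if i > 0:
            # meanwhile, GPU 1 processes the activations of micro-batch i-1
            outputs[i - 1] = stage1(hidden[i - 1])
    # The first and last iterations keep only one GPU busy: that's the bubble.
    return torch.cat(outputs)
```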
---
## Are we there yet?

---
## Model Parallel - Multi Node
- In this case, each node does the same as the others.
- At each step, they all synchronize their weights.
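
A hedged sketch of one way this combination looks in PyTorch (assumptions: one process per node, `torch.distributed` already initialized by the launcher, and the hypothetical `TwoGPUModel` from the earlier model-parallel sketch): the model is split over the GPUs inside each node, and `DistributedDataParallel` keeps the copies in sync across nodes at every step.

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumption: the process group is already initialized (one process per node),
# and TwoGPUModel (earlier sketch) holds its layers on cuda:0 and cuda:1.
model = TwoGPUModel()
ddp_model = DDP(model)  # no device_ids for a multi-device module: DDP then
                        # synchronizes the whole model's gradients across nodes
```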
---
## Model Parallel - Multi Node

---
## Model Parallel - going bigger
- You can also have layers spread over multiple GPUs
- One can even pipeline across nodes...
---
## Recap
- Data parallelism:
  - Split the data over multiple GPUs
  - Each GPU runs the whole model
  - The gradients are averaged at each step
- Data parallelism, multi-node:
  - Same, but gradients are averaged across nodes
- Model parallelism:
  - Split the model over multiple GPUs
  - Each GPU does the forward/backward pass
  - The gradients are averaged at the end
- Model parallelism, multi-node:
  - Same, but gradients are averaged across nodes
---
## Are we there yet?

---
## If you haven't done so, please access the slides and clone the repository:

```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
```

---
## DEMO TIME!
- Let's take a simple model
- Run it "serially" (single-GPU)
- We make it data parallel among multiple GPUs in one node
- Then we make it multi-node!
---
## Expected imports

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
```
---
## Bringing your data in*

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320)
```
---
## Loading your data

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)
```
---
## Single-GPU code

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()

learn.fine_tune(6)
```
---
## Venv_template
- EXTREMELY useful thing to keep your software in order
- Make a venv with the correct supercomputer modules
- Only add new requirements
- Link to gitlab repo

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```

- Add this to sc_venv_template/requirements.txt:

```python
# Add here the pip packages you would like to install on this virtual environment / kernel
pip
fastai==2.7.15
scipy==1.11.1
matplotlib==3.7.2
scikit-learn==1.3.1
pandas==2.0.3
torch==2.1.2
accelerate
```

```bash
sc_venv_template/setup.sh
source sc_venv_template/activate.sh
```

- Done! You installed everything you need
---
## Submission Script
```bash
#!/bin/bash
#SBATCH --account=training2436
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00
#SBATCH --partition=dc-gpu
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
# This loads modules and python packages
source sc_venv_template/activate.sh
# Run the demo
time srun python serial.py
```
---
## Download dataset
- Compute nodes have no internet access
- We need to download the dataset on the login node

---
## Download dataset

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py
```

(Some warnings)

```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
```

- It started training, on the login node's CPUs (WRONG!!!)
- That means we have the data!
- We just cancel it with Ctrl+C
---
## Running it

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
sbatch serial.slurm
```

- On JUWELS Booster, it should take about 5 minutes
- On a CPU system this would take half a day
- Check the out-serial.XXXXXX and err-serial.XXXXXX files
---
## Going data parallel
- Almost the same code as before; let's look at the differences
---
## Data parallel
```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()

with learn.distrib_ctx():
    learn.fine_tune(6)
```
---
## Data Parallel
### What changed?

It was:
```python
path = untar_data(URLs.IMAGEWOOF_320)
```

Became:
```python
path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
```

It was:
```python
learn.fine_tune(6)
```

Became:
```python
with learn.distrib_ctx():
    learn.fine_tune(6)
```
---
## Submission script: data parallel
- Please check the course repository: src/distrib.slurm
- Main differences:

```bash
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
```
---
## Let's check the outputs!
#### Single gpu:
```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.249933 2.152813 0.225757 0.750573 01:11
epoch train_loss valid_loss accuracy top_k_accuracy time
0 1.882008 1.895813 0.324510 0.832018 00:44
1 1.837312 1.916380 0.374141 0.845253 00:44
2 1.717144 1.739026 0.378722 0.869941 00:43
3 1.594981 1.637526 0.417664 0.891575 00:44
4 1.460454 1.410519 0.507254 0.920336 00:44
5 1.389946 1.304924 0.538814 0.935862 00:43
real 5m44.972s
```

#### Multi gpu:
```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.201540 2.799354 0.202950 0.662513 00:09
epoch train_loss valid_loss accuracy top_k_accuracy time
0 1.951004 2.059517 0.294761 0.781282 00:08
1 1.929561 1.999069 0.309512 0.792981 00:08
2 1.854629 1.962271 0.314344 0.840285 00:08
3 1.754019 1.687136 0.404883 0.872330 00:08
4 1.643759 1.499526 0.482706 0.906409 00:08
5 1.554356 1.450976 0.502798 0.914547 00:08
real 1m19.979s
```
---
## Some insights
- The distributed run suffered a bit on accuracy and loss, in exchange for speed
- Train a bit longer and you're good!
- It's more than 4x faster because the library is multi-threaded (and now we use 48 threads)
- I/O is automatically parallelized / sharded by the Fast.AI library
- Data parallel is a simple and effective way to distribute DL workloads
- This is really just a primer - there's much more to it
- I/O plays a HUGE role on supercomputers, for example
---
## Multi-node
- Simply change

```bash
#SBATCH --nodes=2
```

in the submission file - THAT'S IT!

---
## Multi-node

```bash
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.242036 2.192690 0.201728 0.681148 00:10
epoch train_loss valid_loss accuracy top_k_accuracy time
0 2.035004 2.084082 0.246189 0.748984 00:05
1 1.981432 2.054528 0.247205 0.764482 00:05
2 1.942930 1.918441 0.316057 0.821138 00:05
3 1.898426 1.832725 0.370173 0.839431 00:05
4 1.859066 1.781805 0.375508 0.858740 00:05
5 1.820968 1.743448 0.394055 0.864583 00:05
real 1m15.651s
```
---
## Some insights
- It's faster per epoch, but not by much (5 seconds vs 8 seconds)
- Accuracy and loss suffered
- This is a very simple model, so that's not surprising
- It fits into 4 GB, and we "stretched" it onto a 320 GB system
- It's not a good fit for this system
- You need bigger models to really exercise the GPUs and the scaling
- There's a lot more to it, but for now let's focus on medium/big-sized models
- For Gigantic and Humongous-sized models, there's a DL scaling course at JSC!
---
## That's all folks!
- Thanks for listening!
- Questions?
---
## References
- [Pytorch Model Parallelism and Pipelining](https://pytorch.org/docs/stable/distributed.pipelining.html)
- [Intro to Distributed Deep Learning](https://xiandong79.github.io/Intro-Distributed-Deep-Learning)
- [Model Parallelism - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html)