Commit faeb5b8f authored by Alexandre Strube

Update course materials and scripts for March 2025 event

parent 37eede42
Pipeline #258654 passed
---
author: Alexandre Strube
title: Getting Started with AI on Supercomputers
title: Overview of AI distribution techniques
---
This repo is specifically for the course described in [indico](https://indico3-jsc.fz-juelich.de/event/201)
This repo is specifically for the course described in [indico](https://nxtaim.de/en/events-en/)
---
@@ -18,7 +18,7 @@ This repo is specifically for the course described in [indico](https://indico3-j
Please, fork this thing! Use it! And submit merge requests!
## Authors and acknowledgment
Alexandre Otto Strube, May 2024
Alexandre Otto Strube, March 2025
## Certificate
Human resources make them.
...
@@ -2,12 +2,12 @@
author: Alexandre Strube
title: Deep Learning on Supercomputers
# subtitle: A primer in supercomputers
date: November 13, 2024
date: March 13, 2025
---
## Resources:
- [This document](https://strube1.pages.jsc.fz-juelich.de/2024-11-talk-intro-to-supercompting-jsc)
- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc)
- [This document](https://strube1.pages.jsc.fz-juelich.de/2025-03-talk-nxtaim)
- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim)
![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
@@ -20,10 +20,7 @@ date: November 13, 2024
![Alexandre Strube](pics/alex.jpg)
::::
:::: {.col}
![Ilya Zhukov](pics/ilya.jpg)
::::
:::: {.col}
![Jolanta Zjupa](pics/jolanta.jpg)
![Sabrina Benassou](pics/sabrina.jpg)
::::
:::
@@ -37,7 +34,7 @@ date: November 13, 2024
- on a multi-gpu, multi-node system
- like a supercomputer 🤯
- Important: This is an overview, _*NOT*_ a basic AI course!
- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-4)
- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2025/ai-sc-1)
- ![](images/bringing-dl-workloads-2024-2-course.png)
![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
@@ -47,21 +44,21 @@ date: November 13, 2024
### Please access it now, so you can follow along:
[https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc](https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc)
[https://go.fzj.de/2025-03-nxtaim](https://go.fzj.de/2025-03-nxtaim)
![](images/slides.png)
---
## Git clone this repository
<!-- ## Git clone this repository
- All slides and source code
- Connect to the supercomputer and do this:
- ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git
```
---
--- -->
## Deep learning is...
@@ -369,6 +366,22 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco
---
## Fully Sharded Data Parallelism
- Shards model parameters, optimizer states and gradients across DDP ranks.
- Decomposes the all-reduce operations in DDP into separate reduce-scatter and all-gather operations (see the sketch below):
- ![](images/FSDP-graph-2a.png.webp){ width=450px }
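- A minimal sketch of this decomposition with raw `torch.distributed` collectives (an illustration, not the course code; it assumes an already-initialized process group and a gradient whose first dimension divides evenly across ranks):
- ```python
import torch
import torch.distributed as dist

def all_reduce_via_shards(grad: torch.Tensor) -> torch.Tensor:
    # Sum `grad` across all ranks via reduce-scatter + all-gather.
    world_size = dist.get_world_size()
    shards = list(grad.chunk(world_size))
    summed = torch.empty_like(shards[0])
    # Each rank receives the sum of one shard only; that alone is
    # enough for FSDP to update its own slice of the parameters...
    dist.reduce_scatter(summed, shards, op=dist.ReduceOp.SUM)
    # ...an all-gather then rebuilds the full summed gradient,
    # matching what a single all-reduce would have produced.
    out = [torch.empty_like(summed) for _ in range(world_size)]
    dist.all_gather(out, summed)
    return torch.cat(out)
```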
---
## Fully Sharded Data Parallelism
- Reduces the memory footprint of each GPU
- Increases the communication volume
- Allows for massive scaling (100000+ GPUs)
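- As an illustration (not the course code; the model and layer sizes are placeholders), wrapping a model with PyTorch's FSDP looks roughly like this:
- ```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # one process per GPU, e.g. via torchrun or srun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
model = FSDP(model)  # parameters, gradients and optimizer state get sharded
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # create after wrapping
```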
---
## Recap
- Data parallelism:
......@@ -386,18 +399,33 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco
---
## Recap
- Pipelining:
- Split the model over multiple GPUs (see the sketch after this list)
- Each GPU does a part of the forward pass
- The gradients are averaged at the end
- Pipelining, multi-node:
- Same, but gradients are averaged across nodes
- Fully Sharded Data Parallelism:
- Split the model, optimizer states and gradients across DDP ranks
- Decompose the all-reduce operations in DDP into separate reduce-scatter and all-gather operations
- Lower memory, higher communication volume
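- For the pipelining idea above, a minimal model-splitting sketch (illustrative layers only; true pipelining additionally cuts each batch into micro-batches so both GPUs work concurrently):
- ```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    # Naive split: first stage lives on cuda:0, second stage on cuda:1.
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))  # activations hop between GPUs
```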
---
## Are we there yet?
![](images/are-we-there-yet-4.gif)
---
## If you haven't done so, please access the slides to clone repository:
## You can clone the repo yourself
![](images/slides.png)
- ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git
```
@@ -518,7 +546,7 @@ learn.fine_tune(6)
- Only add new requirements
- [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)
- ```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```
- Add this to sc_venv_template/requirements.txt:
@@ -558,7 +586,7 @@ source sc_venv_template/activate.sh
#SBATCH --partition=dc-gpu
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages
source sc_venv_template/activate.sh
@@ -579,7 +607,7 @@ time srun python serial.py
## Download dataset
```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
source sc_venv_template/activate.sh
python serial.py
@@ -599,7 +627,7 @@ Epoch 1/1 : |-------------------------------------------------------------| 0.71
## Running it
- ```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
sbatch serial.slurm
```
- On JUWELS Booster, this should take about 5 minutes
@@ -669,7 +697,7 @@ with learn.distrib_ctx():
## Submission script: data parallel
- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm)
- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim/-/blob/main/src/distrib.slurm)
- Main differences:
...
@@ -4,7 +4,7 @@
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="author" content="Alexandre Strube">
<title>Getting Started with AI on Supercomputers</title>
<title>Overview of AI distribution techniques</title>
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
@@ -160,14 +160,14 @@
<div class="slides">
<section id="title-slide">
<h1 class="title">Getting Started with AI on Supercomputers</h1>
<h1 class="title">Overview of AI distribution techniques</h1>
<p class="author">Alexandre Strube</p>
</section>
<section class="slide level2">
<p>This repo is specifically for the course described in <a
href="https://indico3-jsc.fz-juelich.de/event/201">indico</a></p>
href="https://nxtaim.de/en/events-en/">indico</a></p>
</section>
<section class="slide level2">
@@ -185,7 +185,7 @@ the self-contained HTML files.</li>
</section>
<section id="authors-and-acknowledgment" class="slide level2">
<h2>Authors and acknowledgment</h2>
<p>Alexandre Otto Strube, May 2024</p>
<p>Alexandre Otto Strube, March 2025</p>
</section>
<section id="certificate" class="slide level2">
<h2>Certificate</h2>
...
Binary files changed:
public/images/FSDP-graph-2a.png.webp (added, 38 KiB)
public/images/slides.png (652 B → 457 B)
public/pics/sabrina.jpg (added, 37.7 KiB)
src/distrib.slurm:

#!/bin/bash
#SBATCH --account=training2436
#SBATCH --account=SOME_ACCOUNT
#SBATCH --nodes=1
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
@@ -7,7 +7,7 @@
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --partition=dc-gpu # on JURECA
#SBATCH --gres=gpu:4
# Without this, srun does not inherit cpus-per-task from sbatch.
@@ -23,7 +23,7 @@ export MASTER_PORT=7010
export GPUS_PER_NODE=4
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages
source sc_venv_template/activate.sh
...
src/serial.slurm:

#!/bin/bash
#SBATCH --account=training2436
#SBATCH --account=SOME_ACCOUNT
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1
@@ -7,11 +7,11 @@
#SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00
#SBATCH --partition=dc-gpu
#SBATCH --partition=dc-gpu # on JURECA
#SBATCH --gres=gpu:1
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages
source sc_venv_template/activate.sh
...