Commit faeb5b8f authored by Alexandre Strube

Update course materials and scripts for March 2025 event

parent 37eede42
Pipeline #258654 passed
--- ---
author: Alexandre Strube author: Alexandre Strube
title: Getting Started with AI on Supercomputers title: Overview of AI distribution techniques
--- ---
This repo is specifically for the course described in [indico](https://indico3-jsc.fz-juelich.de/event/201) This repo is specifically for the course described in [indico](https://nxtaim.de/en/events-en/)
--- ---
...@@ -18,7 +18,7 @@ This repo is specifically for the course described in [indico](https://indico3-j ...@@ -18,7 +18,7 @@ This repo is specifically for the course described in [indico](https://indico3-j
Please, fork this thing! Use it! And submit merge requests! Please, fork this thing! Use it! And submit merge requests!
## Authors and acknowledgment ## Authors and acknowledgment
Alexandre Otto Strube, May 2024 Alexandre Otto Strube, March 2025
## Certificate ## Certificate
Human resources make them. Human resources make them.
......
...@@ -2,12 +2,12 @@ ...@@ -2,12 +2,12 @@
author: Alexandre Strube author: Alexandre Strube
title: Deep Learning on Supercomputers title: Deep Learning on Supercomputers
# subtitle: A primer in supercomputers` # subtitle: A primer in supercomputers`
date: November 13, 2024 date: March 13, 2025
--- ---
## Resources: ## Resources:
- [This document](https://strube1.pages.jsc.fz-juelich.de/2024-11-talk-intro-to-supercompting-jsc) - [This document](https://strube1.pages.jsc.fz-juelich.de/2025-03-talk-nxtaim)
- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc) - [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim)
![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg) ![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
...@@ -20,10 +20,7 @@ date: November 13, 2024 ...@@ -20,10 +20,7 @@ date: November 13, 2024
![Alexandre Strube](pics/alex.jpg) ![Alexandre Strube](pics/alex.jpg)
:::: ::::
:::: {.col} :::: {.col}
![Ilya Zhukov](pics/ilya.jpg) ![Sabrina Benassou](pics/sabrina.jpg)
::::
:::: {.col}
![Jolanta Zjupa](pics/jolanta.jpg)
:::: ::::
::: :::
...@@ -37,7 +34,7 @@ date: November 13, 2024 ...@@ -37,7 +34,7 @@ date: November 13, 2024
- on a multi-gpu, multi-node system - on a multi-gpu, multi-node system
- like a supercomputer 🤯 - like a supercomputer 🤯
- Important: This is an overview, _*NOT*_ a basic AI course! - Important: This is an overview, _*NOT*_ a basic AI course!
- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-4) - We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2025/ai-sc-1)
- ![](images/bringing-dl-workloads-2024-2-course.png) - ![](images/bringing-dl-workloads-2024-2-course.png)
![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg) ![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
...@@ -47,21 +44,21 @@ date: November 13, 2024 ...@@ -47,21 +44,21 @@ date: November 13, 2024
### Please access it now, so you can follow along: ### Please access it now, so you can follow along:
[https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc](https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc) [https://go.fzj.de/2025-03-nxtaim](https://go.fzj.de/2025-03-nxtaim)
![](images/slides.png) ![](images/slides.png)
--- ---
## Git clone this repository <!-- ## Git clone this repository
- All slides and source code - All slides and source code
- Connect to the supercomputer and do this: - Connect to the supercomputer and do this:
- ```bash - ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git
``` ```
--- --- -->
## Deep learning is... ## Deep learning is...
...@@ -369,6 +366,22 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco ...@@ -369,6 +366,22 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco
--- ---
## Fully Sharded Data Parallelism
- Shards model parameters, optimizer states and gradients across DDP ranks.
- Decompose the all-reduce operations in DDP into separate reduce-scatter and all-gather operations:
- ![](images/FSDP-graph-2a.png.webp){ width=450px }
---
## Fully Sharded Data Parallelism
- Reduces the memory footprint of each GPU
- Increases the communication volume
- Allows for massive scaling (100000+ GPUs)
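
The decomposition of all-reduce into reduce-scatter and all-gather can be illustrated without any GPU. What follows is a minimal pure-Python sketch, not how PyTorch implements FSDP: `all_reduce`, `reduce_scatter` and `all_gather` are hypothetical helper functions that simulate the collectives on plain lists, one list per rank. It only shows that a reduce-scatter followed by an all-gather leaves every rank with the same result as a single all-reduce.

```python
# Toy simulation of collectives on plain Python lists (one list per rank).
# All function names are illustrative; no real communication library is used.

def all_reduce(per_rank):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(values) for values in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def reduce_scatter(per_rank):
    """Rank r ends up with only shard r of the element-wise sum."""
    world_size = len(per_rank)
    total = [sum(values) for values in zip(*per_rank)]
    shard_len = len(total) // world_size  # assumes even divisibility
    return [total[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# Four simulated ranks, each holding a local gradient of length 8.
grads = [[float(rank + i) for i in range(8)] for rank in range(4)]

via_all_reduce = all_reduce(grads)
shards = reduce_scatter(grads)       # each rank holds 1/4 of the summed gradient
via_two_steps = all_gather(shards)   # each rank holds the full result again

assert via_two_steps == via_all_reduce
print(len(shards[0]), len(via_two_steps[0]))
```

Between the two steps each rank holds only its own shard, which is the state FSDP keeps resident. In FSDP the two halves are also used at different times: gradients are reduce-scattered after the backward pass, and parameters are all-gathered shortly before they are needed.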
---
## Recap ## Recap
- Data parallelism: - Data parallelism:
...@@ -386,18 +399,33 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco ...@@ -386,18 +399,33 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco
--- ---
## Recap
- Pipelining:
- Split the model over multiple GPUs
- Each GPU does a part of the forward pass
- The gradients are averaged at the end
- Pipelining, multi-node:
- Same, but gradients are averaged across nodes
- Fully Sharded Data Parallelism:
- Split the model, optimizer states and gradients across DDP ranks
- Decompose the all-reduce operations in DDP into separate reduce-scatter and all-gather operations
- Lower memory, higher communication volume
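
The "lower memory" point can be made concrete with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions, not a measurement: it assumes plain fp32 training with Adam (4 bytes each for the parameter, its gradient, and the two Adam moments, so 16 bytes per parameter), ignores activations and temporary buffers, and uses a hypothetical 7-billion-parameter model. `per_gpu_state_gib` is a made-up helper name.

```python
# Rough per-GPU memory for model state only (no activations, no buffers).
# Assumption: fp32 parameters, fp32 gradients, and Adam's two fp32 moments.
BYTES_PER_PARAM = 4 + 4 + 4 + 4  # param + grad + Adam m + Adam v

def per_gpu_state_gib(n_params, world_size, sharded):
    """Model-state memory per GPU in GiB.

    sharded=False: every rank keeps a full replica (as in DDP).
    sharded=True:  state is split evenly across ranks (as in FSDP).
    """
    total_bytes = n_params * BYTES_PER_PARAM
    if sharded:
        total_bytes /= world_size
    return total_bytes / 2**30

n_params = 7_000_000_000  # hypothetical model size
for gpus in (4, 16, 64):
    ddp = per_gpu_state_gib(n_params, gpus, sharded=False)
    fsdp = per_gpu_state_gib(n_params, gpus, sharded=True)
    print(f"{gpus:3d} GPUs: replicated {ddp:6.1f} GiB, sharded {fsdp:6.1f} GiB")
```

The replicated figure stays constant as GPUs are added, while the sharded figure shrinks with the number of ranks. The price is the extra communication volume noted above.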
---
## Are we there yet? ## Are we there yet?
![](images/are-we-there-yet-4.gif) ![](images/are-we-there-yet-4.gif)
--- ---
## If you haven't done so, please access the slides to clone repository: ## You can clone the repo yourself
![](images/slides.png) ![](images/slides.png)
- ```bash - ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git
``` ```
...@@ -518,7 +546,7 @@ learn.fine_tune(6) ...@@ -518,7 +546,7 @@ learn.fine_tune(6)
- Only add new requirements - Only add new requirements
- [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template) - [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)
- ```bash - ```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src cd $HOME/2025-03-talk-nxtaim/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
``` ```
- Add this to sc_venv_template/requirements.txt: - Add this to sc_venv_template/requirements.txt:
...@@ -558,7 +586,7 @@ source sc_venv_template/activate.sh ...@@ -558,7 +586,7 @@ source sc_venv_template/activate.sh
#SBATCH --partition=dc-gpu #SBATCH --partition=dc-gpu
# Make sure we are in the right directory # Make sure we are in the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages # This loads modules and python packages
source sc_venv_template/activate.sh source sc_venv_template/activate.sh
...@@ -579,7 +607,7 @@ time srun python serial.py ...@@ -579,7 +607,7 @@ time srun python serial.py
## Download dataset ## Download dataset
```bash ```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src cd $HOME/2025-03-talk-nxtaim/src
source sc_venv_template/activate.sh source sc_venv_template/activate.sh
python serial.py python serial.py
...@@ -599,7 +627,7 @@ Epoch 1/1 : |-------------------------------------------------------------| 0.71 ...@@ -599,7 +627,7 @@ Epoch 1/1 : |-------------------------------------------------------------| 0.71
## Running it ## Running it
- ```bash - ```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src cd $HOME/2025-03-talk-nxtaim/src
sbatch serial.slurm sbatch serial.slurm
``` ```
- On Juwels Booster, should take about 5 minutes - On Juwels Booster, should take about 5 minutes
...@@ -669,7 +697,7 @@ with learn.distrib_ctx(): ...@@ -669,7 +697,7 @@ with learn.distrib_ctx():
## Submission script: data parallel ## Submission script: data parallel
- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm) - Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim/-/blob/main/src/distrib.slurm)
- Main differences: - Main differences:
......
...@@ -4,7 +4,7 @@ ...@@ -4,7 +4,7 @@
<meta charset="utf-8"> <meta charset="utf-8">
<meta name="generator" content="pandoc"> <meta name="generator" content="pandoc">
<meta name="author" content="Alexandre Strube"> <meta name="author" content="Alexandre Strube">
<title>Getting Started with AI on Supercomputers</title> <title>Overview of AI distribution techniques</title>
<meta name="apple-mobile-web-app-capable" content="yes"> <meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"> <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui"> <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
...@@ -160,14 +160,14 @@ ...@@ -160,14 +160,14 @@
<div class="slides"> <div class="slides">
<section id="title-slide"> <section id="title-slide">
<h1 class="title">Getting Started with AI on Supercomputers</h1> <h1 class="title">Overview of AI distribution techniques</h1>
<p class="author">Alexandre Strube</p> <p class="author">Alexandre Strube</p>
</section> </section>
<section class="slide level2"> <section class="slide level2">
<p>This repo is specifically for the course described in <a <p>This repo is specifically for the course described in <a
href="https://indico3-jsc.fz-juelich.de/event/201">indico</a></p> href="https://nxtaim.de/en/events-en/">indico</a></p>
</section> </section>
<section class="slide level2"> <section class="slide level2">
...@@ -185,7 +185,7 @@ the self-contained HTML files.</li> ...@@ -185,7 +185,7 @@ the self-contained HTML files.</li>
</section> </section>
<section id="authors-and-acknowledgment" class="slide level2"> <section id="authors-and-acknowledgment" class="slide level2">
<h2>Authors and acknowledgment</h2> <h2>Authors and acknowledgment</h2>
<p>Alexandre Otto Strube, May 2024</p> <p>Alexandre Otto Strube, March 2025</p>
</section> </section>
<section id="certificate" class="slide level2"> <section id="certificate" class="slide level2">
<h2>Certificate</h2> <h2>Certificate</h2>
......
public/images/FSDP-graph-2a.png.webp

38 KiB

public/images/slides.png

652 B (old) | 457 B (new)
public/pics/sabrina.jpg

37.7 KiB

#!/bin/bash #!/bin/bash
#SBATCH --account=training2436 #SBATCH --account=SOME_ACCOUNT
#SBATCH --nodes=1 #SBATCH --nodes=1
#SBATCH --job-name=ai-multi-gpu #SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1 #SBATCH --ntasks-per-node=1
...@@ -7,7 +7,7 @@ ...@@ -7,7 +7,7 @@
#SBATCH --output=out-distrib.%j #SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j #SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00 #SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu #SBATCH --partition=dc-gpu # on JURECA
#SBATCH --gres=gpu:4 #SBATCH --gres=gpu:4
# Without this, srun does not inherit cpus-per-task from sbatch. # Without this, srun does not inherit cpus-per-task from sbatch.
...@@ -23,7 +23,7 @@ export MASTER_PORT=7010 ...@@ -23,7 +23,7 @@ export MASTER_PORT=7010
export GPUS_PER_NODE=4 export GPUS_PER_NODE=4
# Make sure we are in the right directory # Make sure we are in the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages # This loads modules and python packages
source sc_venv_template/activate.sh source sc_venv_template/activate.sh
......
#!/bin/bash #!/bin/bash
#SBATCH --account=training2436 #SBATCH --account=SOME_ACCOUNT
#SBATCH --nodes=1 #SBATCH --nodes=1
#SBATCH --job-name=ai-serial #SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1 #SBATCH --ntasks-per-node=1
...@@ -7,11 +7,11 @@ ...@@ -7,11 +7,11 @@
#SBATCH --output=out-serial.%j #SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j #SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00 #SBATCH --time=00:40:00
#SBATCH --partition=dc-gpu #SBATCH --partition=dc-gpu # on JURECA
#SBATCH --gres=gpu:1 #SBATCH --gres=gpu:1
# Make sure we are in the right directory # Make sure we are in the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages # This loads modules and python packages
source sc_venv_template/activate.sh source sc_venv_template/activate.sh
......