Commit faeb5b8f authored by Alexandre Strube

Update course materials and scripts for March 2025 event

parent 37eede42
Pipeline #258654 passed
---
author: Alexandre Strube
title: Getting Started with AI on Supercomputers
title: Overview of AI distribution techniques
---
This repo is specifically for the course described in [indico](https://indico3-jsc.fz-juelich.de/event/201)
This repo is specifically for the course described in [indico](https://nxtaim.de/en/events-en/)
---
@@ -18,7 +18,7 @@ This repo is specifically for the course described in [indico](https://indico3-j
Please, fork this thing! Use it! And submit merge requests!
## Authors and acknowledgment
Alexandre Otto Strube, May 2024
Alexandre Otto Strube, March 2025
## Certificate
Human resources make them.
...
@@ -2,12 +2,12 @@
author: Alexandre Strube
title: Deep Learning on Supercomputers
# subtitle: A primer in supercomputers
date: November 13, 2024
date: March 13, 2025
---
## Resources:
- [This document](https://strube1.pages.jsc.fz-juelich.de/2024-11-talk-intro-to-supercompting-jsc)
- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc)
- [This document](https://strube1.pages.jsc.fz-juelich.de/2025-03-talk-nxtaim)
- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim)
![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
@@ -20,10 +20,7 @@ date: November 13, 2024
![Alexandre Strube](pics/alex.jpg)
::::
:::: {.col}
![Ilya Zhukov](pics/ilya.jpg)
::::
:::: {.col}
![Jolanta Zjupa](pics/jolanta.jpg)
![Sabrina Benassou](pics/sabrina.jpg)
::::
:::
@@ -37,7 +34,7 @@ date: November 13, 2024
- on a multi-gpu, multi-node system
- like a supercomputer 🤯
- Important: This is an overview, _*NOT*_ a basic AI course!
- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-4)
- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2025/ai-sc-1)
- ![](images/bringing-dl-workloads-2024-2-course.png)
![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
@@ -47,21 +44,21 @@ date: November 13, 2024
### Please access it now, so you can follow along:
[https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc](https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc)
[https://go.fzj.de/2025-03-nxtaim](https://go.fzj.de/2025-03-nxtaim)
![](images/slides.png)
---
## Git clone this repository
<!-- ## Git clone this repository
- All slides and source code
- Connect to the supercomputer and do this:
- ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git
```
---
--- -->
## Deep learning is...
@@ -369,6 +366,22 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco
---
## Fully Sharded Data Parallelism
- Shards model parameters, optimizer states and gradients across DDP ranks.
- Decomposes the all-reduce operations in DDP into separate reduce-scatter and all-gather operations (see the sketch below):
- ![](images/FSDP-graph-2a.png.webp){ width=450px }
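- A minimal sketch of this decomposition with raw `torch.distributed` collectives (an illustration, not the course code; it assumes an already-initialized process group and a gradient whose first dimension divides evenly across ranks):
- ```python
import torch
import torch.distributed as dist

def all_reduce_via_shards(grad: torch.Tensor) -> torch.Tensor:
    # Sum `grad` across all ranks via reduce-scatter + all-gather.
    world_size = dist.get_world_size()
    shards = list(grad.chunk(world_size))
    summed = torch.empty_like(shards[0])
    # Each rank receives the sum of one shard only; that alone is
    # enough for FSDP to update its own slice of the parameters...
    dist.reduce_scatter(summed, shards, op=dist.ReduceOp.SUM)
    # ...an all-gather then rebuilds the full summed gradient,
    # matching what a single all-reduce would have produced.
    out = [torch.empty_like(summed) for _ in range(world_size)]
    dist.all_gather(out, summed)
    return torch.cat(out)
```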
---
## Fully Sharded Data Parallelism
- Reduces the memory footprint of each GPU
- Increases the communication volume
- Allows for massive scaling (100000+ GPUs)
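- As an illustration (not the course code; the model and layer sizes are placeholders), wrapping a model with PyTorch's FSDP looks roughly like this:
- ```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # one process per GPU, e.g. via torchrun or srun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
model = FSDP(model)  # parameters, gradients and optimizer state get sharded
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # create after wrapping
```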
---
## Recap
- Data parallelism:
......@@ -386,18 +399,33 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco
---
## Recap
- Pipelining:
- Split the model over multiple GPUs (see the sketch after this list)
- Each GPU does a part of the forward pass
- The gradients are averaged at the end
- Pipelining, multi-node:
- Same, but gradients are averaged across nodes
- Fully Sharded Data Parallelism:
- Split the model, optimizer states and gradients across DDP ranks
- Decompose the all-reduce operations in DDP into separate reduce-scatter and all-gather operations
- Lower memory, higher communication volume
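- For the pipelining idea above, a minimal model-splitting sketch (illustrative layers only; true pipelining additionally cuts each batch into micro-batches so both GPUs work concurrently):
- ```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    # Naive split: first stage lives on cuda:0, second stage on cuda:1.
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))  # activations hop between GPUs
```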
---
## Are we there yet?
![](images/are-we-there-yet-4.gif)
---
## If you haven't done so, please access the slides to clone repository:
## You can clone the repo yourself
![](images/slides.png)
- ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git
```
@@ -518,7 +546,7 @@ learn.fine_tune(6)
- Only add new requirements
- [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)
- ```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```
- Add this to sc_venv_template/requirements.txt:
@@ -558,7 +586,7 @@ source sc_venv_template/activate.sh
#SBATCH --partition=dc-gpu
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages
source sc_venv_template/activate.sh
@@ -579,7 +607,7 @@ time srun python serial.py
## Download dataset
```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
source sc_venv_template/activate.sh
python serial.py
@@ -599,7 +627,7 @@ Epoch 1/1 : |-------------------------------------------------------------| 0.71
## Running it
- ```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
sbatch serial.slurm
```
- On JUWELS Booster, this should take about 5 minutes
@@ -669,7 +697,7 @@ with learn.distrib_ctx():
## Submission script: data parallel
- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm)
- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim/-/blob/main/src/distrib.slurm)
- Main differences:
...
@@ -4,7 +4,7 @@
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="author" content="Alexandre Strube">
<title>Getting Started with AI on Supercomputers</title>
<title>Overview of AI distribution techniques</title>
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
@@ -160,14 +160,14 @@
<div class="slides">
<section id="title-slide">
<h1 class="title">Getting Started with AI on Supercomputers</h1>
<h1 class="title">Overview of AI distribution techniques</h1>
<p class="author">Alexandre Strube</p>
</section>
<section class="slide level2">
<p>This repo is specifically for the course described in <a
href="https://indico3-jsc.fz-juelich.de/event/201">indico</a></p>
href="https://nxtaim.de/en/events-en/">indico</a></p>
</section>
<section class="slide level2">
@@ -185,7 +185,7 @@ the self-contained HTML files.</li>
</section>
<section id="authors-and-acknowledgment" class="slide level2">
<h2>Authors and acknowledgment</h2>
<p>Alexandre Otto Strube, May 2024</p>
<p>Alexandre Otto Strube, March 2025</p>
</section>
<section id="certificate" class="slide level2">
<h2>Certificate</h2>
...
Binary files changed:
public/images/FSDP-graph-2a.png.webp (added, 38 KiB)
public/images/slides.png (652 B → 457 B)
public/pics/sabrina.jpg (added, 37.7 KiB)
src/distrib.slurm:

#!/bin/bash
#SBATCH --account=training2436
#SBATCH --account=SOME_ACCOUNT
#SBATCH --nodes=1
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
@@ -7,7 +7,7 @@
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --partition=dc-gpu # on JURECA
#SBATCH --gres=gpu:4
# Without this, srun does not inherit cpus-per-task from sbatch.
@@ -23,7 +23,7 @@ export MASTER_PORT=7010
export GPUS_PER_NODE=4
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages
source sc_venv_template/activate.sh
...
src/serial.slurm:

#!/bin/bash
#SBATCH --account=training2436
#SBATCH --account=SOME_ACCOUNT
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1
@@ -7,11 +7,11 @@
#SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00
#SBATCH --partition=dc-gpu
#SBATCH --partition=dc-gpu # on JURECA
#SBATCH --gres=gpu:1
# Make sure we are on the right directory
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
cd $HOME/2025-03-talk-nxtaim/src
# This loads modules and python packages
source sc_venv_template/activate.sh
...