diff --git a/README.md b/README.md index 26fee9231c2ead67091dbc3c4a78023c5180b408..9df2e389f653a7a1965505eb042d72b38bffc767 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,9 @@ --- author: Alexandre Strube -title: Getting Started with AI on Supercomputers +title: Overview of AI distribution techniques --- -This repo is specifically for the course described in [indico](https://indico3-jsc.fz-juelich.de/event/201) +This repo is specifically for the course described in [indico](https://nxtaim.de/en/events-en/) --- @@ -18,7 +18,7 @@ This repo is specifically for the course described in [indico](https://indico3-j Please, fork this thing! Use it! And submit merge requests! ## Authors and acknowledgment -Alexandre Otto Strube, May 2024 +Alexandre Otto Strube, March 2025 ## Certificate Human resources make them. diff --git a/index.md b/index.md index c5a5853da0c218e0839dfedd69d35bc67b55a2bf..dd2070467b34249df179465eb2985464fb0bb9e8 100644 --- a/index.md +++ b/index.md @@ -2,12 +2,12 @@ author: Alexandre Strube title: Deep Learning on Supercomputers # subtitle: A primer in supercomputers` -date: November 13, 2024 +date: March 13, 2025 --- ## Resources: -- [This document](https://strube1.pages.jsc.fz-juelich.de/2024-11-talk-intro-to-supercompting-jsc) -- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc) +- [This document](https://strube1.pages.jsc.fz-juelich.de/2025-03-talk-nxtaim) +- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim)  @@ -20,10 +20,7 @@ date: November 13, 2024  :::: :::: {.col} - -:::: -:::: {.col} - + :::: ::: @@ -37,7 +34,7 @@ date: November 13, 2024 - on a mult-gpu, multi-node system - like a supercomputer 🤯 - Important: This is an overview, _*NOT*_ a basic AI course! -- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-4) +- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2025/ai-sc-1) -   @@ -47,21 +44,21 @@ date: November 13, 2024 ### Please access it now, so you can follow along: -[https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc](https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc) +[https://go.fzj.de/2025-03-nxtaim](https://go.fzj.de/2025-03-nxtaim)  --- -## Git clone this repository +<!-- ## Git clone this repository - All slides and source code - Connect to the supercomputer and do this: - ```bash -git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git +git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git ``` ---- +--- --> ## Deep learning is... @@ -369,6 +366,22 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco --- +## Fully Sharded Data Parallelism + +- Shards model parameters, optimizer states and gradients across DDP ranks. 
+- Decomposes the all-reduce operations in DDP into separate reduce-scatter and all-gather operations: +- { width=450px } + +--- + +## Fully Sharded Data Parallelism + +- Reduces the memory footprint of each GPU +- Increases the communication volume +- Allows for massive scaling (100000+ GPUs) + +--- + ## Recap - Data parallelism: @@ -386,18 +399,33 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-superco --- +## Recap + +- Pipelining: + - Split the model over multiple GPUs + - Each GPU does a part of the forward pass + - The gradients are averaged at the end +- Pipelining, multi-node: + - Same, but gradients are averaged across nodes +- Fully Sharded Data Parallelism: + - Split the model, optimizer states and gradients across DDP ranks + - Decompose the all-reduce operations in DDP into separate reduce-scatter and all-gather operations + - Lower memory, higher communication volume + +--- + ## Are we there yet?  --- -## If you haven't done so, please access the slides to clone repository: +## You can clone the repo yourself  - ```bash -git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git +git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git ``` @@ -518,7 +546,7 @@ learn.fine_tune(6) - Only add new requirements - [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template) - ```bash -cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src +cd $HOME/2025-03-talk-nxtaim/src git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git ``` - Add this to sc_venv_template/requirements.txt: @@ -558,7 +586,7 @@ source sc_venv_template/activate.sh #SBATCH --partition=dc-gpu # Make sure we are on the right directory -cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src +cd $HOME/2025-03-talk-nxtaim/src # This loads modules and python packages source sc_venv_template/activate.sh @@ -579,7 +607,7 @@ time srun python serial.py ## Download dataset ```bash -cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src +cd $HOME/2025-03-talk-nxtaim/src source sc_venv_template/activate.sh python serial.py  (Some warnings) epoch train_loss valid_loss accuracy top_k_accuracy time Epoch 1/1 : |-------------------------------------------------------------| 0.71 ## Running it - ```bash -cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src +cd $HOME/2025-03-talk-nxtaim/src sbatch serial.slurm ``` - On Juwels Booster, should take about 5 minutes @@ -669,7 +697,7 @@ with learn.distrib_ctx(): ## Submission script: data parallel -- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm) +- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim/-/blob/main/src/distrib.slurm) - Main differences: diff --git a/public/README.html b/public/README.html index 35564c4a09f6e3d71ead3a2f64325a79fd5a62f4..c57a9c555ade5f35ce9ab0f6c68d38343ff0b3cf 100644 --- a/public/README.html +++ b/public/README.html @@ -4,7 +4,7 @@ <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="author" content="Alexandre Strube"> - <title>Getting Started with AI on Supercomputers</title> + <title>Overview of AI distribution techniques</title> <meta name="apple-mobile-web-app-capable" content="yes"> <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"> <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui"> @@
-160,14 +160,14 @@ <div class="slides"> <section id="title-slide"> - <h1 class="title">Getting Started with AI on Supercomputers</h1> + <h1 class="title">Overview of AI distribution techniques</h1> <p class="author">Alexandre Strube</p> </section> <section class="slide level2"> <p>This repo is specifically for the course described in <a -href="https://indico3-jsc.fz-juelich.de/event/201">indico</a></p> +href="https://nxtaim.de/en/events-en/">indico</a></p> </section> <section class="slide level2"> @@ -185,7 +185,7 @@ the self-contained HTML files.</li> </section> <section id="authors-and-acknowledgment" class="slide level2"> <h2>Authors and acknowledgment</h2> -<p>Alexandre Otto Strube, May 2024</p> +<p>Alexandre Otto Strube, March 2025</p> </section> <section id="certificate" class="slide level2"> <h2>Certificate</h2> diff --git a/public/images/FSDP-graph-2a.png.webp b/public/images/FSDP-graph-2a.png.webp new file mode 100644 index 0000000000000000000000000000000000000000..5015cd0698d6fe509f1d1fe1015c55b2712f8fa1 Binary files /dev/null and b/public/images/FSDP-graph-2a.png.webp differ diff --git a/public/images/slides.png b/public/images/slides.png index f50cc5c013ae4fcb841dc075014b5c2ce727e643..83d6d1525a43b8b2a81f29cc8bb5dc037b8d22f8 100644 Binary files a/public/images/slides.png and b/public/images/slides.png differ diff --git a/public/index.html b/public/index.html index a6e10a4b2331a33abb06cb1e3c8ff60b7da178a1..4f516436a1ef51c6f91b233edeb19cff0c38222a 100644 --- a/public/index.html +++ b/public/index.html @@ -4,7 +4,7 @@ <meta charset="utf-8"> <meta name="generator" content="pandoc"> <meta name="author" content="Alexandre Strube"> - <meta name="dcterms.date" content="2024-11-13"> + <meta name="dcterms.date" content="2025-03-13"> <title>Deep Learning on Supercomputers</title> <meta name="apple-mobile-web-app-capable" content="yes"> <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent"> @@ -227,17 +227,17 @@ <section id="title-slide"> <h1 class="title">Deep Learning on Supercomputers</h1> <p class="author">Alexandre Strube</p> - <p class="date">November 13, 2024</p> + <p class="date">March 13, 2025</p> </section> <section id="resources" class="slide level2"> <h2>Resources:</h2> <ul> <li class="fragment"><a -href="https://strube1.pages.jsc.fz-juelich.de/2024-11-talk-intro-to-supercompting-jsc">This +href="https://strube1.pages.jsc.fz-juelich.de/2025-03-talk-nxtaim">This document</a></li> <li class="fragment"><a -href="https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc">Source +href="https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim">Source code of this course</a></li> </ul> <p><img @@ -254,14 +254,8 @@ data-src="images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg" /></p> </div> <div class="col"> <figure> -<img data-src="pics/ilya.jpg" alt="Ilya Zhukov" /> -<figcaption aria-hidden="true">Ilya Zhukov</figcaption> -</figure> -</div> -<div class="col"> -<figure> -<img data-src="pics/jolanta.jpg" alt="Jolanta Zjupa" /> -<figcaption aria-hidden="true">Jolanta Zjupa</figcaption> +<img data-src="pics/sabrina.jpg" alt="Sabrina Benassou" /> +<figcaption aria-hidden="true">Sabrina Benassou</figcaption> </figure> </div> </div> @@ -279,7 +273,7 @@ data-src="images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg" /></p> <li class="fragment">Important: This is an overview, <em><em>NOT</em></em> a basic AI course!</li> <li class="fragment">We have <a 
-href="https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-4">introductory +href="https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2025/ai-sc-1">introductory courses on AI on supercomputers</a></li> <li class="fragment"><img data-src="images/bringing-dl-workloads-2024-2-course.png" /> <img @@ -291,17 +285,20 @@ data-src="images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg" /></li> <h3 id="please-access-it-now-so-you-can-follow-along">Please access it now, so you can follow along:</h3> <p><a -href="https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc">https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc</a></p> +href="https://go.fzj.de/2025-03-nxtaim">https://go.fzj.de/2025-03-nxtaim</a></p> <p><img data-src="images/slides.png" /></p> </section> -<section id="git-clone-this-repository" class="slide level2"> -<h2>Git clone this repository</h2> -<ul> -<li class="fragment">All slides and source code</li> -<li class="fragment">Connect to the supercomputer and do this:</li> -<li class="fragment"><div class="sourceCode" id="cb1"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git</span></code></pre></div></li> -</ul> +<section class="slide level2"> + +<!-- ## Git clone this repository + +- All slides and source code +- Connect to the supercomputer and do this: +- ```bash +git clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git +``` + +--- --> </section> <section id="deep-learning-is" class="slide level2"> <h2>Deep learning is…</h2> @@ -574,6 +571,25 @@ gpus</li> <li class="fragment">One can even pipeline among nodes….</li> </ul> </section> +<section id="fully-sharded-data-parallelism" class="slide level2"> +<h2>Fully Sharded Data Parallelism</h2> +<ul> +<li class="fragment">Shards model parameters, optimizer states and +gradients across DDP ranks.</li> +<li class="fragment">Decompose the all-reduce operations in DDP into +separate reduce-scatter and all-gather operations:</li> +<li class="fragment"><img data-src="images/FSDP-graph-2a.png.webp" +width="450" /></li> +</ul> +</section> +<section id="fully-sharded-data-parallelism-1" class="slide level2"> +<h2>Fully Sharded Data Parallelism</h2> +<ul> +<li class="fragment">Reduces the memory footprint of each GPU</li> +<li class="fragment">Increases the communication volume</li> +<li class="fragment">Allows for massive scaling (100000+ GPUs)</li> +</ul> +</section> <section id="recap" class="slide level2"> <h2>Recap</h2> <ul> @@ -599,19 +615,39 @@ gpus</li> </ul></li> </ul> </section> +<section id="recap-1" class="slide level2"> +<h2>Recap</h2> +<ul> +<li class="fragment">Pipelining: +<ul> +<li class="fragment">Split the model over multiple GPUs</li> +<li class="fragment">Each GPU does a part of the forward pass</li> +<li class="fragment">The gradients are averaged at the end</li> +</ul></li> +<li class="fragment">Pipelining, multi-node: +<ul> +<li class="fragment">Same, but gradients are averaged across nodes</li> +</ul></li> +<li class="fragment">Fully Sharded Data Parallelism: +<ul> +<li class="fragment">Split the model, optimizer states and gradients +across DDP ranks</li> +<li class="fragment">Decompose the all-reduce operations in DDP into +separate reduce-scatter and all-gather operations</li> +<li class="fragment">Lower memory, higher communication volume</li> +</ul></li> 
+</ul> +</section> <section id="are-we-there-yet-3" class="slide level2"> <h2>Are we there yet?</h2> <p><img data-src="images/are-we-there-yet-4.gif" /></p> </section> -<section -id="if-you-havent-done-so-please-access-the-slides-to-clone-repository" -class="slide level2"> -<h2>If you haven’t done so, please access the slides to clone -repository:</h2> +<section id="you-can-clone-the-repo-yourself" class="slide level2"> +<h2>You can clone the repo yourself</h2> <p><img data-src="images/slides.png" /></p> <ul> -<li class="fragment"><div class="sourceCode" id="cb2"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git</span></code></pre></div></li> +<li class="fragment"><div class="sourceCode" id="cb1"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim.git</span></code></pre></div></li> </ul> </section> <section id="demo-time" class="slide level2"> @@ -626,12 +662,33 @@ node</li> </section> <section id="expected-imports" class="slide level2"> <h2>Expected imports</h2> +<div class="sourceCode" id="cb2"><pre +class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.<span class="bu">all</span> <span class="im">import</span> <span class="op">*</span></span> +<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.distributed <span class="im">import</span> <span class="op">*</span></span> +<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.models.xresnet <span class="im">import</span> <span class="op">*</span></span> +<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a></span></code></pre></div> +</section> +<section id="bringing-your-data-in" class="slide level2"> +<h2>Bringing your data in*</h2> <div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.<span class="bu">all</span> <span 
class="im">import</span> <span class="op">*</span></span> <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.distributed <span class="im">import</span> <span class="op">*</span></span> <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.models.xresnet <span class="im">import</span> <span class="op">*</span></span> -<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="co"># DOWNLOADS DATASET - we need to do this on the login node</span></span> +<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> untar_data(URLs.IMAGEWOOF_320) </span> <span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a></span> <span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a></span> <span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a></span> @@ -645,29 +702,29 @@ class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a hr <span id="cb3-16"><a href="#cb3-16" aria-hidden="true" tabindex="-1"></a></span> <span id="cb3-17"><a href="#cb3-17" aria-hidden="true" tabindex="-1"></a></span></code></pre></div> </section> -<section id="bringing-your-data-in" class="slide level2"> -<h2>Bringing your data in*</h2> +<section id="loading-your-data" class="slide level2"> +<h2>Loading your data</h2> <div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.<span class="bu">all</span> <span class="im">import</span> <span class="op">*</span></span> <span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.distributed <span class="im">import</span> <span class="op">*</span></span> <span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.models.xresnet <span class="im">import</span> <span class="op">*</span></span> -<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="co"># DOWNLOADS DATASET - we need to do this on the login node</span></span> -<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> untar_data(URLs.IMAGEWOOF_320) </span> -<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> untar_data(URLs.IMAGEWOOF_320)</span> +<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a>dls <span class="op">=</span> DataBlock(</span> +<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" 
tabindex="-1"></a> blocks<span class="op">=</span>(ImageBlock, CategoryBlock),</span> +<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a> splitter<span class="op">=</span>GrandparentSplitter(valid_name<span class="op">=</span><span class="st">'val'</span>),</span> +<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a> get_items<span class="op">=</span>get_image_files, get_y<span class="op">=</span>parent_label,</span> +<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a> item_tfms<span class="op">=</span>[RandomResizedCrop(<span class="dv">160</span>), FlipItem(<span class="fl">0.5</span>)],</span> +<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a> batch_tfms<span class="op">=</span>Normalize.from_stats(<span class="op">*</span>imagenet_stats)</span> +<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a>).dataloaders(path, path<span class="op">=</span>path, bs<span class="op">=</span><span class="dv">64</span>)</span> <span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a></span> <span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a></span> <span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a></span> <span id="cb4-16"><a href="#cb4-16" aria-hidden="true" tabindex="-1"></a></span> <span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a></span></code></pre></div> </section> -<section id="loading-your-data" class="slide level2"> -<h2>Loading your data</h2> +<section id="single-gpu-code" class="slide level2"> +<h2>Single-gpu code</h2> <div class="sourceCode" id="cb5"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.<span class="bu">all</span> <span class="im">import</span> <span class="op">*</span></span> <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.distributed <span class="im">import</span> <span class="op">*</span></span> @@ -682,30 +739,9 @@ class="sourceCode python"><code class="sourceCode python"><span id="cb5-1"><a hr <span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a> batch_tfms<span class="op">=</span>Normalize.from_stats(<span class="op">*</span>imagenet_stats)</span> <span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a>).dataloaders(path, path<span class="op">=</span>path, bs<span class="op">=</span><span class="dv">64</span>)</span> <span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a>learn <span class="op">=</span> Learner(dls, xresnet50(n_out<span class="op">=</span><span class="dv">10</span>), metrics<span class="op">=</span>[accuracy,top_k_accuracy]).to_fp16()</span> <span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb5-17"><a href="#cb5-17" aria-hidden="true" tabindex="-1"></a></span></code></pre></div> -</section> -<section id="single-gpu-code" class="slide level2"> -<h2>Single-gpu code</h2> -<div class="sourceCode" id="cb6"><pre -class="sourceCode python"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span 
class="im">from</span> fastai.vision.<span class="bu">all</span> <span class="im">import</span> <span class="op">*</span></span> -<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.distributed <span class="im">import</span> <span class="op">*</span></span> -<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.models.xresnet <span class="im">import</span> <span class="op">*</span></span> -<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> untar_data(URLs.IMAGEWOOF_320)</span> -<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>dls <span class="op">=</span> DataBlock(</span> -<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a> blocks<span class="op">=</span>(ImageBlock, CategoryBlock),</span> -<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a> splitter<span class="op">=</span>GrandparentSplitter(valid_name<span class="op">=</span><span class="st">'val'</span>),</span> -<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a> get_items<span class="op">=</span>get_image_files, get_y<span class="op">=</span>parent_label,</span> -<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a> item_tfms<span class="op">=</span>[RandomResizedCrop(<span class="dv">160</span>), FlipItem(<span class="fl">0.5</span>)],</span> -<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a> batch_tfms<span class="op">=</span>Normalize.from_stats(<span class="op">*</span>imagenet_stats)</span> -<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a>).dataloaders(path, path<span class="op">=</span>path, bs<span class="op">=</span><span class="dv">64</span>)</span> -<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>learn <span class="op">=</span> Learner(dls, xresnet50(n_out<span class="op">=</span><span class="dv">10</span>), metrics<span class="op">=</span>[accuracy,top_k_accuracy]).to_fp16()</span> -<span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a>learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> +<span id="cb5-16"><a href="#cb5-16" aria-hidden="true" tabindex="-1"></a>learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> </section> <section id="venv_template" class="slide level2"> <h2>Venv_template</h2> @@ -718,48 +754,48 @@ modules</li> <li class="fragment"><a href="https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template">Link to gitlab repo</a></li> -<li class="fragment"><div class="sourceCode" id="cb7"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span> -<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git</span></code></pre></div></li> +<li class="fragment"><div class="sourceCode" id="cb6"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb6-1"><a href="#cb6-1" 
aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2025-03-talk-nxtaim/src</span> +<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git</span></code></pre></div></li> <li class="fragment">Add this to sc_venv_template/requirements.txt:</li> +<li class="fragment"><div class="sourceCode" id="cb7"><pre +class="sourceCode python"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Add here the pip packages you would like to install on this virtual environment / kernel</span></span> +<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>pip</span> +<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>fastai<span class="op">==</span><span class="fl">2.7.15</span></span> +<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>scipy<span class="op">==</span><span class="fl">1.11.1</span></span> +<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>matplotlib<span class="op">==</span><span class="fl">3.7.2</span></span> +<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>scikit<span class="op">-</span>learn<span class="op">==</span><span class="fl">1.3.1</span></span> +<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a>pandas<span class="op">==</span><span class="fl">2.0.3</span></span> +<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a>torch<span class="op">==</span><span class="fl">2.1.2</span></span> +<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a>accelerate</span></code></pre></div></li> <li class="fragment"><div class="sourceCode" id="cb8"><pre -class="sourceCode python"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Add here the pip packages you would like to install on this virtual environment / kernel</span></span> -<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>pip</span> -<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>fastai<span class="op">==</span><span class="fl">2.7.15</span></span> -<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>scipy<span class="op">==</span><span class="fl">1.11.1</span></span> -<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a>matplotlib<span class="op">==</span><span class="fl">3.7.2</span></span> -<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a>scikit<span class="op">-</span>learn<span class="op">==</span><span class="fl">1.3.1</span></span> -<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a>pandas<span class="op">==</span><span class="fl">2.0.3</span></span> -<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a>torch<span class="op">==</span><span class="fl">2.1.2</span></span> -<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a>accelerate</span></code></pre></div></li> -<li class="fragment"><div class="sourceCode" id="cb9"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="ex">sc_venv_template/setup.sh</span></span> -<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> 
sc_venv_template/activate.sh</span></code></pre></div></li> +class="sourceCode bash"><code class="sourceCode bash"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="ex">sc_venv_template/setup.sh</span></span> +<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span></code></pre></div></li> <li class="fragment">Done! You installed everything you need</li> </ul> </section> <section id="submission-script" class="slide level2"> <h2>Submission Script</h2> -<div class="sourceCode" id="cb10"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co">#!/bin/bash</span></span> -<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --account=training2436</span></span> -<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --nodes=1</span></span> -<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --job-name=ai-serial</span></span> -<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --ntasks-per-node=1</span></span> -<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=1</span></span> -<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --output=out-serial.%j</span></span> -<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --error=err-serial.%j</span></span> -<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --time=00:40:00</span></span> -<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --partition=dc-gpu</span></span> -<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a><span class="co"># Make sure we are on the right directory</span></span> -<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span> -<span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a><span class="co"># This loads modules and python packages</span></span> -<span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span> -<span id="cb10-17"><a href="#cb10-17" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb10-18"><a href="#cb10-18" aria-hidden="true" tabindex="-1"></a><span class="co"># Run the demo</span></span> -<span id="cb10-19"><a href="#cb10-19" aria-hidden="true" tabindex="-1"></a><span class="bu">time</span> srun python serial.py</span></code></pre></div> +<div class="sourceCode" id="cb9"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co">#!/bin/bash</span></span> +<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --account=training2436</span></span> +<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span 
class="co">#SBATCH --nodes=1</span></span> +<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --job-name=ai-serial</span></span> +<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --ntasks-per-node=1</span></span> +<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=1</span></span> +<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --output=out-serial.%j</span></span> +<span id="cb9-8"><a href="#cb9-8" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --error=err-serial.%j</span></span> +<span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --time=00:40:00</span></span> +<span id="cb9-10"><a href="#cb9-10" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --partition=dc-gpu</span></span> +<span id="cb9-11"><a href="#cb9-11" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb9-12"><a href="#cb9-12" aria-hidden="true" tabindex="-1"></a><span class="co"># Make sure we are on the right directory</span></span> +<span id="cb9-13"><a href="#cb9-13" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2025-03-talk-nxtaim/src</span> +<span id="cb9-14"><a href="#cb9-14" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb9-15"><a href="#cb9-15" aria-hidden="true" tabindex="-1"></a><span class="co"># This loads modules and python packages</span></span> +<span id="cb9-16"><a href="#cb9-16" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span> +<span id="cb9-17"><a href="#cb9-17" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb9-18"><a href="#cb9-18" aria-hidden="true" tabindex="-1"></a><span class="co"># Run the demo</span></span> +<span id="cb9-19"><a href="#cb9-19" aria-hidden="true" tabindex="-1"></a><span class="bu">time</span> srun python serial.py</span></code></pre></div> </section> <section id="download-dataset" class="slide level2"> <h2>Download dataset</h2> @@ -770,14 +806,14 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href= </section> <section id="download-dataset-1" class="slide level2"> <h2>Download dataset</h2> -<div class="sourceCode" id="cb11"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span> -<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span> -<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> serial.py</span> -<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a><span class="kw">(</span><span class="ex">Some</span> warnings<span class="kw">)</span></span> -<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> -<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a><span class="ex">Epoch</span> 1/1 : <span class="kw">|</span><span class="ex">-------------------------------------------------------------</span><span 
class="kw">|</span> <span class="ex">0.71%</span> [1/141 00:07<span class="op"><</span>16:40]</span></code></pre></div> +<div class="sourceCode" id="cb10"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2025-03-talk-nxtaim/src</span> +<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span> +<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> serial.py</span> +<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="kw">(</span><span class="ex">Some</span> warnings<span class="kw">)</span></span> +<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> +<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="ex">Epoch</span> 1/1 : <span class="kw">|</span><span class="ex">-------------------------------------------------------------</span><span class="kw">|</span> <span class="ex">0.71%</span> [1/141 00:07<span class="op"><</span>16:40]</span></code></pre></div> <ul> <li class="fragment">It started training, on the login node’s CPUs (WRONG!!!)</li> @@ -788,9 +824,9 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href= <section id="running-it" class="slide level2"> <h2>Running it</h2> <ul> -<li class="fragment"><div class="sourceCode" id="cb12"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span> -<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="ex">sbatch</span> serial.slurm</span></code></pre></div></li> +<li class="fragment"><div class="sourceCode" id="cb11"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2025-03-talk-nxtaim/src</span> +<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="ex">sbatch</span> serial.slurm</span></code></pre></div></li> <li class="fragment">On Juwels Booster, should take about 5 minutes</li> <li class="fragment">On a cpu system this would take half a day</li> <li class="fragment">Check the out-serial-XXXXXX and err-serial-XXXXXX @@ -806,78 +842,78 @@ differences</li> </section> <section id="data-parallel-4" class="slide level2"> <h2>Data parallel</h2> -<div class="sourceCode" id="cb13"><pre -class="sourceCode python"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.<span class="bu">all</span> <span class="im">import</span> <span class="op">*</span></span> -<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.distributed <span class="im">import</span> <span class="op">*</span></span> -<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.models.xresnet <span class="im">import</span> <span 
class="op">*</span></span> -<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> rank0_first(untar_data, URLs.IMAGEWOOF_320)</span> -<span id="cb13-6"><a href="#cb13-6" aria-hidden="true" tabindex="-1"></a>dls <span class="op">=</span> DataBlock(</span> -<span id="cb13-7"><a href="#cb13-7" aria-hidden="true" tabindex="-1"></a> blocks<span class="op">=</span>(ImageBlock, CategoryBlock),</span> -<span id="cb13-8"><a href="#cb13-8" aria-hidden="true" tabindex="-1"></a> splitter<span class="op">=</span>GrandparentSplitter(valid_name<span class="op">=</span><span class="st">'val'</span>),</span> -<span id="cb13-9"><a href="#cb13-9" aria-hidden="true" tabindex="-1"></a> get_items<span class="op">=</span>get_image_files, get_y<span class="op">=</span>parent_label,</span> -<span id="cb13-10"><a href="#cb13-10" aria-hidden="true" tabindex="-1"></a> item_tfms<span class="op">=</span>[RandomResizedCrop(<span class="dv">160</span>), FlipItem(<span class="fl">0.5</span>)],</span> -<span id="cb13-11"><a href="#cb13-11" aria-hidden="true" tabindex="-1"></a> batch_tfms<span class="op">=</span>Normalize.from_stats(<span class="op">*</span>imagenet_stats)</span> -<span id="cb13-12"><a href="#cb13-12" aria-hidden="true" tabindex="-1"></a>).dataloaders(path, path<span class="op">=</span>path, bs<span class="op">=</span><span class="dv">64</span>)</span> -<span id="cb13-13"><a href="#cb13-13" aria-hidden="true" tabindex="-1"></a></span> -<span id="cb13-14"><a href="#cb13-14" aria-hidden="true" tabindex="-1"></a>learn <span class="op">=</span> Learner(dls, xresnet50(n_out<span class="op">=</span><span class="dv">10</span>), metrics<span class="op">=</span>[accuracy,top_k_accuracy]).to_fp16()</span> -<span id="cb13-15"><a href="#cb13-15" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> learn.distrib_ctx():</span> -<span id="cb13-16"><a href="#cb13-16" aria-hidden="true" tabindex="-1"></a> learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> +<div class="sourceCode" id="cb12"><pre +class="sourceCode python"><code class="sourceCode python"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.<span class="bu">all</span> <span class="im">import</span> <span class="op">*</span></span> +<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.distributed <span class="im">import</span> <span class="op">*</span></span> +<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> fastai.vision.models.xresnet <span class="im">import</span> <span class="op">*</span></span> +<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> rank0_first(untar_data, URLs.IMAGEWOOF_320)</span> +<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a>dls <span class="op">=</span> DataBlock(</span> +<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a> blocks<span class="op">=</span>(ImageBlock, CategoryBlock),</span> +<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a> splitter<span class="op">=</span>GrandparentSplitter(valid_name<span class="op">=</span><span class="st">'val'</span>),</span> +<span id="cb12-9"><a href="#cb12-9" 
aria-hidden="true" tabindex="-1"></a> get_items<span class="op">=</span>get_image_files, get_y<span class="op">=</span>parent_label,</span> +<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a> item_tfms<span class="op">=</span>[RandomResizedCrop(<span class="dv">160</span>), FlipItem(<span class="fl">0.5</span>)],</span> +<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a> batch_tfms<span class="op">=</span>Normalize.from_stats(<span class="op">*</span>imagenet_stats)</span> +<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a>).dataloaders(path, path<span class="op">=</span>path, bs<span class="op">=</span><span class="dv">64</span>)</span> +<span id="cb12-13"><a href="#cb12-13" aria-hidden="true" tabindex="-1"></a></span> +<span id="cb12-14"><a href="#cb12-14" aria-hidden="true" tabindex="-1"></a>learn <span class="op">=</span> Learner(dls, xresnet50(n_out<span class="op">=</span><span class="dv">10</span>), metrics<span class="op">=</span>[accuracy,top_k_accuracy]).to_fp16()</span> +<span id="cb12-15"><a href="#cb12-15" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> learn.distrib_ctx():</span> +<span id="cb12-16"><a href="#cb12-16" aria-hidden="true" tabindex="-1"></a> learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> </section> <section id="data-parallel-5" class="slide level2"> <h2>Data Parallel</h2> <h3 id="what-changed">What changed?</h3> <p>It was</p> -<div class="sourceCode" id="cb14"><pre -class="sourceCode python"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> untar_data(URLs.IMAGEWOOF_320)</span></code></pre></div> +<div class="sourceCode" id="cb13"><pre +class="sourceCode python"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> untar_data(URLs.IMAGEWOOF_320)</span></code></pre></div> <p>Became</p> -<div class="sourceCode" id="cb15"><pre -class="sourceCode python"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> rank0_first(untar_data, URLs.IMAGEWOOF_320)</span></code></pre></div> +<div class="sourceCode" id="cb14"><pre +class="sourceCode python"><code class="sourceCode python"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> rank0_first(untar_data, URLs.IMAGEWOOF_320)</span></code></pre></div> <p>It was</p> -<div class="sourceCode" id="cb16"><pre -class="sourceCode python"><code class="sourceCode python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a>learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> +<div class="sourceCode" id="cb15"><pre +class="sourceCode python"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a>learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> <p>Became</p> -<div class="sourceCode" id="cb17"><pre -class="sourceCode python"><code class="sourceCode python"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> learn.distrib_ctx():</span> -<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a> learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> +<div class="sourceCode" id="cb16"><pre +class="sourceCode python"><code class="sourceCode 
python"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="cf">with</span> learn.distrib_ctx():</span> +<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a> learn.fine_tune(<span class="dv">6</span>)</span></code></pre></div> </section> <section id="submission-script-data-parallel" class="slide level2"> <h2>Submission script: data parallel</h2> <ul> <li class="fragment"><p>Please check the course repository: <a -href="https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm">src/distrib.slurm</a></p></li> +href="https://gitlab.jsc.fz-juelich.de/strube1/2025-03-talk-nxtaim/-/blob/main/src/distrib.slurm">src/distrib.slurm</a></p></li> <li class="fragment"><p>Main differences:</p></li> -<li class="fragment"><div class="sourceCode" id="cb18"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=48</span></span> -<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --gres=gpu:4</span></span></code></pre></div></li> +<li class="fragment"><div class="sourceCode" id="cb17"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=48</span></span> +<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --gres=gpu:4</span></span></code></pre></div></li> </ul> </section> <section id="lets-check-the-outputs" class="slide level2"> <h2>Let’s check the outputs!</h2> <h4 id="single-gpu">Single gpu:</h4> +<div class="sourceCode" id="cb18"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> +<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.249933 2.152813 0.225757 0.750573 01:11 </span> +<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> +<span id="cb18-4"><a href="#cb18-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 1.882008 1.895813 0.324510 0.832018 00:44 </span> +<span id="cb18-5"><a href="#cb18-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span> 1.837312 1.916380 0.374141 0.845253 00:44 </span> +<span id="cb18-6"><a href="#cb18-6" aria-hidden="true" tabindex="-1"></a><span class="ex">2</span> 1.717144 1.739026 0.378722 0.869941 00:43 </span> +<span id="cb18-7"><a href="#cb18-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span> 1.594981 1.637526 0.417664 0.891575 00:44 </span> +<span id="cb18-8"><a href="#cb18-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span> 1.460454 1.410519 0.507254 0.920336 00:44 </span> +<span id="cb18-9"><a href="#cb18-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span> 1.389946 1.304924 0.538814 0.935862 00:43 </span> +<span id="cb18-10"><a href="#cb18-10" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span> 5m44.972s</span></code></pre></div> +<h4 id="multi-gpu">Multi gpu:</h4> <div class="sourceCode" id="cb19"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" 
tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> -<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.249933 2.152813 0.225757 0.750573 01:11 </span> +<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.201540 2.799354 0.202950 0.662513 00:09 </span> <span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> -<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 1.882008 1.895813 0.324510 0.832018 00:44 </span> -<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span> 1.837312 1.916380 0.374141 0.845253 00:44 </span> -<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a><span class="ex">2</span> 1.717144 1.739026 0.378722 0.869941 00:43 </span> -<span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span> 1.594981 1.637526 0.417664 0.891575 00:44 </span> -<span id="cb19-8"><a href="#cb19-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span> 1.460454 1.410519 0.507254 0.920336 00:44 </span> -<span id="cb19-9"><a href="#cb19-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span> 1.389946 1.304924 0.538814 0.935862 00:43 </span> -<span id="cb19-10"><a href="#cb19-10" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span> 5m44.972s</span></code></pre></div> -<h4 id="multi-gpu">Multi gpu:</h4> -<div class="sourceCode" id="cb20"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> -<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.201540 2.799354 0.202950 0.662513 00:09 </span> -<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> -<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 1.951004 2.059517 0.294761 0.781282 00:08 </span> -<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span> 1.929561 1.999069 0.309512 0.792981 00:08 </span> -<span id="cb20-6"><a href="#cb20-6" aria-hidden="true" tabindex="-1"></a><span class="ex">2</span> 1.854629 1.962271 0.314344 0.840285 00:08 </span> -<span id="cb20-7"><a href="#cb20-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span> 1.754019 1.687136 0.404883 0.872330 00:08 </span> -<span id="cb20-8"><a href="#cb20-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span> 1.643759 1.499526 0.482706 0.906409 00:08 </span> -<span id="cb20-9"><a href="#cb20-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span> 1.554356 1.450976 0.502798 0.914547 00:08 </span> -<span id="cb20-10"><a href="#cb20-10" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span> 1m19.979s</span></code></pre></div> +<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 1.951004 2.059517 0.294761 0.781282 00:08 </span> +<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span> 1.929561 1.999069 0.309512 0.792981 00:08 
</span> +<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a><span class="ex">2</span> 1.854629 1.962271 0.314344 0.840285 00:08 </span> +<span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span> 1.754019 1.687136 0.404883 0.872330 00:08 </span> +<span id="cb19-8"><a href="#cb19-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span> 1.643759 1.499526 0.482706 0.906409 00:08 </span> +<span id="cb19-9"><a href="#cb19-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span> 1.554356 1.450976 0.502798 0.914547 00:08 </span> +<span id="cb19-10"><a href="#cb19-10" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span> 1m19.979s</span></code></pre></div> </section> <section id="some-insights" class="slide level2"> <h2>Some insights</h2> @@ -911,17 +947,17 @@ submission file!</li> <section id="multi-node-1" class="slide level2"> <h2>Multi-node</h2> <ul> -<li class="fragment"><div class="sourceCode" id="cb21"><pre -class="sourceCode bash"><code class="sourceCode bash"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> -<span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.242036 2.192690 0.201728 0.681148 00:10 </span> -<span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> -<span id="cb21-4"><a href="#cb21-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.035004 2.084082 0.246189 0.748984 00:05 </span> -<span id="cb21-5"><a href="#cb21-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span> 1.981432 2.054528 0.247205 0.764482 00:05 </span> -<span id="cb21-6"><a href="#cb21-6" aria-hidden="true" tabindex="-1"></a><span class="ex">2</span> 1.942930 1.918441 0.316057 0.821138 00:05 </span> -<span id="cb21-7"><a href="#cb21-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span> 1.898426 1.832725 0.370173 0.839431 00:05 </span> -<span id="cb21-8"><a href="#cb21-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span> 1.859066 1.781805 0.375508 0.858740 00:05 </span> -<span id="cb21-9"><a href="#cb21-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span> 1.820968 1.743448 0.394055 0.864583 00:05</span> -<span id="cb21-10"><a href="#cb21-10" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span> 1m15.651s </span></code></pre></div></li> +<li class="fragment"><div class="sourceCode" id="cb20"><pre +class="sourceCode bash"><code class="sourceCode bash"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> +<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.242036 2.192690 0.201728 0.681148 00:10 </span> +<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span> train_loss valid_loss accuracy top_k_accuracy time </span> +<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span> 2.035004 2.084082 0.246189 0.748984 00:05 </span> +<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span> 1.981432 2.054528 0.247205 0.764482 00:05 </span> +<span id="cb20-6"><a href="#cb20-6" aria-hidden="true" 
tabindex="-1"></a><span class="ex">2</span> 1.942930 1.918441 0.316057 0.821138 00:05 </span> +<span id="cb20-7"><a href="#cb20-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span> 1.898426 1.832725 0.370173 0.839431 00:05 </span> +<span id="cb20-8"><a href="#cb20-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span> 1.859066 1.781805 0.375508 0.858740 00:05 </span> +<span id="cb20-9"><a href="#cb20-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span> 1.820968 1.743448 0.394055 0.864583 00:05</span> +<span id="cb20-10"><a href="#cb20-10" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span> 1m15.651s </span></code></pre></div></li> </ul> </section> <section id="some-insights-1" class="slide level2"> diff --git a/public/pics/sabrina.jpg b/public/pics/sabrina.jpg new file mode 100644 index 0000000000000000000000000000000000000000..6abef7c6fb045617c0f006f4f21c7d327873a283 Binary files /dev/null and b/public/pics/sabrina.jpg differ diff --git a/src/distrib.slurm b/src/distrib.slurm index 6ffdae464ff552ddc87c4a508d98ca01940bee07..9ef2f06da01fd4b01c4680a21c300b9b800cae49 100644 --- a/src/distrib.slurm +++ b/src/distrib.slurm @@ -1,5 +1,5 @@ #!/bin/bash -#SBATCH --account=training2436 +#SBATCH --account=SOME_ACCOUNT #SBATCH --nodes=1 #SBATCH --job-name=ai-multi-gpu #SBATCH --ntasks-per-node=1 @@ -7,7 +7,7 @@ #SBATCH --output=out-distrib.%j #SBATCH --error=err-distrib.%j #SBATCH --time=00:20:00 -#SBATCH --partition=dc-gpu +#SBATCH --partition=dc-gpu # on JURECA #SBATCH --gres=gpu:4 # Without this, srun does not inherit cpus-per-task from sbatch. @@ -23,7 +23,7 @@ export MASTER_PORT=7010 export GPUS_PER_NODE=4 # Make sure we are on the right directory -cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src +cd $HOME/2025-03-talk-nxtaim/src # This loads modules and python packages source sc_venv_template/activate.sh diff --git a/src/serial.slurm b/src/serial.slurm index 53ed5379db7c51cd2efdb3afdc4480cdcef726d4..0400182b5587d925bbdf8711eb67049bd05ac421 100644 --- a/src/serial.slurm +++ b/src/serial.slurm @@ -1,5 +1,5 @@ #!/bin/bash -#SBATCH --account=training2436 +#SBATCH --account=SOME_ACCOUNT #SBATCH --nodes=1 #SBATCH --job-name=ai-serial #SBATCH --ntasks-per-node=1 @@ -7,11 +7,11 @@ #SBATCH --output=out-serial.%j #SBATCH --error=err-serial.%j #SBATCH --time=00:40:00 -#SBATCH --partition=dc-gpu +#SBATCH --partition=dc-gpu # on JURECA #SBATCH --gres=gpu:1 # Make sure we are on the right directory -cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src +cd $HOME/2025-03-talk-nxtaim/src # This loads modules and python packages source sc_venv_template/activate.sh