6 files changed: +58 −49

README.md +1 −2
@@ -3,8 +3,7 @@
 author: Alexandre Strube
 title: Getting Started with AI on Supercomputers
 ---
-The older repo was https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/dl_on_supercomputers
-This repo is specifically for the course described in [indico](https://indico3-jsc.fz-juelich.de/event/107)
+This repo is specifically for the course described in [indico](https://indico3-jsc.fz-juelich.de/event/201)
 ---

index.md +23 −17
@@ -2,12 +2,12 @@
 author: Alexandre Strube
 title: Deep Learning on Supercomputers
 # subtitle: A primer in supercomputers`
-date: May 23, 2024
+date: November 13, 2024
 ---
 ## Resources:
-- [This document](https://strube1.pages.jsc.fz-juelich.de/2024-05-talk-intro-to-supercompting-jsc)
-- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc)
+- [This document](https://strube1.pages.jsc.fz-juelich.de/2024-11-talk-intro-to-supercompting-jsc)
+- [Source code of this course](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc)
 ![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
@@ -37,7 +37,7 @@ date: May 23, 2024
 - on a mult-gpu, multi-node system
 - like a supercomputer 🤯
 - Important: This is an overview, _*NOT*_ a basic AI course!
-- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-2)
+- We have [introductory courses on AI on supercomputers](https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-4)
 - ![](images/bringing-dl-workloads-2024-2-course.png) ![](images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg)
@@ -47,7 +47,7 @@ date: May 23, 2024
 ### Please access it now, so you can follow along:
-[https://go.fzj.de/2024-05-talk-intro-to-supercomputing-jsc](https://go.fzj.de/2024-05-talk-intro-to-supercomputing-jsc)
+[https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc](https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc)
 ![](images/slides.png)
@@ -58,7 +58,7 @@ date: May 23, 2024
 - All slides and source code
 - Connect to the supercomputer and do this:
 - ```bash
-git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc.git
+git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
 ```
 ---
@@ -397,7 +397,7 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-superco
 ![](images/slides.png)
 - ```bash
-git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc.git
+git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
 ```
@@ -518,13 +518,19 @@ learn.fine_tune(6)
 - Only add new requirements
 - [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)
 - ```bash
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
 git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
 ```
 - Add this to sc_venv_template/requirements.txt:
 - ```python
-fastai
-deepspeed
+# Add here the pip packages you would like to install on this virtual environment / kernel
+pip
+fastai==2.7.15
+scipy==1.11.1
+matplotlib==3.7.2
+scikit-learn==1.3.1
+pandas==2.0.3
+torch==2.1.2
 accelerate
 ```
@@ -541,7 +547,7 @@ source sc_venv_template/activate.sh
 ```bash
 #!/bin/bash
-#SBATCH --account=training2410
+#SBATCH --account=training2436
 #SBATCH --nodes=1
 #SBATCH --job-name=ai-serial
 #SBATCH --ntasks-per-node=1
@@ -549,10 +555,10 @@ source sc_venv_template/activate.sh
 #SBATCH --output=out-serial.%j
 #SBATCH --error=err-serial.%j
 #SBATCH --time=00:40:00
-#SBATCH --partition=develbooster
+#SBATCH --partition=dc-gpu
 # Make sure we are on the right directory
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
 # This loads modules and python packages
 source sc_venv_template/activate.sh
@@ -573,7 +579,7 @@ time srun python serial.py
 ## Download dataset
 ```bash
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
 source sc_venv_template/activate.sh
 python serial.py
@@ -593,7 +599,7 @@ Epoch 1/1 : |-------------------------------------------------------------| 0.71
 ## Running it
 - ```bash
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
 sbatch serial.slurm
 ```
 - On Juwels Booster, should take about 5 minutes
@@ -663,7 +669,7 @@ with learn.distrib_ctx():
 ## Submission script: data parallel
-- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm)
+- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm)
 - Main differences:
@@ -767,6 +773,6 @@ real 1m15.651s
 ## References
-- [Pytorch Model Parallelism and Pipelining](https://pytorch.org/docs/stable/pipeline.html)
+- [Pytorch Model Parallelism and Pipelining](https://pytorch.org/docs/stable/distributed.pipelining.html)
 - [Intro to Distributed Deep Learning](https://xiandong79.github.io/Intro-Distributed-Deep-Learning)
 - [Model Parallelism - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html)
\ No newline at end of file

public/README.html +2 −4
@@ -166,10 +166,8 @@
 <section class="slide level2">
-<p>The older repo was
-https://gitlab.jsc.fz-juelich.de/MLDL_FZJ/juhaicu/jsc_public/sharedspace/teaching/dl_on_supercomputers
-This repo is specifically for the course described in <a
-href="https://indico3-jsc.fz-juelich.de/event/107">indico</a></p>
+<p>This repo is specifically for the course described in <a
+href="https://indico3-jsc.fz-juelich.de/event/201">indico</a></p>
 </section>
 <section class="slide level2">

public/index.html +26 −20
@@ -4,7 +4,7 @@
 <meta charset="utf-8">
 <meta name="generator" content="pandoc">
 <meta name="author" content="Alexandre Strube">
-<meta name="dcterms.date" content="2024-05-23">
+<meta name="dcterms.date" content="2024-11-13">
 <title>Deep Learning on Supercomputers</title>
 <meta name="apple-mobile-web-app-capable" content="yes">
 <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
@@ -227,17 +227,17 @@
 <section id="title-slide">
 <h1 class="title">Deep Learning on Supercomputers</h1>
 <p class="author">Alexandre Strube</p>
-<p class="date">May 23, 2024</p>
+<p class="date">November 13, 2024</p>
 </section>
 <section id="resources" class="slide level2">
 <h2>Resources:</h2>
 <ul>
 <li class="fragment"><a
-href="https://strube1.pages.jsc.fz-juelich.de/2024-05-talk-intro-to-supercompting-jsc">This
+href="https://strube1.pages.jsc.fz-juelich.de/2024-11-talk-intro-to-supercompting-jsc">This
 document</a></li>
 <li class="fragment"><a
-href="https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc">Source
+href="https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc">Source
 code of this course</a></li>
 </ul>
 <p><img
@@ -279,7 +279,7 @@
data-src="images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg" /></p>
 <li class="fragment">Important: This is an overview,
 <em><em>NOT</em></em> a basic AI course!</li>
 <li class="fragment">We have <a
-href="https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-2">introductory
+href="https://www.fz-juelich.de/en/ias/jsc/news/events/training-courses/2024/ai-sc-4">introductory
 courses on AI on supercomputers</a></li>
 <li class="fragment"><img
 data-src="images/bringing-dl-workloads-2024-2-course.png" /> <img
@@ -291,7 +291,7 @@ data-src="images/Logo_FZ_Juelich_rgb_Schutzzone_transparent.svg" /></li>
 <h3 id="please-access-it-now-so-you-can-follow-along">Please access it
 now, so you can follow along:</h3>
 <p><a
-href="https://go.fzj.de/2024-05-talk-intro-to-supercomputing-jsc">https://go.fzj.de/2024-05-talk-intro-to-supercomputing-jsc</a></p>
+href="https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc">https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc</a></p>
 <p><img data-src="images/slides.png" /></p>
 </section>
 <section id="git-clone-this-repository" class="slide level2">
@@ -300,7 +300,7 @@ href="https://go.fzj.de/2024-05-talk-intro-to-supercomputing-jsc">https://go.fzj
 <li class="fragment">All slides and source code</li>
 <li class="fragment">Connect to the supercomputer and do this:</li>
 <li class="fragment"><div class="sourceCode" id="cb1"><pre
-class="sourceCode bash"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc.git</span></code></pre></div></li>
+class="sourceCode bash"><code class="sourceCode bash"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git</span></code></pre></div></li>
 </ul>
 </section>
 <section id="deep-learning-is" class="slide level2">
@@ -611,7 +611,7 @@ repository:</h2>
 <p><img data-src="images/slides.png" /></p>
 <ul>
 <li class="fragment"><div class="sourceCode" id="cb2"><pre
-class="sourceCode bash"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc.git</span></code></pre></div></li>
+class="sourceCode bash"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git</span></code></pre></div></li>
 </ul>
 </section>
 <section id="demo-time" class="slide level2">
@@ -719,13 +719,19 @@ modules</li>
 href="https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template">Link
 to gitlab repo</a></li>
 <li class="fragment"><div class="sourceCode" id="cb7"><pre
-class="sourceCode bash"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-05-talk-intro-to-supercompting-jsc/src</span>
+class="sourceCode bash"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span>
 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git</span></code></pre></div></li>
 <li class="fragment">Add this to sc_venv_template/requirements.txt:</li>
 <li class="fragment"><div class="sourceCode" id="cb8"><pre
-class="sourceCode python"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a>fastai</span>
-<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>deepspeed</span>
-<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>accelerate</span></code></pre></div></li>
+class="sourceCode python"><code class="sourceCode python"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Add here the pip packages you would like to install on this virtual environment / kernel</span></span>
+<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>pip</span>
+<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>fastai<span class="op">==</span><span class="fl">2.7.15</span></span>
+<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>scipy<span class="op">==</span><span class="fl">1.11.1</span></span>
+<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a>matplotlib<span class="op">==</span><span class="fl">3.7.2</span></span>
+<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a>scikit<span class="op">-</span>learn<span class="op">==</span><span class="fl">1.3.1</span></span>
+<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a>pandas<span class="op">==</span><span class="fl">2.0.3</span></span>
+<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a>torch<span class="op">==</span><span class="fl">2.1.2</span></span>
+<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a>accelerate</span></code></pre></div></li>
 <li class="fragment"><div class="sourceCode" id="cb9"><pre
 class="sourceCode bash"><code class="sourceCode bash"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="ex">sc_venv_template/setup.sh</span></span>
 <span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span></code></pre></div></li>
@@ -736,7 +742,7 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb9-1"><a href="
 <h2>Submission Script</h2>
 <div class="sourceCode" id="cb10"><pre
 class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co">#!/bin/bash</span></span>
-<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --account=training2410</span></span>
+<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --account=training2436</span></span>
 <span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --nodes=1</span></span>
 <span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --job-name=ai-serial</span></span>
 <span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --ntasks-per-node=1</span></span>
@@ -744,10 +750,10 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href=
 <span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --output=out-serial.%j</span></span>
 <span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --error=err-serial.%j</span></span>
 <span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --time=00:40:00</span></span>
-<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --partition=develbooster</span></span>
+<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --partition=dc-gpu</span></span>
 <span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a><span class="co"># Make sure we are on the right directory</span></span>
-<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-05-talk-intro-to-supercompting-jsc/src</span>
+<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span>
 <span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a><span class="co"># This loads modules and python packages</span></span>
 <span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span>
@@ -765,7 +771,7 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href=
 <section id="download-dataset-1" class="slide level2">
 <h2>Download dataset</h2>
 <div class="sourceCode" id="cb11"><pre
-class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-05-talk-intro-to-supercompting-jsc/src</span>
+class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span>
 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span>
 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> serial.py</span>
 <span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a></span>
@@ -783,7 +789,7 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href=
 <h2>Running it</h2>
 <ul>
 <li class="fragment"><div class="sourceCode" id="cb12"><pre
-class="sourceCode bash"><code class="sourceCode bash"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-05-talk-intro-to-supercompting-jsc/src</span>
+class="sourceCode bash"><code class="sourceCode bash"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span>
 <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="ex">sbatch</span> serial.slurm</span></code></pre></div></li>
 <li class="fragment">On Juwels Booster, should take about 5 minutes</li>
 <li class="fragment">On a cpu system this would take half a day</li>
@@ -839,7 +845,7 @@ class="sourceCode python"><code class="sourceCode python"><span id="cb17-1"><a h
 <h2>Submission script: data parallel</h2>
 <ul>
 <li class="fragment"><p>Please check the course repository: <a
-href="https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm">src/distrib.slurm</a></p></li>
+href="https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm">src/distrib.slurm</a></p></li>
 <li class="fragment"><p>Main differences:</p></li>
 <li class="fragment"><div class="sourceCode" id="cb18"><pre
 class="sourceCode bash"><code class="sourceCode bash"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=48</span></span>
@@ -951,8 +957,8 @@ DL scaling course at JSC!</li>
 <h2>References</h2>
 <ul>
 <li class="fragment"><a
-href="https://pytorch.org/docs/stable/pipeline.html">Pytorch Model
-Parallelism and Pipelining</a></li>
+href="https://pytorch.org/docs/stable/distributed.pipelining.html">Pytorch
+Model Parallelism and Pipelining</a></li>
 <li class="fragment"><a
 href="https://xiandong79.github.io/Intro-Distributed-Deep-Learning">Intro
 to Distributed Deep Learning</a></li>

src/distrib.slurm +3 −3
 #!/bin/bash
-#SBATCH --account=training2410
+#SBATCH --account=training2436
 #SBATCH --nodes=1
 #SBATCH --job-name=ai-multi-gpu
 #SBATCH --ntasks-per-node=1
@@ -7,7 +7,7 @@
 #SBATCH --output=out-distrib.%j
 #SBATCH --error=err-distrib.%j
 #SBATCH --time=00:20:00
-#SBATCH --partition=booster
+#SBATCH --partition=dc-gpu
 #SBATCH --gres=gpu:4
 # Without this, srun does not inherit cpus-per-task from sbatch.
@@ -23,7 +23,7 @@ export MASTER_PORT=7010
 export GPUS_PER_NODE=4
 # Make sure we are on the right directory
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
 # This loads modules and python packages
 source sc_venv_template/activate.sh

src/serial.slurm +3 −3
 #!/bin/bash
-#SBATCH --account=training2410
+#SBATCH --account=training2436
 #SBATCH --nodes=1
 #SBATCH --job-name=ai-serial
 #SBATCH --ntasks-per-node=1
@@ -7,11 +7,11 @@
 #SBATCH --output=out-serial.%j
 #SBATCH --error=err-serial.%j
 #SBATCH --time=00:40:00
-#SBATCH --partition=develbooster
+#SBATCH --partition=dc-gpu
 #SBATCH --gres=gpu:1
 # Make sure we are on the right directory
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
 # This loads modules and python packages
 source sc_venv_template/activate.sh
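
Review note: most of this change is a mechanical substitution (course slug `2024-05` to `2024-11`, account `training2410` to `training2436`, partition `booster`/`develbooster` to `dc-gpu`). The sketch below shows one way such a rename could be applied and checked. It is an illustration only, not necessarily how the author produced the diff: the scratch directory and the three-line sample file are hypothetical stand-ins for `src/serial.slurm`, and only standard `sed`, `grep`, `mktemp` and `mv` are assumed.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical scratch copy, so that no real repository is modified.
workdir=$(mktemp -d)
cat > "$workdir/serial.slurm" <<'EOF'
#SBATCH --account=training2410
#SBATCH --partition=develbooster
cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
EOF

# Apply the substitutions the diff makes; write to a new file and move it
# into place instead of relying on the non-portable `sed -i`.
sed \
    -e 's/2024-05-talk/2024-11-talk/g' \
    -e 's/--account=training2410/--account=training2436/' \
    -e 's/--partition=develbooster/--partition=dc-gpu/' \
    -e 's/--partition=booster/--partition=dc-gpu/' \
    "$workdir/serial.slurm" > "$workdir/serial.slurm.new"
mv "$workdir/serial.slurm.new" "$workdir/serial.slurm"

# Show the result and fail if any reference to the May course is left.
cat "$workdir/serial.slurm"
if grep -q '2024-05' "$workdir/serial.slurm"; then
    echo "stale 2024-05 reference found" >&2
    exit 1
fi
```

A final `grep` for the old slug, as in the last step, is a cheap guard against a missed occurrence in a change that touches this many near-identical lines.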
class="bu">source</span> sc_venv_template/activate.sh</span> Loading @@ -765,7 +771,7 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb10-1"><a href= <section id="download-dataset-1" class="slide level2"> <h2>Download dataset</h2> <div class="sourceCode" id="cb11"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-05-talk-intro-to-supercompting-jsc/src</span> class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span> <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="bu">source</span> sc_venv_template/activate.sh</span> <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="ex">python</span> serial.py</span> <span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a></span> Loading @@ -783,7 +789,7 @@ class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href= <h2>Running it</h2> <ul> <li class="fragment"><div class="sourceCode" id="cb12"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-05-talk-intro-to-supercompting-jsc/src</span> class="sourceCode bash"><code class="sourceCode bash"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="bu">cd</span> <span class="va">$HOME</span>/2024-11-talk-intro-to-supercompting-jsc/src</span> <span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="ex">sbatch</span> serial.slurm</span></code></pre></div></li> <li class="fragment">On Juwels Booster, should take about 5 minutes</li> <li 
class="fragment">On a cpu system this would take half a day</li> Loading Loading @@ -839,7 +845,7 @@ class="sourceCode python"><code class="sourceCode python"><span id="cb17-1"><a h <h2>Submission script: data parallel</h2> <ul> <li class="fragment"><p>Please check the course repository: <a href="https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm">src/distrib.slurm</a></p></li> href="https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm">src/distrib.slurm</a></p></li> <li class="fragment"><p>Main differences:</p></li> <li class="fragment"><div class="sourceCode" id="cb18"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="co">#SBATCH --cpus-per-task=48</span></span> Loading Loading @@ -951,8 +957,8 @@ DL scaling course at JSC!</li> <h2>References</h2> <ul> <li class="fragment"><a href="https://pytorch.org/docs/stable/pipeline.html">Pytorch Model Parallelism and Pipelining</a></li> href="https://pytorch.org/docs/stable/distributed.pipelining.html">Pytorch Model Parallelism and Pipelining</a></li> <li class="fragment"><a href="https://xiandong79.github.io/Intro-Distributed-Deep-Learning">Intro to Distributed Deep Learning</a></li> Loading
src/distrib.slurm +3 −3
 #!/bin/bash
-#SBATCH --account=training2410
+#SBATCH --account=training2436
 #SBATCH --nodes=1
 #SBATCH --job-name=ai-multi-gpu
 #SBATCH --ntasks-per-node=1
@@ -7,7 +7,7 @@
 #SBATCH --output=out-distrib.%j
 #SBATCH --error=err-distrib.%j
 #SBATCH --time=00:20:00
-#SBATCH --partition=booster
+#SBATCH --partition=dc-gpu
 #SBATCH --gres=gpu:4

 # Without this, srun does not inherit cpus-per-task from sbatch.
@@ -23,7 +23,7 @@ export MASTER_PORT=7010
 export GPUS_PER_NODE=4

 # Make sure we are on the right directory
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src

 # This loads modules and python packages
 source sc_venv_template/activate.sh
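The context comment in src/distrib.slurm, `# Without this, srun does not inherit cpus-per-task from sbatch.`, refers to a line that this diff does not display. The snippet below is only a sketch of the usual pattern, assuming the hidden line relies on Slurm's documented `SRUN_CPUS_PER_TASK` environment variable; it is not copied from the file, and the fallback value of 1 is added here so that the snippet also runs outside a job.

```bash
#!/bin/bash
# Assumption: illustrative only, not taken from src/distrib.slurm.
# Inside a job, sbatch sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# srun reads SRUN_CPUS_PER_TASK as its default for --cpus-per-task.
# The ":-1" fallback is hypothetical and only keeps this runnable off-cluster.
export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK:-1}"
echo "srun cpus-per-task: ${SRUN_CPUS_PER_TASK}"
```

With `#SBATCH --cpus-per-task=48`, as mentioned in the slides, a subsequent `srun` in the same script would then presumably pick up 48 CPUs per task instead of its own default.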
src/serial.slurm +3 −3
 #!/bin/bash
-#SBATCH --account=training2410
+#SBATCH --account=training2436
 #SBATCH --nodes=1
 #SBATCH --job-name=ai-serial
 #SBATCH --ntasks-per-node=1
@@ -7,11 +7,11 @@
 #SBATCH --output=out-serial.%j
 #SBATCH --error=err-serial.%j
 #SBATCH --time=00:40:00
-#SBATCH --partition=develbooster
+#SBATCH --partition=dc-gpu
 #SBATCH --gres=gpu:1

 # Make sure we are on the right directory
-cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
+cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src

 # This loads modules and python packages
 source sc_venv_template/activate.sh
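Most hunks in this change are the same mechanical substitution of the course slug (`2024-05-talk` to `2024-11-talk`). A minimal sketch of how such a bump could be scripted follows; `bump_slug` is a hypothetical helper that does not exist in the repository, and it deliberately leaves the separate `--account` and `--partition` changes to be edited by hand.

```bash
#!/bin/bash
# Hypothetical helper (not part of the course repository): replace every
# occurrence of an old course slug with a new one in the given files.
# Assumes the slugs contain no "/" and no regex metacharacters.
bump_slug() {
    local old="$1" new="$2"
    shift 2
    # -i.bak edits in place and keeps a backup copy of each file;
    # this spelling is accepted by both GNU sed and BSD sed.
    sed -i.bak "s/${old}/${new}/g" "$@"
}
```

A possible invocation from the repository root would be `bump_slug 2024-05-talk 2024-11-talk README.md index.md src/serial.slurm src/distrib.slurm`, followed by a review of the result with `git diff` before committing.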