Commit 730f918c authored by Alexandre Strube

Final fixes to the code

parent a2aff205
Pipeline #165855 passed
@@ -394,10 +394,16 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompti
---
## If you haven't done so, please access the slides to clone the repository:
![](images/slides.png)
- ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompting-jsc.git
```
---
## DEMO TIME!
@@ -434,13 +440,13 @@ from fastai.vision.models.xresnet import *
---
## Bringing your data in*
```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320)
@@ -457,7 +463,6 @@ path = untar_data(URLs.IMAGEWOOF_320)
```
---
## Loading your data
@@ -515,17 +520,21 @@ learn.fine_tune(6)
- Only add new requirements
- [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)
- ```bash
cd $HOME/2023-nov-intro-to-supercompting-jsc/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```
- Add this to sc_venv_template/requirements.txt:
- ```python
fastai
deepspeed
accelerate
```
- ```bash
sc_venv_template/setup.sh
source sc_venv_template/activate.sh
```
- Done! You installed everything you need
---
@@ -533,7 +542,7 @@ accelerate
## Submission Script
```bash
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
@@ -556,6 +565,33 @@ time srun python serial.py
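The hunk above cuts the script off before its end. Assembling the visible pieces into one complete serial batch script, a sketch could look like this; note that the output/error file names, time limit, partition, and GPU request are assumptions for illustration, not taken from the diff:

```shell
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --output=out-serial-%j   # assumed: matches the out-serial-XXXXXX files checked later
#SBATCH --error=err-serial-%j    # assumed
#SBATCH --time=00:20:00          # assumed time limit
#SBATCH --partition=booster      # assumed partition on JUWELS Booster
#SBATCH --gres=gpu:1             # assumed: one GPU for the serial run

cd $HOME/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh   # environment built in the earlier step
time srun python serial.py
```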
---
## Download dataset
- Compute nodes have no internet
- We need to download the dataset
---
## Download dataset
```bash
cd $HOME/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py
(Some warnings)
epoch train_loss valid_loss accuracy top_k_accuracy time
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
```
- It started training on the login node's CPUs (WRONG!!!)
- That means we have the data!
- Just cancel it with Ctrl+C
---
## Running it
- ```bash
@@ -564,12 +600,13 @@ sbatch serial.slurm
```
- On Juwels Booster, should take about 5 minutes
- On a CPU system this would take half a day
- Check the out-serial-XXXXXX and err-serial-XXXXXX files
---
## Going data parallel
- Almost the same code as before; let's look at the differences
--- ---
@@ -630,6 +667,13 @@ with learn.distrib_ctx():
- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm)
- Main differences:
- ```bash
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
```
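Combining those two `#SBATCH` differences with the serial script, a data-parallel submission script might look roughly like this. Only `--cpus-per-task` and `--gres` come from the slide; the job name, the task count, and the Python file name are assumptions mirroring the serial case (the real script is `src/distrib.slurm` in the repo):

```shell
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-distrib    # assumed name
#SBATCH --ntasks-per-node=4      # assumed: one task per GPU for data parallelism
#SBATCH --cpus-per-task=48       # from the slide: 48 CPU threads per task
#SBATCH --gres=gpu:4             # from the slide: all four GPUs of the node

cd $HOME/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
time srun python distrib.py      # assumed file name
```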
---
## Let's check the outputs!
@@ -670,6 +714,7 @@ real 1m19.979s
- Distributed run suffered a bit on the accuracy 🎯 and loss 😩
- In exchange for speed 🏎️
- Train a bit longer and you're good!
- It's more than 4x faster because the library is multi-threaded (and now we use 48 threads)
- I/O is automatically parallelized / sharded by the Fast.AI library
- Data parallel is a simple and effective way to distribute DL workload 💪
@@ -708,6 +753,7 @@ real 1m15.651s
- Accuracy and loss suffered
- This is a very simple model, so it's not surprising
- It fits into 4 GB; we "stretched" it to a 320 GB system
- It's not a good fit for this system
- You need bigger models to really exercise the GPU and scaling
- There's a lot more to it, but for now, let's focus on medium/big sized models
- For Gigantic and Humongous-sized models, there's a DL scaling course at JSC!
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial