Commit 730f918c authored by Alexandre Strube

Final fixes to the code

parent a2aff205
Pipeline #165855 passed
@@ -394,10 +394,16 @@ git clone https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompti
---
## If you haven't done so, please access the slides to clone the repository:
![](images/slides.png)
- ```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompting-jsc.git
```
---
## DEMO TIME!
@@ -434,13 +440,13 @@ from fastai.vision.models.xresnet import *
---
## Bringing your data in*
```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320)
@@ -457,7 +463,6 @@ path = untar_data(URLs.IMAGEWOOF_320)
```
---
## Loading your data
@@ -515,17 +520,21 @@ learn.fine_tune(6)
- Only add new requirements
- [Link to gitlab repo](https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template)
- ```bash
cd $HOME/2023-nov-intro-to-supercompting-jsc/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```
- Add this to sc_venv_template/requirements.txt:
- ```python
fastai
deepspeed
accelerate
```
- ```bash
sc_venv_template/setup.sh
source sc_venv_template/activate.sh
```
- Done! You installed everything you need
---
@@ -533,7 +542,7 @@ accelerate
## Submission Script
```bash
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
@@ -556,6 +565,33 @@ time srun python serial.py
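The hunk above cuts the script off before its end. Assembling the visible pieces into one complete serial batch script, a sketch could look like this; note that the output/error file names, time limit, partition, and GPU request are assumptions for illustration, not taken from the diff:

```shell
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --output=out-serial-%j   # assumed: matches the out-serial-XXXXXX files checked later
#SBATCH --error=err-serial-%j    # assumed
#SBATCH --time=00:20:00          # assumed time limit
#SBATCH --partition=booster      # assumed partition on JUWELS Booster
#SBATCH --gres=gpu:1             # assumed: one GPU for the serial run

cd $HOME/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh   # environment built in the earlier step
time srun python serial.py
```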
---
## Download dataset
- Compute nodes have no internet
- We need to download the dataset
---
## Download dataset
```bash
cd $HOME/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py
(Some warnings)
epoch train_loss valid_loss accuracy top_k_accuracy time
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
```
- It started training on the login node's CPUs (WRONG!!!)
- That means we have the data!
- Just cancel it with Ctrl+C
---
## Running it
- ```bash
@@ -564,12 +600,13 @@ sbatch serial.slurm
```
- On Juwels Booster, should take about 5 minutes
- On a CPU system this would take half a day
- Check the out-serial-XXXXXX and err-serial-XXXXXX files
---
## Going data parallel
- Almost the same code as before; let's look at the differences
--- ---
@@ -630,6 +667,13 @@ with learn.distrib_ctx():
- Please check the course repository: [src/distrib.slurm](https://gitlab.jsc.fz-juelich.de/strube1/2023-nov-intro-to-supercompting-jsc/-/blob/main/src/distrib.slurm)
- Main differences:
- ```bash
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
```
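Combining those two `#SBATCH` differences with the serial script, a data-parallel submission script might look roughly like this. Only `--cpus-per-task` and `--gres` come from the slide; the job name, the task count, and the Python file name are assumptions mirroring the serial case (the real script is `src/distrib.slurm` in the repo):

```shell
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-distrib    # assumed name
#SBATCH --ntasks-per-node=4      # assumed: one task per GPU for data parallelism
#SBATCH --cpus-per-task=48       # from the slide: 48 CPU threads per task
#SBATCH --gres=gpu:4             # from the slide: all four GPUs of the node

cd $HOME/2023-nov-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
time srun python distrib.py      # assumed file name
```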
---
## Let's check the outputs!
@@ -670,6 +714,7 @@ real 1m19.979s
- Distributed run suffered a bit on the accuracy 🎯 and loss 😩
- In exchange for speed 🏎️
- Train a bit longer and you're good!
- It's more than 4x faster because the library is multi-threaded (and now we use 48 threads)
- I/O is automatically parallelized / sharded by the Fast.AI library
- Data parallel is a simple and effective way to distribute DL workload 💪
@@ -708,6 +753,7 @@ real 1m15.651s
- Accuracy and loss suffered
- This is a very simple model, so it's not surprising
- It fits into 4 GB; we "stretched" it to a 320 GB system
- It's not a good fit for this system
- You need bigger models to really exercise the GPU and scaling
- There's a lot more to it, but for now, let's focus on medium/big sized models
- For Gigantic and Humongous-sized models, there's a DL scaling course at JSC!
#!/bin/bash
#SBATCH --account=training2334
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial