---
author: Alexandre Strube
title: Deep Learning on Supercomputers
# subtitle: A primer in supercomputers
date: November 13, 2024
---

## Resources

---

## Team

::: {.container}
:::: {.col}
Alexandre Strube
::::
:::: {.col}
Ilya Zhukov
::::
:::: {.col}
Jolanta Zjupa
::::
:::

---

## Goal for this talk

---

## Slides on your own computer

Please access it now, so you can follow along:

https://go.fzj.de/2024-11-talk-intro-to-supercomputing-jsc

---

## Git clone this repository

- All slides and source code
- Connect to the supercomputer and do this:

```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
```

    
    ---
    
    ## Deep learning is...
    
    ![](images/matrix-multiplication-meme.jpg)
    
    ---
    
    ## Matrix Multiplication Recap
    
    ![](images/matrix-multiplication.svg)
    
    ---
    
    ## Parallelism in a GPU
    
    - Operations on each element of the matrix are independent
    - Ergo, each element can be computed in parallel
    - Extremely good to apply the same operation to a lot of data
    - ![](images/gpu-parallel.gif)
    - Parallelization is done "for free" by the ML toolkits
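
As a minimal sketch in plain Python (no GPU needed): every element of a matrix product is an independent dot product, which is exactly why a GPU can compute them all at once:

```python
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

def dot(row, col):
    # One output element = one dot product, independent of all the others
    return sum(a * b for a, b in zip(row, col))

# Each C[i][j] reads only row i of A and column j of B, so on a GPU
# all four of these could be computed by four threads in parallel.
cols = list(zip(*B))
C = [[dot(row, col) for col in cols] for row in A]
print(C)  # [[19, 22], [43, 50]]
```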
    
    ---
    
    ## Are we there yet?
    
    ![](images/are-we-there-yet.gif)
    
    ---
    
    ## Not quite
    
    - Compute in one GPU is parallel, yes
    - ![](images/gpu0.svg)
    
    ---
    
    ## But what about many GPUs?
    
    - It's when things get interesting
    
    ---
    
    ## Data Parallel
    
    ![](images/data-parallel.svg)
    
    ---
    
    ## Data Parallel
    
    ![](images/data-parallel-multiple-data.svg)
    
    ---
    
    ## Data Parallel - Averaging
    
    ![](images/data-parallel-averaging.svg)
    
    ---
    
    ## Data Parallel
    
    ### There are other approaches too, e.g.
    
    - For the sake of completeness:
        - Asynchronous Stochastic Gradient Descent
            - Don't average the parameters, but send the updates (gradients post learning rate and momentum) asynchronously
            - Advantageous for slow networks
            - Problem: stale gradient (things might change while calculating gradients)
            - The more nodes, the worse it gets
            - Won't talk about it anymore
    
    ---
    
    ## Data Parallel
    
    ### There are other approaches too!
    
    - Decentralized Asynchronous Stochastic Gradient Descent
        - Updates are peer-to-peer
        - The updates are heavily compressed and quantized
        - Disadvantage: extra computation per minibatch, more memory needed
    
    - WE DON'T NEED THOSE
    
    ---
    
    ## That's it for data parallel!
    
    - Use different data for each GPU
    - Everything else is the same
    - Average the gradients after each step
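
A toy sketch of that averaging step in plain Python (made-up gradient numbers; real toolkits do this with an all-reduce across GPUs):

```python
# Hypothetical gradients computed by 4 GPUs, each on its own shard of the data
grads_per_gpu = [
    [0.10, -0.20, 0.30],   # GPU 0
    [0.20, -0.10, 0.10],   # GPU 1
    [0.00, -0.30, 0.20],   # GPU 2
    [0.10,  0.20, 0.40],   # GPU 3
]

# The "all-reduce": every GPU ends up with the same averaged gradient
n = len(grads_per_gpu)
avg_grad = [sum(g[i] for g in grads_per_gpu) / n
            for i in range(len(grads_per_gpu[0]))]
print(avg_grad)  # approximately [0.1, -0.1, 0.25]
```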
    
    ---
    
    ## Well, almost...
    
    ---
    
    ## There are more levels!
    
    ![](images/lets-go-deeper.jpg)
    
    --- 
    
    ## Data Parallel - Multi Node
    
    ![](images/data-parallel-multi-node.svg)
    
    ---
    
    ## Data Parallel - Multi Node
    
    ![](images/data-parallel-multi-node-averaging.svg)
    
    ---
    
    ## Before we go further...
    
    - Data parallel is usually good enough 👌
    - If you need more than this, you should be giving this course, not me 🤷‍♂️
    
    ---
    
    ## Are we there yet?
    
    ![](images/are-we-there-yet-2.gif)
    
    ---
    
    ## Model Parallel
    
    - Model *itself* is too big to fit in a single GPU 🐋
    - Each GPU holds a slice of the model 🍕
    - Data moves from one GPU to the next
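
A toy sketch of the idea in plain Python (hypothetical two-stage split; on a real system each stage would sit on a different GPU and the activation would be copied between them):

```python
# Hypothetical model cut into two stages: stage 0 on "GPU 0", stage 1 on "GPU 1"
def stage0(x):
    # First slice of the layers
    return x * 2

def stage1(h):
    # Second slice of the layers
    return h + 1

def forward(x):
    h = stage0(x)        # would run on GPU 0
    # ... activation h is transferred to GPU 1 here ...
    return stage1(h)     # would run on GPU 1

print(forward(3))  # 7
```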
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel.svg)
    
    ---
    
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-1.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-2.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-3.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-4.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-5.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-6.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-7.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-8.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-9.svg)
    
    ---
    
    ## Model Parallel
    
    ![](images/model-parallel-pipeline-10.svg)
    
    ---
    
    ## What's the problem here? 🧐
    
    ---
    
    ## Model Parallel
    
    - Waste of resources
    - While one GPU is working, the others are waiting for the whole process to end
    - ![](images/no_pipe.png)
        - [Source: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
    
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-1.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-2-multibatch.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-3-multibatch.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-4-multibatch.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-5-multibatch.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-6-multibatch.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-7-multibatch.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-8-multibatch.svg)
    
    ---
    
    ## Model Parallel - Pipelining
    
    ![](images/model-parallel-pipeline-9-multibatch.svg)
    
    ---
    
    ## This is an oversimplification!
    
    - Actually, you split the input minibatch into multiple microbatches.
    - There's still idle time - an unavoidable "bubble" 🫧
    - ![](images/pipe.png)
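
A rough way to quantify that bubble (a simplified model assuming equal stage times, in the spirit of the GPipe analysis): with p pipeline stages and m microbatches, the idle fraction is about (p - 1) / (m + p - 1):

```python
def bubble_fraction(num_stages, num_microbatches):
    # Idle fraction of a naive pipeline, assuming equal per-stage times
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

# 4 stages, 1 microbatch (no pipelining): 75% of the time slots are idle
print(bubble_fraction(4, 1))   # 0.75
# 4 stages, 8 microbatches: the bubble shrinks to roughly 27%
print(round(bubble_fraction(4, 8), 2))   # 0.27
```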
    
    ---
    
    ## Are we there yet?
    
    ![](images/are-we-there-yet-3.gif)
    
    ---
    
    ## Model Parallel - Multi Node
    
    - In this case, each node does the same as the others. 
    - At each step, they all synchronize their weights.
    
    ---
    
    ## Model Parallel - Multi Node
    
    ![](images/model-parallel-multi-node.svg)
    
    ---
    
    ## Model Parallel - going bigger
    
    - You can also have layers spread over multiple GPUs
    - One can even pipeline among nodes...
    
    ---
    
    ## Recap
    
    - Data parallelism:
        - Split the data over multiple GPUs
        - Each GPU runs the whole model
        - The gradients are averaged at each step
    - Data parallelism, multi-node:
        - Same, but gradients are averaged across nodes
    - Model parallelism:
        - Split the model over multiple GPUs
        - Each GPU does the forward/backward pass
        - The gradients are averaged at the end
    - Model parallelism, multi-node:
        - Same, but gradients are averaged across nodes
    
    ---
    
    ## Are we there yet?
    
    ![](images/are-we-there-yet-4.gif)
    
    ---
    
    ## If you haven't done so, please access the slides and clone the repository:
    
    ![](images/slides.png)
    
```bash
git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-11-talk-intro-to-supercompting-jsc.git
```

---

## DEMO TIME!

- Let's take a simple model
- Run it "serially" (single-GPU)
- We make it data parallel among multiple GPUs in one node
- Then we make it multi-node!

---

## Expected imports

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
```

---
## Bringing your data in

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320)
```

---
## Loading your data

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)
```

---
## Single-GPU code

```python
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()

learn.fine_tune(6)
```

---

## Venv template

- EXTREMELY useful thing to keep your software in order
- Make a venv with the correct supercomputer modules
- Only add new requirements
- Link to gitlab repo

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
```

- Add this to sc_venv_template/requirements.txt:

```
# Add here the pip packages you would like to install on this virtual environment / kernel
pip
fastai==2.7.15
scipy==1.11.1
matplotlib==3.7.2
scikit-learn==1.3.1
pandas==2.0.3
torch==2.1.2
accelerate
```

```bash
sc_venv_template/setup.sh
source sc_venv_template/activate.sh
```

- Done! You installed everything you need
    
    ---
    
    ## Submission Script
    
    ```bash
    #!/bin/bash
    #SBATCH --account=training2436
    #SBATCH --nodes=1
    #SBATCH --job-name=ai-serial
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=1
    #SBATCH --output=out-serial.%j
    #SBATCH --error=err-serial.%j
    #SBATCH --time=00:40:00
    #SBATCH --partition=dc-gpu
    
    # Make sure we are on the right directory
    cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
    
    # This loads modules and python packages
    source sc_venv_template/activate.sh
    
    # Run the demo
time srun python serial.py
```

---

## Download dataset

- Compute nodes have no internet
- We need to download the dataset

---

## Download dataset

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py
```

```
(Some warnings)
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
```

- It started training, on the login node's CPUs (WRONG!!!)
- That means we have the data!
- We just cancel with Ctrl+C

---

## Running it

```bash
cd $HOME/2024-11-talk-intro-to-supercompting-jsc/src
sbatch serial.slurm
```

- On JUWELS Booster, this should take about 5 minutes
- On a CPU system this would take half a day
- Check the out-serial.XXXXXX and err-serial.XXXXXX files
    
    ---
    
    ## Going data parallel
    
    - Almost same code as before, let's show the differences
    
    ---
    
    ## Data parallel
    
    ```python
    from fastai.vision.all import *
    from fastai.distributed import *
    from fastai.vision.models.xresnet import *
    
    path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
    dls = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        splitter=GrandparentSplitter(valid_name='val'),
        get_items=get_image_files, get_y=parent_label,
        item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
        batch_tfms=Normalize.from_stats(*imagenet_stats)
    ).dataloaders(path, path=path, bs=64)
    
    learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()
    with learn.distrib_ctx():
    learn.fine_tune(6)
```

---

## Data Parallel

### What changed?

It was:

```python
path = untar_data(URLs.IMAGEWOOF_320)
```

Became:

```python
path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
```

It was:

```python
learn.fine_tune(6)
```

Became:

```python
with learn.distrib_ctx():
    learn.fine_tune(6)
```

---

## Submission script: data parallel

```bash
#SBATCH --cpus-per-task=48
#SBATCH --gres=gpu:4
```
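
The idea behind `rank0_first`: rank 0 runs the download first, then the other ranks run the same function and simply hit the local cache. A stand-in sketch in plain Python (simulated ranks and a fake `untar_data`; the real fastai helper synchronizes ranks via torch.distributed barriers):

```python
cache = {}
downloads = {"count": 0}

def untar_data_fake(url):
    # Stand-in for fastai's untar_data: downloads once, reuses the cache after
    if url not in cache:
        downloads["count"] += 1
        cache[url] = "/fake/path/for/" + url
    return cache[url]

def rank0_first_sim(num_ranks, fn, *args):
    # Rank 0 goes first (and pays for the download)...
    results = {0: fn(*args)}
    # ...then the remaining ranks run the same call and find it cached
    for rank in range(1, num_ranks):
        results[rank] = fn(*args)
    return results

paths = rank0_first_sim(4, untar_data_fake, "IMAGEWOOF_320")
print(downloads["count"])        # 1 -- only rank 0 downloaded
print(len(set(paths.values())))  # 1 -- every rank got the same path
```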

    
    ---
    
    ## Let's check the outputs!
    
    #### Single GPU:
    
    ```bash
    epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
    0         2.249933    2.152813    0.225757  0.750573        01:11                          
    epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
    0         1.882008    1.895813    0.324510  0.832018        00:44                          
    1         1.837312    1.916380    0.374141  0.845253        00:44                          
    2         1.717144    1.739026    0.378722  0.869941        00:43                          
    3         1.594981    1.637526    0.417664  0.891575        00:44                          
    4         1.460454    1.410519    0.507254  0.920336        00:44                          
    5         1.389946    1.304924    0.538814  0.935862        00:43  
real	5m44.972s
```

#### Multi GPU:

```bash
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
0         2.201540    2.799354    0.202950  0.662513        00:09
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
0         1.951004    2.059517    0.294761  0.781282        00:08
1         1.929561    1.999069    0.309512  0.792981        00:08
2         1.854629    1.962271    0.314344  0.840285        00:08
3         1.754019    1.687136    0.404883  0.872330        00:08
4         1.643759    1.499526    0.482706  0.906409        00:08
5         1.554356    1.450976    0.502798  0.914547        00:08
real	1m19.979s
```
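
Plugging the two `real` timings into a quick speedup calculation:

```python
serial_s = 5 * 60 + 44.972      # 5m44.972s, single GPU
parallel_s = 1 * 60 + 19.979    # 1m19.979s, 4 GPUs (and 48 CPU threads)
speedup = serial_s / parallel_s
print(round(speedup, 2))  # 4.31
```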

---

## Some insights

- Distributed run suffered a bit on the accuracy 🎯 and loss 😩
    - In exchange for speed 🏎️
    - Train a bit longer and you're good!
- It's more than 4x faster because the library is multi-threaded (and now we use 48 threads)
- I/O is automatically parallelized / sharded by the Fast.AI library
- Data parallel is a simple and effective way to distribute DL workloads 💪
- This is really just a primer - there's much more to it
- I/O plays a HUGE role on supercomputers, for example

---

## Multi-node

- Simply change `#SBATCH --nodes=2` on the submission file!
- THAT'S IT

---

## Multi-node

```bash
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
0         2.242036    2.192690    0.201728  0.681148        00:10
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
0         2.035004    2.084082    0.246189  0.748984        00:05
1         1.981432    2.054528    0.247205  0.764482        00:05
2         1.942930    1.918441    0.316057  0.821138        00:05
3         1.898426    1.832725    0.370173  0.839431        00:05
4         1.859066    1.781805    0.375508  0.858740        00:05
5         1.820968    1.743448    0.394055  0.864583        00:05

real	1m15.651s
```

    
    ---
    
    ## Some insights
    
    - It's faster per epoch, but not by much (5 seconds vs 8 seconds)
    - Accuracy and loss suffered
    - This is a very simple model, so it's not surprising
        - It fits into 4 GB, we "stretched" it to a 320 GB system
        - It's not a good fit for this system
    - You need bigger models to really exercise the GPUs and scaling
    - There's a lot more to it, but for now, let's focus on medium/big sized models
        - For Gigantic and Humongous-sized models, there's a DL scaling course at JSC!
    
    ---
    
    ## That's all folks!
    
    - Thanks for listening!
    - Questions?
    
    ---
    
    ## References
    
    - [Pytorch Model Parallelism and Pipelining](https://pytorch.org/docs/stable/distributed.pipelining.html)
    - [Intro to Distributed Deep Learning](https://xiandong79.github.io/Intro-Distributed-Deep-Learning)
    - [Model Parallelism - Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-intro.html)