Alexandre Strube / 2023-nov-intro-to-supercompting-jsc / Commits

Commit ad0e525f
Authored 2 years ago by Alexandre Strube
Parent: 4d8f63ba

    save as 1 node on the example

Pipeline #140815 passed (stages: test, deploy), 2 years ago
Showing 3 changed files, with 33 additions and 25 deletions:

- 01-deep-learning-on-supercomputers.md (+9 −3)
- public/01-deep-learning-on-supercomputers.html (+23 −21)
- src/distrib.slurm (+1 −1)
01-deep-learning-on-supercomputers.md (+9 −3)

...
@@ -490,9 +490,10 @@ learn.fine_tune(6)

- Add this to requirements.txt:

  ```
  fastai
  accelerate
  deepspeed
  git+https://github.com/huggingface/accelerate@rdzv-endpoint
  ```

- (the last one will become `accelerate` later this week)
- Run `./setup.sh` (a sketch of these scripts follows this hunk)
- `source activate.sh`
- Done! You installed everything you need
...
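Neither script's contents appear in this diff, so the following is only a rough sketch of what `./setup.sh` and `activate.sh` typically amount to in course setups like this one; the venv name `sc_venv` and the exact steps are assumptions:

```bash
#!/bin/bash
# setup.sh (sketch, not the actual course script): create a virtual
# environment once and install the requirements listed in the slide above.
python3 -m venv sc_venv
source sc_venv/bin/activate
pip install -r requirements.txt
```

`activate.sh` would then do little more than `source sc_venv/bin/activate` for later sessions.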
@@ -625,16 +626,21 @@ epoch train_loss valid_loss accuracy top_k_accuracy time

```
5      1.554356    1.450976    0.502798    0.914547    00:08

real    1m19.979s
```
---
## Some insights
- Distributed run suffered a bit on the accuracy 🎯 and loss 😩
  - In exchange for speed 🏎️
- It's more than 4x faster because the library is multi-threaded (and now we use 48 threads)
- I/O is automatically parallelized / sharded by the Fast.AI library
- Data parallel is a simple and effective way to distribute DL workload 💪 (see the launch sketch below this list)
- This is really just a primer - there's much more to it
- I/O plays a HUGE role on supercomputers, for example
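As a rough illustration of the data-parallel bullet above: runs like this are usually started with `accelerate launch` (the slides install `accelerate`), which spawns one training process per GPU, each working on its own shard of every batch. The script name and process count here are assumptions for illustration, not part of the course material:

```bash
# Data-parallel launch sketch (train_resnet.py and the count of 4 are
# hypothetical). accelerate spawns one process per GPU and handles
# process-group setup and gradient synchronization between them.
accelerate launch --multi_gpu --num_processes 4 train_resnet.py
```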
---
## Multi-node
- Simply change `#SBATCH --nodes=2` on the submission file! (a full header sketch follows below)
...
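For context, this is what the two-node submission header looks like, assembled from the `src/distrib.slurm` lines visible at the bottom of this commit; only the `--nodes` line differs from the single-node version saved here:

```bash
#!/bin/bash
#SBATCH --account=training2306
#SBATCH --nodes=2               # the only line that changes for multi-node
#SBATCH --job-name=ai-multi-gpu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
```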
public/01-deep-learning-on-supercomputers.html (+23 −21)

(Generated slide HTML for the Markdown above; its diff mirrors the same changes: the new `git+https://github.com/huggingface/accelerate@rdzv-endpoint` requirement, the reworked "Some insights" bullets, "Multi-node" slides now rendered as proper sections instead of a mangled one-column table, and the start of the multi-node training output. Diff collapsed.)
src/distrib.slurm (+1 −1)

```diff
 #!/bin/bash
 #SBATCH --account=training2306
-#SBATCH --nodes=2
+#SBATCH --nodes=1
 #SBATCH --job-name=ai-multi-gpu
 #SBATCH --ntasks-per-node=1
 #SBATCH --cpus-per-task=48
 ...
```