diff --git a/01-deep-learning-on-supercomputers.md b/01-deep-learning-on-supercomputers.md
index 7749cc6854e7d806ab84dad9482fbbc494377995..06fe80cbfe0984455bca6709cd8b2fd925efd54a 100644
--- a/01-deep-learning-on-supercomputers.md
+++ b/01-deep-learning-on-supercomputers.md
@@ -652,25 +652,28 @@
 real    1m19.979s
 
 - ```bash
 epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
-0         2.230926    2.414113    0.170986  0.654726        00:10
+0         2.242036    2.192690    0.201728  0.681148        00:10
 epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
-0         1.986611    1.993477    0.298018  0.790142        00:06
-1         1.954962    2.180505    0.249238  0.765498        00:06
-2         1.915481    2.004775    0.301829  0.803354        00:06
-3         1.853237    1.827811    0.364583  0.837906        00:06
-4         1.783993    1.779548    0.391768  0.847307        00:06
-5         1.718417    1.642507    0.422002  0.884909        00:06
+0         2.035004    2.084082    0.246189  0.748984        00:05
+1         1.981432    2.054528    0.247205  0.764482        00:05
+2         1.942930    1.918441    0.316057  0.821138        00:05
+3         1.898426    1.832725    0.370173  0.839431        00:05
+4         1.859066    1.781805    0.375508  0.858740        00:05
+5         1.820968    1.743448    0.394055  0.864583        00:05
+real    1m15.651s
 ```
 
 ---
 
 ## Some insights
 
-- It's faster per epoch, but not by much (6 seconds vs 8 seconds)
+- It's faster per epoch, but not by much (5 seconds vs 8 seconds)
 - Accuracy and loss suffered
 - This is a very simple model, so it's not surprising
+    - It fits into 4 GB; we "stretched" it to a 320 GB system
 - You need bigger models to really exercise the gpu and scaling
-- There's a lot more to that
+- There's a lot more to it, but for now, let's focus on medium/big-sized models
+    - For Gigantic and Humongous-sized models, there's a DL scaling course at JSC!
 
 ---
diff --git a/public/01-deep-learning-on-supercomputers.html b/public/01-deep-learning-on-supercomputers.html
index 316b378bac70943833cdafb33b284789f93cbe18..179adaa365fb11689bd719c3b509d074ef5e74d5 100644
--- a/public/01-deep-learning-on-supercomputers.html
+++ b/public/01-deep-learning-on-supercomputers.html
@@ -850,27 +850,36 @@
 submission file!</li>
 <ul>
 <li class="fragment"><div class="sourceCode" id="cb15"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span>     train_loss  valid_loss  accuracy  top_k_accuracy  time</span>
-<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span>         2.230926    2.414113    0.170986  0.654726        00:10</span>
+<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span>         2.242036    2.192690    0.201728  0.681148        00:10</span>
 <span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a><span class="ex">epoch</span>     train_loss  valid_loss  accuracy  top_k_accuracy  time</span>
-<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span>         1.986611    1.993477    0.298018  0.790142        00:06</span>
-<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span>         1.954962    2.180505    0.249238  0.765498        00:06</span>
-<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a><span class="ex">2</span>         1.915481    2.004775    0.301829  0.803354        00:06</span>
-<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span>         1.853237    1.827811    0.364583  0.837906        00:06</span>
-<span id="cb15-8"><a href="#cb15-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span>         1.783993    1.779548    0.391768  0.847307        00:06</span>
-<span id="cb15-9"><a href="#cb15-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span>         1.718417    1.642507    0.422002  0.884909        00:06</span></code></pre></div></li>
+<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a><span class="ex">0</span>         2.035004    2.084082    0.246189  0.748984        00:05</span>
+<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a><span class="ex">1</span>         1.981432    2.054528    0.247205  0.764482        00:05</span>
+<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a><span class="ex">2</span>         1.942930    1.918441    0.316057  0.821138        00:05</span>
+<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a><span class="ex">3</span>         1.898426    1.832725    0.370173  0.839431        00:05</span>
+<span id="cb15-8"><a href="#cb15-8" aria-hidden="true" tabindex="-1"></a><span class="ex">4</span>         1.859066    1.781805    0.375508  0.858740        00:05</span>
+<span id="cb15-9"><a href="#cb15-9" aria-hidden="true" tabindex="-1"></a><span class="ex">5</span>         1.820968    1.743448    0.394055  0.864583        00:05</span>
+<span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span>    1m15.651s</span></code></pre></div></li>
 </ul>
 </section>
 <section id="some-insights-1" class="slide level2">
 <h2>Some insights</h2>
 <ul>
-<li class="fragment">It’s faster per epoch, but not by much (6 seconds
+<li class="fragment">It’s faster per epoch, but not by much (5 seconds
 vs 8 seconds)</li>
 <li class="fragment">Accuracy and loss suffered</li>
-<li class="fragment">This is a very simple model, so it’s not
-surprising</li>
+<li class="fragment">This is a very simple model, so it’s not surprising
+<ul>
+<li class="fragment">It fits into 4 GB; we “stretched” it to a 320 GB
+system</li>
+</ul></li>
 <li class="fragment">You need bigger models to really exercise the gpu
 and scaling</li>
-<li class="fragment">There’s a lot more to that</li>
+<li class="fragment">There’s a lot more to it, but for now, let’s
+focus on medium/big-sized models
+<ul>
+<li class="fragment">For Gigantic and Humongous-sized models, there’s a
+DL scaling course at JSC!</li>
+</ul></li>
 </ul>
 </section>
 <section id="thats-all-folks" class="slide level2">
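The "faster per epoch, but not by much" point in this diff can be made quantitative. A minimal sketch of the arithmetic follows: the 8 s and 5 s per-epoch times are taken from the training logs in the hunks above, while the worker count of 4 is a hypothetical example for illustration (the slides do not state how many GPUs the distributed run used).

```python
# Back-of-the-envelope scaling check for the epoch times quoted above.
# 8 s/epoch (single run) and 5 s/epoch (distributed run) come from the logs;
# N_WORKERS = 4 is an assumed worker count, not a figure from the slides.

def scaling_efficiency(t_single: float, t_parallel: float, n_workers: int) -> float:
    """Speedup divided by worker count; 1.0 would be perfect linear scaling."""
    return (t_single / t_parallel) / n_workers

N_WORKERS = 4
speedup = 8.0 / 5.0
print(f"speedup: {speedup:.2f}x")
print(f"efficiency: {scaling_efficiency(8.0, 5.0, N_WORKERS):.0%}")
```

Efficiency far below 100% is exactly the small-model symptom the "Some insights" slide describes: when the model fits in a few gigabytes, per-step communication and launch overhead dominate, and adding devices buys little.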