Now that this code runs on a single node of the JUWELS Booster, we want to try it on several nodes. Therefore, we have to integrate Horovod into this branch. The branch bing_issue#027_test_horovod does not have the new develop structure, so we re-introduce Horovod using this branch.
While migrating our source code to Juwels Booster for multiple GPUs (see e.g. #26 (closed) and #31 (closed)), we faced a couple of user-specific problems. For instance, the code crashed for completely different reasons on Michael's account compared to Bing's and Scralet's accounts. Closer inspection revealed that this behaviour can be traced back to the initialization of the virtual environment. In order to access the container's Python packages, the flag --system-site-packages was added in commit a7ba6c1c. However, this has the drawback that the user-specific Python site-packages (located under $HOME/.local/lib/python3.6/site-packages/) also become accessible from the virtual environment, which introduces an undesired user-specific dependency.
An alternative and better approach is to drop the above-mentioned flag and to prepend the path to the container's site packages to PYTHONPATH (i.e. export PYTHONPATH=/usr/local/lib/python3.6/dist-packages:$PYTHONPATH), while installing all missing Python packages such as netCDF4-1.5.3 via requirements_booster.txt. In addition to these adaptations, we agreed to pin all package versions to prevent potential incompatibilities which may arise with newer Python package versions.
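As a rough illustration of the approach described above, here is a minimal Python sketch. The helper names user_site_leaks and extend_pythonpath are hypothetical and not part of the repository; the first checks whether the per-user site-packages directory appears on a list of import paths, the second builds the PYTHONPATH value with the container's dist-packages path in front.

```python
import os
import site


def user_site_leaks(paths, user_site=None):
    """Return True if the per-user site-packages directory is among `paths`.

    `user_site` defaults to what the running interpreter reports, e.g.
    $HOME/.local/lib/python3.6/site-packages on the set-up discussed here.
    """
    if user_site is None:
        user_site = site.getusersitepackages()
    target = os.path.normpath(user_site)
    # Skip empty entries, which stand for the current working directory.
    return any(os.path.normpath(p) == target for p in paths if p)


def extend_pythonpath(container_site, current=""):
    """Prepend `container_site` to an existing PYTHONPATH string.

    Empty components are dropped so that an unset PYTHONPATH does not
    leave a trailing separator behind.
    """
    parts = [container_site] + [p for p in current.split(os.pathsep) if p]
    return os.pathsep.join(parts)
```

Inside the activated virtual environment, user_site_leaks(sys.path) (after import sys) could serve as a quick sanity check that the user-specific packages are no longer visible.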
Since we have running code on Booster and did some first benchmarking, I agree we can close this issue. We might have to open a new one after Andreas gives us feedback. Maybe he wants more/different tests or spots issues we have missed so far.
@langguth1 in performance_check.ipynb, in the function plot_speedup:
The line plt.xticks(np.arange(0, len(ngpus)), xlabels) should be plt.xticks(np.arange(0, len(ngpus)-1), xlabels), right?
It throws an error saying that xlabels does not match the range.
@gong1 Not sure, since I'm not able to check it due to maintenance on Juwels.
I will check as soon as possible.
Btw.: If you are still working on this issue branch, shouldn't it then be kept open?
If you are just working on performance_check.ipynb and the other Jupyter Notebook, you may consider integrating them into develop?
Edit 2021-01-29:
I can confirm that it should be plt.xticks(np.arange(0, len(ngpus)-1), xlabels) rather than plt.xticks(np.arange(0, len(ngpus)), xlabels). Thanks for pointing it out!
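For reference, a small sketch of why the off-by-one occurs. It assumes (this is not shown in the thread) that plot_speedup uses one label per pair of consecutive GPU counts, which gives len(ngpus)-1 labels; the helper name speedup_tick_layout and the GPU counts are hypothetical. Deriving the tick positions from the labels themselves keeps both sequences the same length, which plt.xticks requires.

```python
def speedup_tick_layout(ngpus):
    """Return (positions, labels) for speedups between consecutive GPU counts.

    Assumption: one label per consecutive pair, so there are
    len(ngpus) - 1 labels and therefore len(ngpus) - 1 tick positions.
    """
    labels = ["{0} -> {1}".format(a, b) for a, b in zip(ngpus[:-1], ngpus[1:])]
    # Build the positions from the labels so the lengths cannot diverge.
    positions = list(range(len(labels)))
    return positions, labels


ngpus = [1, 4, 8, 16]  # hypothetical GPU counts, for illustration only
positions, xlabels = speedup_tick_layout(ngpus)

# In the notebook this would be passed on as:
# plt.xticks(positions, xlabels)
```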