Commit fbe9d453 authored by Jan Ebert

Add section on Apptainers
...
- [GPU architecture selection](#gpu-architecture-selection)
- [Kernels not being compiled](#kernels-not-being-compiled)
- [PyTorch Lightning](#pytorch-lightning)
- [Apptainers](#apptainers)
- [Advanced PyTorch Distributed debugging](#advanced-pytorch-distributed-debugging)
- [DDP](#ddp)
- [DDP considerations](#ddp-considerations)
...
patch_lightning_slurm_master_addr()
```
### Apptainers
At JSC, the available container runtime is Apptainer, which – in JSC's
configuration – uses read-only containers that can neither be modified
nor combined with file system overlays. This causes problems with
certain versions of some libraries, such as Triton, which for some
time hardcoded how it looked up CUDA libraries. Before we look at how
to handle those older versions of Triton (which require more work),
here is how to specify a CUDA library path for recent Triton versions
in case you encounter errors relating to it:
```shell
# Replace the path according to your container.
export TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64/stubs
```
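If you are unsure where the CUDA stubs live in your container, a
simple search from inside the container can help. This is just one
possible approach; `my_container.sif` is a placeholder for your actual
image:
```shell
# Look for `libcuda.so*` below common installation prefixes; errors
# for non-existent directories are suppressed.
apptainer exec my_container.sif \
    find /usr/local /usr/lib -name 'libcuda.so*' 2>/dev/null
```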
In certain development versions of Triton 2.1.0 (more specifically,
from commit `c9ab44888ed445acf7acb7d377aae98e07630015` up to and
excluding commit `46452fae3bb072b9b8da4d1529a0af7c8f233de5`), it was
not possible to specify CUDA library paths. Because a specific
hardcoded query was used to obtain the library paths, we would have to
modify the container to update the queried library path caches, which
we are unable to do (due to Apptainers being read-only and JSC not
allowing overlays). A solution is to overwrite the erroneous file that
contains the query with a patched version at runtime using Apptainer's
`--bind` argument.
For simplicity's sake, let's assume that you have already set the
environment variable `TRITON_LIBCUDA_PATH` to the correct location
containing the CUDA libraries (even though the variable has no meaning
in this Triton version). This has the added benefit that future Triton
versions will work automatically in case you decide to update the
container or Triton version. You also need to identify where
`triton/common/build.py` lies in the container. Ideally, you have an
error message that helps with this. In our example container, the full
path to `triton/common/build.py` is
`/usr/local/lib/python3.10/dist-packages/triton/common/build.py`,
which we store in the shell variable `triton_build_py_path`.
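If you have no helpful error message at hand, you can also ask Python
inside the container directly. This is only a sketch, again assuming
the hypothetical image name `my_container.sif` and that `python` is on
the container's default path:
```shell
# Print the location of Triton's `build.py` inside the container.
apptainer exec my_container.sif \
    python -c 'import triton.common.build; print(triton.common.build.__file__)'
```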
First, create the patched file from inside the container:
```shell
# This has to be executed from inside the container!
triton_build_py_path=/usr/local/lib/python3.10/dist-packages/triton/common/build.py
# Where to place the patched file. Has to be a location outside the container.
triton_build_py_patched_path=./triton-build-patched.py
sed 's:libs = subprocess\..*$:libs = "'"$TRITON_LIBCUDA_PATH"'":g' "$triton_build_py_path" > "$triton_build_py_patched_path"
```
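As an optional sanity check, you can confirm that the query was
actually replaced:
```shell
# The patched file should now contain the hardcoded library path
# instead of the `subprocess` query.
grep 'libs = ' "$triton_build_py_patched_path"
```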
Then, when executing the container, bind the patched file to where the
old file lies (make sure that the variables `triton_build_py_path` and
`triton_build_py_patched_path` are set here as well):
```shell
apptainer run --bind "$triton_build_py_patched_path":"$triton_build_py_path" [...]
```
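Since the patched file only needs to be read inside the container, you
can additionally mark the bind as read-only using Apptainer's standard
`source:destination:options` bind syntax:
```shell
apptainer run --bind "$triton_build_py_patched_path":"$triton_build_py_path":ro [...]
```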
The same strategy can also be used to apply the [aforementioned
patches to PyTorch Lightning](#pytorch-lightning), with an example for
PyTorch Lightning ≥2 here:
```shell
# This has to be executed from inside the container!
pl_slurm_py_path=/usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/slurm.py
# Where to place the patched file. Has to be a location outside the container.
pl_slurm_py_patched_path=./pl-slurm-patched.py
sed 's:root_node = \(self\.resolve_root_node_address(.*)\)$:root_node = os.getenv("MASTER_ADDR", \1):g' "$pl_slurm_py_path" > "$pl_slurm_py_patched_path"
```
Then, when executing the container, bind the patched file to where the
old file lies (make sure that the variables `pl_slurm_py_path` and
`pl_slurm_py_patched_path` are set here as well):
```shell
apptainer run --bind "$pl_slurm_py_patched_path":"$pl_slurm_py_path" [...]
```
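To check that the bind took effect, you can inspect the file from
inside the running container (the path below is from our example and
may differ for yours):
```shell
# The bound file should now read `MASTER_ADDR` from the environment.
grep 'MASTER_ADDR' /usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/slurm.py
```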
Of course, other libraries that require it can similarly be patched
using Apptainer's `--bind` functionality.
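If you need several such patches at once – for example, both of the
above – `--bind` can simply be passed multiple times:
```shell
apptainer run \
    --bind "$triton_build_py_patched_path":"$triton_build_py_path" \
    --bind "$pl_slurm_py_patched_path":"$pl_slurm_py_path" \
    [...]
```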
### Advanced PyTorch Distributed debugging
To enable logging for the Python parts of PyTorch Distributed, please
...