From fbe9d453dca1fa84fb4921ca375aca5ec593c6aa Mon Sep 17 00:00:00 2001
From: janEbert <janpublicebert@posteo.net>
Date: Wed, 18 Sep 2024 22:40:35 +0200
Subject: [PATCH] Add section on Apptainers

---
 README.md | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/README.md b/README.md
index 4ba328e..e479f86 100644
--- a/README.md
+++ b/README.md
@@ -31,6 +31,7 @@ https://medium.com/pytorch/pytorch-data-parallel-best-practices-on-google-cloud-
             - [GPU architecture selection](#gpu-architecture-selection)
             - [Kernels not being compiled](#kernels-not-being-compiled)
         - [PyTorch Lightning](#pytorch-lightning)
+        - [Apptainers](#apptainers)
         - [Advanced PyTorch Distributed debugging](#advanced-pytorch-distributed-debugging)
     - [DDP](#ddp)
         - [DDP considerations](#ddp-considerations)
@@ -385,6 +386,86 @@ def patch_lightning_slurm_master_addr():
 patch_lightning_slurm_master_addr()
 ```
 
+### Apptainers
+
+At JSC, the available container runtime is Apptainer, which – in
+JSC's configuration – runs read-only containers that we can neither
+modify nor combine with file system overlays. This causes problems
+with certain versions of some libraries, such as Triton, which for
+some time located its CUDA libraries in a hardcoded, non-configurable
+way. Before we look at how to handle those older versions of Triton
+(which require more work), here is how to specify a CUDA library path
+for recent Triton versions in case you encounter errors relating to
+it:
+
+```shell
+# Replace the path according to your container.
+export TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64/stubs
+```
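+
+The variable has to be visible inside the container. Apptainer passes
+most host environment variables through by default, but if in doubt,
+you can set it explicitly via the `--env` flag (the image name and
+command below are placeholders):
+
+```shell
+# Set the variable explicitly for the containerized process; replace
+# the image name and command with your own.
+apptainer exec --env TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64/stubs \
+    my-container.sif python3 -c 'import triton; print(triton.__version__)'
+```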
+
+In certain development versions of Triton 2.1.0 (more specifically,
+from commit `c9ab44888ed445acf7acb7d377aae98e07630015` up to and
+excluding commit `46452fae3bb072b9b8da4d1529a0af7c8f233de5`), it was
+not possible to specify CUDA library paths. Because a hardcoded query
+of the system's library path cache was used to obtain the library
+paths, we would have to modify the container to update that cache,
+which we are unable to do (the containers are read-only and JSC does
+not allow overlays). A solution is to overwrite the erroneous file
+that contains the query with a patched version at runtime, using
+Apptainer's `--bind` argument.
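+
+To check whether your container is affected, you can look for the
+query that the patch below targets (the path is the example path used
+further down; adjust it to your container):
+
+```shell
+# Print the offending line, run from inside the container.
+grep -n 'libs = subprocess' \
+    /usr/local/lib/python3.10/dist-packages/triton/common/build.py
+```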
+
+For simplicity's sake, let's assume that you have already set the
+environment variable `TRITON_LIBCUDA_PATH` to the correct location
+containing CUDA libraries (even though the variable has no meaning in
+this Triton version). This has the added benefit that future Triton
+versions will work automatically in case you decide to update the
+container or Triton version. You also need to identify where
+`triton/common/build.py` lies in the container. Ideally, you have an
+error message that helps with this. In our example container, the full
+path to `triton/common/build.py` is
+`/usr/local/lib/python3.10/dist-packages/triton/common/build.py`,
+which we store in the environment variable `triton_build_py_path`.
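+
+If you do not have a helpful error message, one way to locate the file
+is to ask Python from inside the container (assuming the module can be
+imported without triggering the error):
+
+```shell
+# Print the location of Triton's build.py, run from inside the
+# container.
+python3 -c 'import triton.common.build as m; print(m.__file__)'
+```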
+
+First, create the patched file from inside the container:
+
+```shell
+# This has to be executed from inside the container!
+triton_build_py_path=/usr/local/lib/python3.10/dist-packages/triton/common/build.py
+# Where to place the patched file. Has to be a location outside the container.
+triton_build_py_patched_path=./triton-build-patched.py
+sed 's:libs = subprocess\..*$:libs = "'"$TRITON_LIBCUDA_PATH"'":g' "$triton_build_py_path" > "$triton_build_py_patched_path"
+```
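+
+As a sanity check, the replaced query should be the only difference
+between the two files:
+
+```shell
+# Should print only the original and the patched `libs = ...` lines.
+diff "$triton_build_py_path" "$triton_build_py_patched_path"
+```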
+
+Then, when executing the container, bind the patched file to where the
+old file lies (make sure that the variables `triton_build_py_path` and
+`triton_build_py_patched_path` are set here as well):
+
+```shell
+apptainer run --bind "$triton_build_py_patched_path":"$triton_build_py_path" [...]
+```
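+
+To verify that the bind took effect, you can print the patched line
+from inside the container (the image name is a placeholder):
+
+```shell
+# Should show the patched `libs = "..."` assignment instead of the
+# `subprocess` query.
+apptainer exec --bind "$triton_build_py_patched_path":"$triton_build_py_path" \
+    my-container.sif grep 'libs =' "$triton_build_py_path"
+```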
+
+The same strategy can also be used to apply the [aforementioned
+patches to PyTorch Lightning](#pytorch-lightning), with an example for
+PyTorch Lightning ≥2 here:
+
+```shell
+# This has to be executed from inside the container!
+pl_slurm_py_path=/usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/slurm.py
+# Where to place the patched file. Has to be a location outside the container.
+pl_slurm_py_patched_path=./pl-slurm-patched.py
+sed 's:root_node = \(self\.resolve_root_node_address(.*)\)$:root_node = os.getenv("MASTER_ADDR", \1):g' "$pl_slurm_py_path" > "$pl_slurm_py_patched_path"
+```
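+
+The same sanity check as before applies; the `root_node = ...`
+assignment should be the only difference:
+
+```shell
+diff "$pl_slurm_py_path" "$pl_slurm_py_patched_path"
+```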
+
+Then, when executing the container, bind the patched file to where the
+old file lies (make sure that the variables `pl_slurm_py_path` and
+`pl_slurm_py_patched_path` are set here as well):
+
+```shell
+apptainer run --bind "$pl_slurm_py_patched_path":"$pl_slurm_py_path" [...]
+```
+
+Of course, other libraries that require it can be patched in the same
+way using Apptainer's `--bind` functionality. Multiple patched files
+can also be bound in a single invocation, as sketched below.
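+
+For example, assuming the variables from above and hypothetical image
+and script names (`my-container.sif`, `train.py`), a combined
+invocation could look like this:
+
+```shell
+# Bind both patched files at once; the image and the command are
+# placeholders for your own.
+apptainer exec \
+    --bind "$triton_build_py_patched_path":"$triton_build_py_path" \
+    --bind "$pl_slurm_py_patched_path":"$pl_slurm_py_path" \
+    my-container.sif \
+    python3 train.py
+```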
+
 ### Advanced PyTorch Distributed debugging
 
 To enable logging for the Python parts of PyTorch Distributed, please
-- 
GitLab