From fbe9d453dca1fa84fb4921ca375aca5ec593c6aa Mon Sep 17 00:00:00 2001
From: janEbert <janpublicebert@posteo.net>
Date: Wed, 18 Sep 2024 22:40:35 +0200
Subject: [PATCH] Add section on Apptainers

---
 README.md | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/README.md b/README.md
index 4ba328e..e479f86 100644
--- a/README.md
+++ b/README.md
@@ -31,6 +31,7 @@ https://medium.com/pytorch/pytorch-data-parallel-best-practices-on-google-cloud-
   - [GPU architecture selection](#gpu-architecture-selection)
   - [Kernels not being compiled](#kernels-not-being-compiled)
   - [PyTorch Lightning](#pytorch-lightning)
+  - [Apptainers](#apptainers)
   - [Advanced PyTorch Distributed debugging](#advanced-pytorch-distributed-debugging)
 - [DDP](#ddp)
   - [DDP considerations](#ddp-considerations)
@@ -385,6 +386,86 @@ def patch_lightning_slurm_master_addr():
 patch_lightning_slurm_master_addr()
 ```
 
+### Apptainers
+
+At JSC, the available container runtime is Apptainer, which – in JSC's
+configuration – runs containers read-only, so we can neither modify
+them nor use file system overlays. This causes problems with certain
+versions of some libraries, such as Triton, which for some time used a
+hardcoded, non-configurable way of locating its CUDA libraries. Before
+we look at how to handle those older versions of Triton (which require
+more work), here is how to specify a CUDA library path for recent
+Triton versions in case you encounter errors relating to it:
+
+```shell
+# Replace the path according to your container.
+export TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64/stubs
+```
+
+In certain development versions of Triton 2.1.0 (more specifically,
+from commit `c9ab44888ed445acf7acb7d377aae98e07630015` up to and
+excluding commit `46452fae3bb072b9b8da4d1529a0af7c8f233de5`), it was
+not possible to specify CUDA library paths. Because a hardcoded query
+was used to obtain the library paths, we would have to modify the
+container to update the queried library path caches, which we are
+unable to do (the containers are read-only and JSC does not allow
+overlays). A solution is to overwrite the erroneous file containing
+the query with a patched version at runtime, using Apptainer's
+`--bind` argument.
+
+For simplicity's sake, let's assume that you have already set the
+environment variable `TRITON_LIBCUDA_PATH` to the correct location
+containing the CUDA libraries (even though the variable has no effect
+in this Triton version). This has the added benefit that future Triton
+versions will work automatically in case you decide to update the
+container or Triton version. You also need to identify where
+`triton/common/build.py` lies in the container. Ideally, you have an
+error message that helps with this. In our example container, the full
+path to `triton/common/build.py` is
+`/usr/local/lib/python3.10/dist-packages/triton/common/build.py`,
+which we store in the shell variable `triton_build_py_path`.
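+
+If you are unsure about this path, you can also query it from Python
+inside the container. Here is a minimal sketch, assuming your
+container image is called `my_container.sif` (a hypothetical name –
+adjust it to your setup):
+
+```shell
+# Ask the Python installation inside the container where Triton's
+# `build.py` lies.
+apptainer exec my_container.sif \
+    python3 -c 'import triton.common.build; print(triton.common.build.__file__)'
+```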
+
+First, create the patched file from inside the container:
+
+```shell
+# This has to be executed from inside the container!
+triton_build_py_path=/usr/local/lib/python3.10/dist-packages/triton/common/build.py
+# Where to place the patched file. Has to be a location outside the container.
+triton_build_py_patched_path=./triton-build-patched.py
+# Replace the subprocess call that queries the library paths with the
+# fixed path from `TRITON_LIBCUDA_PATH`.
+sed 's:libs = subprocess\..*$:libs = "'"$TRITON_LIBCUDA_PATH"'":g' "$triton_build_py_path" > "$triton_build_py_patched_path"
+```
+
+Then, when executing the container, bind the patched file to where the
+old file lies (make sure that the variables `triton_build_py_path` and
+`triton_build_py_patched_path` are set here as well):
+
+```shell
+apptainer run --bind "$triton_build_py_patched_path":"$triton_build_py_path" [...]
+```
+
+The same strategy can also be used to apply the [aforementioned
+patches to PyTorch Lightning](#pytorch-lightning), shown here for
+PyTorch Lightning ≥2:
+
+```shell
+# This has to be executed from inside the container!
+pl_slurm_py_path=/usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/slurm.py
+# Where to place the patched file. Has to be a location outside the container.
+pl_slurm_py_patched_path=./pl-slurm-patched.py
+# Make the root node address resolution respect `MASTER_ADDR` if set.
+sed 's:root_node = \(self\.resolve_root_node_address(.*)\)$:root_node = os.getenv("MASTER_ADDR", \1):g' "$pl_slurm_py_path" > "$pl_slurm_py_patched_path"
+```
+
+Then, as before, bind the patched file to where the old file lies when
+executing the container (again making sure that the variables
+`pl_slurm_py_path` and `pl_slurm_py_patched_path` are set):
+
+```shell
+apptainer run --bind "$pl_slurm_py_patched_path":"$pl_slurm_py_path" [...]
+```
+
+Of course, other libraries that require it can similarly be patched
+using Apptainer's `--bind` functionality.
+
 ### Advanced PyTorch Distributed debugging
 
 To enable logging for the Python parts of PyTorch Distributed, please
--
GitLab