Simulation and Data Lab Applied Machine Learning / PyTorch at JSC

Commit fbe9d453, authored 9 months ago by Jan Ebert (parent f62bb14d)

Add section on Apptainers

1 changed file: README.md (+81, −0)
@@ -31,6 +31,7 @@ https://medium.com/pytorch/pytorch-data-parallel-best-practices-on-google-cloud-
- [GPU architecture selection](#gpu-architecture-selection)
- [Kernels not being compiled](#kernels-not-being-compiled)
- [PyTorch Lightning](#pytorch-lightning)
- [Apptainers](#apptainers)
- [Advanced PyTorch Distributed debugging](#advanced-pytorch-distributed-debugging)
- [DDP](#ddp)
- [DDP considerations](#ddp-considerations)
@@ -385,6 +386,86 @@ def patch_lightning_slurm_master_addr():
patch_lightning_slurm_master_addr()
```
### Apptainers

At JSC, the available container runtime is Apptainer, which – in JSC's
configuration – uses read-only containers that we can neither modify
nor use file system overlays with. This causes problems with certain
versions of some libraries, such as Triton, which for some time used a
hardcoded, non-configurable way of locating CUDA libraries. Before we
look at how to handle those older versions of Triton (which require
more work), here is how to specify a CUDA library path for recent
Triton versions in case you encounter errors relating to it:
```shell
# Replace the path according to your container.
export TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64/stubs
```
In certain development versions of Triton 2.1.0 (more specifically,
from commit `c9ab44888ed445acf7acb7d377aae98e07630015` up to and
excluding commit `46452fae3bb072b9b8da4d1529a0af7c8f233de5`), it was
not possible to specify CUDA library paths. Because a specific
hardcoded query was used to obtain the library paths, we would have to
modify the container to update the queried library path caches, which
we are unable to do (due to Apptainers being read-only and JSC not
allowing overlays). A solution is to overwrite the erroneous file that
contains the query with a patched version at runtime using Apptainer's
`--bind` argument.
For simplicity's sake, let's assume that you have already set the
environment variable `TRITON_LIBCUDA_PATH` to the correct location
containing CUDA libraries (even though the variable has no meaning in
this Triton version). This has the added benefit of future Triton
versions working automatically in case you decide to update the
container or Triton version. You also need to identify where
`triton/common/build.py` lies in the container. Ideally, you have an
error message that helps with this. In our example container, the full
path to `triton/common/build.py` is
`/usr/local/lib/python3.10/dist-packages/triton/common/build.py`,
which we store in the environment variable `triton_build_py_path`.
First, create the patched file from inside the container:
```shell
# This has to be executed from inside the container!
triton_build_py_path=/usr/local/lib/python3.10/dist-packages/triton/common/build.py
# Where to place the patched file. Has to be a location outside the container.
triton_build_py_patched_path=./triton-build-patched.py
sed 's:libs = subprocess\..*$:libs = "'"$TRITON_LIBCUDA_PATH"'":g' \
    "$triton_build_py_path" > "$triton_build_py_patched_path"
```
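As a quick sanity check (not part of the original recipe), you can feed a stand-in line shaped like the one in Triton's `build.py` through the same `sed` expression and confirm that it produces the expected hardcoded path; the input line here is illustrative, not copied from Triton:

```shell
# Hypothetical sanity check: run the same sed substitution on a sample
# line mimicking Triton's build.py and inspect the result.
export TRITON_LIBCUDA_PATH=/usr/local/cuda/lib64/stubs
patched_line=$(echo 'libs = subprocess.check_output(["/sbin/ldconfig", "-p"]).decode()' \
    | sed 's:libs = subprocess\..*$:libs = "'"$TRITON_LIBCUDA_PATH"'":g')
echo "$patched_line"
# → libs = "/usr/local/cuda/lib64/stubs"
```

Note that the `sed` expression uses `:` as its delimiter precisely so that the slashes in `$TRITON_LIBCUDA_PATH` need no escaping.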
Then, when executing the container, bind the patched file to where the
old file lies (make sure that the variables `triton_build_py_path` and
`triton_build_py_patched_path` are set here as well):

```shell
apptainer run --bind "$triton_build_py_patched_path":"$triton_build_py_path" [...]
```
The same strategy can also be used to apply the
[aforementioned patches to PyTorch Lightning](#pytorch-lightning),
with an example for PyTorch Lightning ≥2 here:
```shell
# This has to be executed from inside the container!
pl_slurm_py_path=/usr/local/lib/python3.10/dist-packages/lightning_fabric/plugins/environments/slurm.py
# Where to place the patched file. Has to be a location outside the container.
pl_slurm_py_patched_path=./pl-slurm-patched.py
sed 's:root_node = \(self\.resolve_root_node_address(.*)\)$:root_node = os.getenv("MASTER_ADDR", \1):g' \
    "$pl_slurm_py_path" > "$pl_slurm_py_patched_path"
```
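As before, the substitution can be tried on its own against an illustrative stand-in for the relevant line in Lightning's `slurm.py` (the stand-in is an assumption for demonstration, not the file's verbatim contents):

```shell
# Hypothetical sanity check: apply the same sed expression to a stand-in
# line and inspect the result.
patched_line=$(echo 'root_node = self.resolve_root_node_address(nodes)' \
    | sed 's:root_node = \(self\.resolve_root_node_address(.*)\)$:root_node = os.getenv("MASTER_ADDR", \1):g')
echo "$patched_line"
# → root_node = os.getenv("MASTER_ADDR", self.resolve_root_node_address(nodes))
```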
Then, when executing the container, bind the patched file to where the
old file lies (make sure that the variables `pl_slurm_py_path` and
`pl_slurm_py_patched_path` are set here as well):

```shell
apptainer run --bind "$pl_slurm_py_patched_path":"$pl_slurm_py_path" [...]
```
Of course, other libraries that require it can similarly be patched
using Apptainer's `--bind` functionality.
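The general recipe can be sketched as follows; all names and the patched value here are placeholders (a temporary file stands in for the library file, which would really live inside the container):

```shell
# Generic pattern with placeholder names. In practice, lib_py_path is a
# read-only file inside the container.
lib_py_path=$(mktemp)
echo 'TIMEOUT = 30' > "$lib_py_path"
lib_py_patched_path=$(mktemp)
# Patch the copy outside the container.
sed 's/TIMEOUT = 30/TIMEOUT = 300/g' "$lib_py_path" > "$lib_py_patched_path"
cat "$lib_py_patched_path"
# → TIMEOUT = 300
# Then, at runtime, bind the patched copy over the original:
# apptainer run --bind "$lib_py_patched_path":"$lib_py_path" [...]
```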
### Advanced PyTorch Distributed debugging

To enable logging for the Python parts of PyTorch Distributed, please
...