Commit 18955298 authored by Jan Ebert

Do not promote module usage for fixing `torchrun`

The module no longer includes patches, so we rely completely on
`torchrun_jsc`. Accordingly, users _have_ to use `torchrun_jsc`.
parent 15a5a05c
@@ -152,29 +152,26 @@ Sadly, there are some issues with this API that make its usage on JSC
 systems difficult because of the aforementioned special hostname
 handling ([see here for more
 information](https://github.com/pytorch/pytorch/issues/73656)).
-Thankfully, there are options to fix these issues:
-1. Use wrappers, such as
+Thankfully, there are options to fix these issues, such as wrappers
+like
 [`torchrun_jsc`](https://github.com/HelmholtzAI-FZJ/torchrun_jsc).
-<!-- For PyTorch ≥2, this wrapper is minimally intrusive and does
-not change underlying PyTorch code. Instead, it adds some extra
-argument configurations to solve the aforementioned issues. For
-PyTorch <2, it instead --> It modifies the underlying code
-on-the-fly to fix the issues. `torchrun_jsc`/`python -m
-torchrun_jsc` is a drop-in replacement for `torchrun`/`python -m
-torch.distributed.run` and can be installed via `pip`: `python -m
-pip install torchrun_jsc`.
-2. Use PyTorch as provided by the module system. We include patches to
-ensure that the errors in `torchrun` are fixed and that it reliably
-works on our system.
-In our example, we always use the wrapper even if we are already using
-the module system to show off how to use it. This way, you can apply
-the same template to your own projects that may use `pip`-installed
-PyTorch versions, or even a container. This also means that you need
-to set up a virtual environment with `torchrun_jsc` installed before
-being able to use the example out-of-the-box. This can be done by
-executing `nice bash set_up.sh` once on a login node.
+<!-- For PyTorch ≥2, this wrapper is minimally intrusive and does not
+change underlying PyTorch code. Instead, it adds some extra argument
+configurations to solve the aforementioned issues. For PyTorch <2, it
+instead --> It modifies the underlying code on-the-fly to fix the
+issues. `torchrun_jsc`/`python -m torchrun_jsc` is a drop-in
+replacement for `torchrun`/`python -m torch.distributed.run` and can
+be installed via `pip`: `python -m pip install torchrun_jsc`.
+In our example, we always use the wrapper to show off how to use it.
+This way, you can apply the same template to your own projects that
+may use `pip`-installed PyTorch versions, or even a container. Despite
+the name, `torchrun_jsc` works on _any_ computer since it is basically
+just a fixed version of `torchrun`, so you don't need to adapt your
+code when switching machines. This also means that you need to set up
+a virtual environment with `torchrun_jsc` installed before being able
+to use the example out-of-the-box. This can be done by executing `nice
+bash set_up.sh` once on a login node.
 ### Job submission
......
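Reviewer note: the changed text describes `torchrun_jsc`/`python -m torchrun_jsc` as a drop-in replacement for `torchrun`/`python -m torch.distributed.run`. The sketch below illustrates what that means for an existing launch command: only the launcher name changes, and every other argument is passed through untouched. The helper `to_torchrun_jsc` and the script name `main.py` are hypothetical and not part of the repository or of `torchrun_jsc`; the helper only rewrites an argument list and launches nothing.

```python
def to_torchrun_jsc(command):
    """Return a copy of `command` with the launcher swapped for torchrun_jsc.

    Hypothetical helper for illustration only; it is not part of
    torchrun_jsc or PyTorch.
    """
    command = list(command)

    # Form 1: the `torchrun` console script is the first element.
    if command and command[0] == "torchrun":
        return ["torchrun_jsc", *command[1:]]

    # Form 2: `<python> -m torch.distributed.run ...`; swap the first
    # module name that directly follows a `-m` flag.
    for i in range(len(command) - 1):
        if command[i] == "-m" and command[i + 1] == "torch.distributed.run":
            return [*command[:i + 1], "torchrun_jsc", *command[i + 2:]]

    # Anything else is returned unchanged.
    return command


if __name__ == "__main__":
    # `main.py` is a placeholder training script.
    print(to_torchrun_jsc(["torchrun", "--nproc_per_node=4", "main.py"]))
    print(to_torchrun_jsc(["python", "-m", "torch.distributed.run", "main.py"]))
```

In a real job script, the same substitution would normally be made by hand, once `torchrun_jsc` has been installed into the virtual environment as the changed text describes.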