Commit cad4fbff authored by Jan Ebert

Improve hostname problem description

* Mention at the start that not all systems are affected.
* Mention slower communication.
parent e597c354
@@ -129,13 +129,14 @@
 its distributed system. One part of this endpoint is the master
 address (`MASTER_ADDR`), which is an IP or hostname that all processes
 have to connect to for initialization. We can thankfully obtain the
 hostname of our job's first allocated node using Slurm relatively
-easily. However, on JSC systems, we should not use this hostname as it
-is because of awkward naming with regard to InfiniBand network
-interfaces. If we used this hostname, nodes that are "too far apart"
-would not be able to talk to each other. The simple fix is to append
-an "i" after the hostname to use the correct network interface. Since
-this special case only affects some JSC system, we query the machine
-name and append the "i" only if necessary in the example.
+easily. However, on certain JSC systems, we should not use this
+hostname as it is because of awkward naming with regard to InfiniBand
+network interfaces. If we used this hostname, nodes that are "too far
+apart" would not be able to talk to each other and communication
+between nodes would be slower. The simple fix is to append an "i"
+after the hostname to use the correct network interface. Since this
+special case only affects some JSC system, we query the machine name
+and append the "i" only if necessary in the example.
 As if all of the above wasn't enough, PyTorch also won't be able to do
 communication via Gloo, the distributed backend it uses for another
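
For illustration, the hostname fix described in the changed paragraph could be sketched in Python roughly as follows. This is only a sketch under assumptions, not the example the documentation itself refers to: the function names and the `affected_systems` parameter are hypothetical, and the set of affected machine names has to be supplied by the caller because the text does not list them. The only external command used is Slurm's `scontrol show hostnames`, which expands a node list into one hostname per line.

```python
import os
import subprocess


def first_node_hostname():
    """Return the hostname of the job's first allocated node.

    Relies on the SLURM_JOB_NODELIST environment variable, so it only
    works inside a Slurm job.
    """
    result = subprocess.run(
        ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
        capture_output=True,
        text=True,
        check=True,
    )
    # `scontrol show hostnames` prints one hostname per line.
    return result.stdout.split()[0]


def fix_master_addr(hostname, system_name, affected_systems):
    """Append "i" to the hostname only on affected systems.

    `affected_systems` is a hypothetical parameter: a collection of
    machine names for which the InfiniBand interface name differs from
    the plain hostname by a trailing "i".
    """
    if system_name in affected_systems:
        return hostname + "i"
    return hostname
```

Inside a job, the result of `fix_master_addr(first_node_hostname(), system_name, affected_systems)` would then be exported as `MASTER_ADDR`. How `system_name` is queried depends on the site and is left to the caller here.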