Commit cad4fbff authored by Jan Ebert

Improve hostname problem description

* Mention at the start that not all systems are affected.
* Mention slower communication.
parent e597c354
@@ -129,13 +129,14 @@ its distributed system. One part of this endpoint is the master
 address (`MASTER_ADDR`), which is an IP or hostname that all processes
 have to connect to for initialization. We can thankfully obtain the
 hostname of our job's first allocated node using Slurm relatively
-easily. However, on JSC systems, we should not use this hostname as it
-is because of awkward naming with regard to InfiniBand network
-interfaces. If we used this hostname, nodes that are "too far apart"
-would not be able to talk to each other. The simple fix is to append
-an "i" after the hostname to use the correct network interface. Since
-this special case only affects some JSC system, we query the machine
-name and append the "i" only if necessary in the example.
+easily. However, on certain JSC systems, we should not use this
+hostname as it is because of awkward naming with regard to InfiniBand
+network interfaces. If we used this hostname, nodes that are "too far
+apart" would not be able to talk to each other and communication
+between nodes would be slower. The simple fix is to append an "i"
+after the hostname to use the correct network interface. Since this
+special case only affects some JSC system, we query the machine name
+and append the "i" only if necessary in the example.
 
 As if all of the above wasn't enough, PyTorch also won't be able to do
 communication via Gloo, the distributed backend it uses for another
......
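For illustration, the hostname fix that the changed paragraph describes could look roughly like the following Python sketch. It is not the example code the README refers to. The function names, the `affected_systems` parameter, the placeholder system name `example-system`, and the use of a `SYSTEMNAME` environment variable to identify the machine are all assumptions made here; `SLURM_JOB_NODELIST` and `scontrol show hostnames` are standard Slurm.

```python
import os
import subprocess


def first_node_hostname(nodelist: str) -> str:
    """Expand a Slurm node list and return its first hostname."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()[0]


def master_addr(hostname: str, system_name: str,
                affected_systems: frozenset) -> str:
    """Append "i" to the hostname only on systems that need it."""
    if system_name in affected_systems:
        return hostname + "i"
    return hostname


if __name__ == "__main__":
    nodelist = os.environ.get("SLURM_JOB_NODELIST")
    if nodelist is not None:
        # Hypothetical placeholder; the diff does not name the affected systems.
        affected = frozenset({"example-system"})
        # Assumption: the machine name is exposed via `SYSTEMNAME`.
        system = os.environ.get("SYSTEMNAME", "")
        os.environ["MASTER_ADDR"] = master_addr(
            first_node_hostname(nodelist), system, affected)
        print(os.environ["MASTER_ADDR"])
```

Keeping the decision in a small pure function makes the special case easy to check without a running Slurm job.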