diff --git a/README.md b/README.md
index a11c5ede7204afdf0b0d8ffa66b46a5d8f6bc45e..4ba328e39f20c908f74a19fd3a7c06e6295b6e5b 100644
--- a/README.md
+++ b/README.md
@@ -129,13 +129,14 @@ its distributed system. One part of this endpoint is the master
 address (`MASTER_ADDR`), which is an IP or hostname that all processes
 have to connect to for initialization. We can thankfully obtain the
 hostname of our job's first allocated node using Slurm relatively
-easily. However, on JSC systems, we should not use this hostname as it
-is because of awkward naming with regard to InfiniBand network
-interfaces. If we used this hostname, nodes that are "too far apart"
-would not be able to talk to each other. The simple fix is to append
-an "i" after the hostname to use the correct network interface. Since
-this special case only affects some JSC system, we query the machine
-name and append the "i" only if necessary in the example.
+easily. However, on certain JSC systems, we should not use this
+hostname as it is because of awkward naming with regard to InfiniBand
+network interfaces. If we used this hostname, nodes that are "too far
+apart" would not be able to talk to each other and communication
+between nodes would be slower. The simple fix is to append an "i"
+after the hostname to use the correct network interface. Since this
+special case only affects some JSC system, we query the machine name
+and append the "i" only if necessary in the example.
 
 As if all of the above wasn't enough, PyTorch also won't be able to do
 communication via Gloo, the distributed backend it uses for another
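
The hostname handling described in this hunk could look roughly like the Python below. This is a minimal sketch under stated assumptions, not the repository's actual example. `SYSTEMS_NEEDING_SUFFIX`, `first_node_hostname`, `master_addr`, and `export_master_addr` are hypothetical names. The set holds a placeholder because the README text does not say which machines are affected, and reading the machine name from a `SYSTEMNAME` environment variable is an assumption. Only `scontrol show hostnames` and `SLURM_JOB_NODELIST` are standard Slurm.

```python
import os
import subprocess

# Hypothetical placeholder: the README text does not list the affected
# machines, so replace this with the machine names used at your site.
SYSTEMS_NEEDING_SUFFIX = frozenset({"example-system"})


def first_node_hostname(nodelist: str) -> str:
    """Expand a compact Slurm node list and return its first hostname."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    )
    # scontrol prints one hostname per line; take the first one.
    return result.stdout.split()[0]


def master_addr(hostname: str, system_name: str) -> str:
    """Append "i" only on machines that need the other interface name."""
    if system_name in SYSTEMS_NEEDING_SUFFIX:
        return hostname + "i"
    return hostname


def export_master_addr() -> str:
    """Set MASTER_ADDR for the current process (run inside a Slurm job)."""
    nodelist = os.environ["SLURM_JOB_NODELIST"]
    # Assumption: the machine name is exposed as SYSTEMNAME.
    system_name = os.environ.get("SYSTEMNAME", "")
    addr = master_addr(first_node_hostname(nodelist), system_name)
    os.environ["MASTER_ADDR"] = addr
    return addr
```

For instance, `master_addr("node0001", "example-system")` gives `node0001i`, while any machine name outside the set leaves the hostname unchanged.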