From cad4fbff30f87abac441e24f1051e635ebae2fb7 Mon Sep 17 00:00:00 2001
From: janEbert <janpublicebert@posteo.net>
Date: Wed, 14 Aug 2024 12:00:40 +0200
Subject: [PATCH] Improve hostname problem description

* Mention at the start that not all systems are affected.
* Mention slower communication.
---
 README.md | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index a11c5ed..4ba328e 100644
--- a/README.md
+++ b/README.md
@@ -129,13 +129,14 @@ its distributed system. One part of this endpoint is the master
 address (`MASTER_ADDR`), which is an IP or hostname that all processes
 have to connect to for initialization. We can thankfully obtain the
 hostname of our job's first allocated node using Slurm relatively
-easily. However, on JSC systems, we should not use this hostname as it
-is because of awkward naming with regard to InfiniBand network
-interfaces. If we used this hostname, nodes that are "too far apart"
-would not be able to talk to each other. The simple fix is to append
-an "i" after the hostname to use the correct network interface. Since
-this special case only affects some JSC system, we query the machine
-name and append the "i" only if necessary in the example.
+easily. However, on certain JSC systems, we should not use this
+hostname as it is because of awkward naming with regard to InfiniBand
+network interfaces. If we used this hostname, nodes that are "too far
+apart" would not be able to talk to each other and communication
+between nodes would be slower. The simple fix is to append an "i"
+after the hostname to use the correct network interface. Since this
+special case only affects some JSC system, we query the machine name
+and append the "i" only if necessary in the example.
 
 As if all of the above wasn't enough, PyTorch also won't be able to do
 communication via Gloo, the distributed backend it uses for another
-- 
GitLab