Cannot run multiple Maestro applications on the same node under SLURM with GNI (Aries) network
When running a second application on a given node, the DRC mechanism returns an error at drc_access()
time, indicating that the GNI layer is unhappy about the job configuration.
Summary:
Setup:
- salloc an allocation of 3 nodes
- srun one program on node 1 that obtains a DRC token and waits for communication (here: Maestro PM)
- srun a second program on node 2 that uses that token to connect (using libfabric/gni) to node 1
- srun a third program on node 3 that uses the token to connect to nodes 1 and 2

This works as expected. (In this setup the second and third srun commands use the --jobid
argument to find the SLURM allocation.)
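
For reference, the token handling described above boils down to roughly the following. This is a minimal sketch against the Cray rdmacred interface (rdmacred.h), not the actual Maestro PM code; the flag values, helper names, and the out-of-band publication of the credential ID are assumptions.

    #include <stdint.h>
    #include <stdio.h>
    #include <rdmacred.h>   /* Cray DRC: drc_acquire(), drc_access(), ... */

    /* Program 1 (node 1): acquire a DRC credential and publish its ID out of
     * band (command line, file, ...) so that programs 2 and 3 can use it. */
    int acquire_token(uint32_t *credential)
    {
        int rc = drc_acquire(credential, 0);   /* flags assumed: 0 for a plain credential */
        if (rc != DRC_SUCCESS)
            return rc;
        printf("DRC credential: %u\n", *credential);
        return 0;
    }

    /* Programs 2 and 3: use the published credential ID to obtain the GNI
     * cookie that the libfabric/gni connection setup needs. */
    int access_token(uint32_t credential, uint32_t *cookie)
    {
        drc_info_handle_t info;
        int rc = drc_access(credential, 0, &info);   /* the call that fails in the case below */
        if (rc != DRC_SUCCESS)
            return rc;
        *cookie = drc_get_first_cookie(info);
        drc_release_local(&info);
        return 0;
    }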
Now we try to run program 2 and program 3 on the same node instead of on separate nodes. Program 2 still starts up fine, but program 3 receives:
LIBDRC:COMM:INFO drc_protocol.c:184 - finished recving message, len=80
LIBDRC:PROT:INFO drc_protocol.c:255 - successfully read DRC_MSG_ACCESS_RESP message
LIBDRC:CORE:ERROR rdmacred.c:826 - Application didnt configure the job
LIBDRC:CORE:DEBUG rdmacred.c:591 - disconnecting from /tmp/drcc.sock
LIBDRC:CORE:ERROR rdmacred.c:938 - error returned in access response, rc=8
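
These messages are emitted from inside librdmacred; at the application level the failure presumably just surfaces as a non-zero return code from drc_access(), at a call site roughly like this (hypothetical, not the Maestro PM source; same includes as the sketch above):

    /* Program 3's connection setup, sketched: 'credential' is the ID published
     * by program 1. With program 2 already running on the same node,
     * drc_access() fails here and no GNI cookie is available for the
     * libfabric/gni endpoint. */
    int connect_with_token(uint32_t credential)
    {
        drc_info_handle_t info;
        int rc = drc_access(credential, 0, &info);
        if (rc != DRC_SUCCESS) {
            fprintf(stderr, "drc_access(%u) failed: rc=%d\n", credential, rc);
            return rc;
        }
        /* ... drc_get_first_cookie(info), libfabric/gni setup ... */
        drc_release_local(&info);
        return 0;
    }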