Linktest Fails Running Across Modules on JUWELS
Dear Yannik, dear Max,
Did you by any chance try to run linktest using UCX across modules on JUWELS? I can run it on the cluster alone or on the booster alone, but when I use both in one job I get the following error:
$ srun -N 2 -n 2 xenv -L GCC -L ParaStationMPI ./cluster/linktest-2.1-20210131git6328da8/linktest.ucp --num-warmup-messages 1 --num-messages 10 --len-messages 1024 : -N 1 -n 2 xenv -L GCC -L ParaStationMPI ./booster/linktest-2.1-20210131git6328da8/linktest.ucp --num-warmup-messages 1 --num-messages 10 --len-messages 1024
p_Init(r3): unsupported PMI version received: version=2, subversion=0
p_Init(r2): unsupported PMI version received: version=2, subversion=0
p_Init(r1): unsupported PMI version received: version=2, subversion=0
p_Init(r0): unsupported PMI version received: version=2, subversion=0
[linktest_vcluster_ucp.cc in exchange_address:447] warning: UCP worker address lengths do not match. Expected 283, got 338
linktest.ucp: linktest_vcluster_ucp.cc:453: int ucp::VirtualClusterImpl::exchange_address(int, int, int): Assertion `worker_addr_.len == conn.remote_addr_.len' failed.
[linktest_vcluster_ucp.cc in exchange_address:447] warning: UCP worker address lengths do not match. Expected 338, got 283
linktest.ucp: linktest_vcluster_ucp.cc:453: int ucp::VirtualClusterImpl::exchange_address(int, int, int): Assertion `worker_addr_.len == conn.remote_addr_.len' failed.
[linktest_vcluster_ucp.cc in exchange_address:447] warning: UCP worker address lengths do not match. Expected 283, got 338
linktest.ucp: linktest_vcluster_ucp.cc:453: int ucp::VirtualClusterImpl::exchange_address(int, int, int): Assertion `worker_addr_.len == conn.remote_addr_.len' failed.
[linktest_vcluster_ucp.cc in exchange_address:447] warning: UCP worker address lengths do not match. Expected 338, got 283
linktest.ucp: linktest_vcluster_ucp.cc:453: int ucp::VirtualClusterImpl::exchange_address(int, int, int): Assertion `worker_addr_.len == conn.remote_addr_.len' failed.
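It looks to me like there is a bug in the exchange_address function: it asserts that the local and remote UCP worker address lengths are equal, but across modules they differ (283 on one side, 338 on the other). Are you aware of this?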
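To illustrate what I would expect instead, here is a minimal sketch of a length-prefixed exchange that does not assume both sides have the same address size. This is not linktest's actual code: AddressBytes, pack_address and unpack_address are hypothetical names, the transport is left out entirely, and I am assuming both peers use the same byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical container for a serialized worker address (not linktest's type).
using AddressBytes = std::vector<std::uint8_t>;

// Prefix the address with its own length, so the receiver never has to
// assume that the remote address is as long as the local one.
std::vector<std::uint8_t> pack_address(const AddressBytes& addr) {
    assert(addr.size() <= UINT32_MAX);
    const std::uint32_t len = static_cast<std::uint32_t>(addr.size());
    std::vector<std::uint8_t> wire(sizeof(len) + addr.size());
    std::memcpy(wire.data(), &len, sizeof(len));  // host byte order (assumption)
    if (!addr.empty()) {
        std::memcpy(wire.data() + sizeof(len), addr.data(), addr.size());
    }
    return wire;
}

// Read the length header first, then size the result from it.
AddressBytes unpack_address(const std::vector<std::uint8_t>& wire) {
    std::uint32_t len = 0;
    if (wire.size() < sizeof(len)) {
        throw std::runtime_error("truncated length header");
    }
    std::memcpy(&len, wire.data(), sizeof(len));
    if (wire.size() - sizeof(len) < len) {
        throw std::runtime_error("truncated address payload");
    }
    const auto first = wire.begin() + static_cast<std::ptrdiff_t>(sizeof(len));
    return AddressBytes(first, first + static_cast<std::ptrdiff_t>(len));
}
```

With something along these lines, a rank holding a 283-long address could still receive a 338-long one, because the receive buffer is sized from the header rather than from the local length.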
Best, Damian