UCX tag_match warnings on JUSUF
After getting Linktest with the UCX virtual cluster implementation to run on JUSUF, the following warning started to be printed to stdout:
[**********.******] [jsfc***:*****:*] tag_match.c:61 UCX WARN unexpected tag-receive descriptor 0x************* was not matched
Here * stands for a numeric/hexadecimal wildcard.
This warning indicates that a message was received by the UCX handler but never requested inside Linktest. It usually means that messages, often MPI messages, were dropped and that the application is therefore not MPI compliant. That does not apply here: every message is queried in Linktest to ensure good timing information.
One cause of these warnings turned out to be the same issue that led to XXX. Many UCX functions that initiate a request can return success if the request is granted or finished before the initiating function returns. Not testing for this properly, and as a result not waiting for certain tagged requests to complete, caused some of the warnings. Fixing this reduced their number; however, the warnings still came up at random afterwards.
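The "may complete before the call returns" behaviour described above can be illustrated with a small toy model. This is only a sketch: `Request`, `start_recv` and `wait` are hypothetical names invented for the illustration and are not UCX API. In UCX itself the analogous signal is, for example, `ucp_tag_send_nb` returning NULL when the send completed immediately. The point of the sketch is that one helper handles both outcomes, so no request is left without being waited on.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical handle for an operation that is still in flight."""
    polls_left: int  # number of progress calls needed before completion

    def progress(self) -> bool:
        """Advance the operation once; return True once it has completed."""
        if self.polls_left > 0:
            self.polls_left -= 1
        return self.polls_left == 0


def start_recv(polls_needed: int):
    """Toy stand-in for a non-blocking start call (not a UCX function).

    Returns None when the operation finished before the call returned,
    otherwise a Request that the caller still has to drive to completion.
    """
    return None if polls_needed == 0 else Request(polls_needed)


def wait(handle) -> int:
    """Drive a handle to completion; return the number of progress calls."""
    if handle is None:  # completed immediately: nothing left to wait for
        return 0
    polls = 1
    while not handle.progress():
        polls += 1
    return polls
```

Calling `wait(start_recv(0))` returns without polling, while `wait(start_recv(3))` polls until the request reports completion. Routing every start call through one such helper avoids the case where some requests are silently never waited on.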
The issue also seems to stem from our reuse of tags. The following is speculative, however: verifying it would be much more work and would require network-level access. When sending PING and PONG messages we use the tags 100 and 101, respectively. If different messages are sent or received with the same tag, testing a tag can match an older message, because it has the same tag and is still present. Probing the tag then returns success because of the old message. The message is then overwritten, and the system detects that it was never queried by Linktest (hence the warning), because Linktest assumed the message had been received successfully. In short, a race condition occurs.
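The suspected race can be made concrete with a toy model. This sketch is as speculative as the explanation above and does not reproduce UCX internals: `UnexpectedQueue` and `reused_tag_scenario` are hypothetical names, and the only value taken from the text is the PING tag 100. It shows how a probe on a reused tag can succeed because of an older message, leaving the newer message unmatched.

```python
PING_TAG = 100  # tag value taken from the description above


class UnexpectedQueue:
    """Toy model of arrived-but-unmatched messages (not UCX code)."""

    def __init__(self):
        self._messages = []  # (tag, payload) pairs in arrival order

    def arrive(self, tag, payload):
        self._messages.append((tag, payload))

    def probe(self, tag):
        """Return the oldest payload with this tag without removing it."""
        for t, payload in self._messages:
            if t == tag:
                return payload
        return None

    def receive(self, tag):
        """Remove and return the oldest payload with this tag, if any."""
        for i, (t, payload) in enumerate(self._messages):
            if t == tag:
                del self._messages[i]
                return payload
        return None

    def unmatched(self):
        return [payload for _, payload in self._messages]


def reused_tag_scenario():
    q = UnexpectedQueue()
    q.arrive(PING_TAG, "ping-1")  # older message that is still present
    q.arrive(PING_TAG, "ping-2")  # message the current iteration expects
    seen = q.probe(PING_TAG)      # succeeds, but because of the old message
    got = q.receive(PING_TAG)     # the old message is matched again
    return seen, got, q.unmatched()
```

In this model the caller believes the current PING has arrived, although the probe only saw `ping-1`, and `ping-2` is left in the queue without ever being requested. That leftover corresponds to the kind of message the warning complains about.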
A possible resolution is to give all messages unique tags. This, however, wastes buffer space, as each tag is given its own UCX buffer according to the UCX spec, and I do not know how often garbage is collected, buffers are overwritten, etc.
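One way to obtain unique tags, sketched here under assumptions, is to pack the message kind and a sequence number into the 64-bit tag that UCX uses (`ucp_tag_t`). The 32/32 bit split and the helper names `make_tag`, `split_tag` and `matches` are illustrative choices and not part of Linktest or UCX. The sketch says nothing about the buffer cost mentioned above.

```python
TAG_BITS = 64   # UCX tags (ucp_tag_t) are 64-bit unsigned integers
SEQ_BITS = 32   # assumed split: the low 32 bits carry a sequence number
FULL_MASK = (1 << TAG_BITS) - 1
SEQ_MASK = (1 << SEQ_BITS) - 1
KIND_MASK = FULL_MASK ^ SEQ_MASK  # selects only the message-kind bits


def make_tag(kind: int, seq: int) -> int:
    """Combine a message kind (e.g. 100 for PING) with a sequence number."""
    return ((kind << SEQ_BITS) | (seq & SEQ_MASK)) & FULL_MASK


def split_tag(tag: int):
    """Recover (kind, seq) from a tag built by make_tag."""
    return tag >> SEQ_BITS, tag & SEQ_MASK


def matches(tag: int, wanted: int, mask: int) -> bool:
    """Masked tag comparison: only the bits selected by mask must agree."""
    return (tag & mask) == (wanted & mask)
```

With this layout, `make_tag(100, 1)` and `make_tag(100, 2)` differ, so an older PING can no longer satisfy a request for a newer one when matching on the full tag. Matching with `KIND_MASK` still allows a receive for "any PING". The sequence number wraps after 2^32 messages, which has to be acceptable for the message rate in question.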
Related to this issue is issue #46 (closed). The problem seems to occur only if the time between subsequent receive requests on the same tag is extremely short. Issue #46 notes that send-recv pairs with switched to-from partners are performed back to back. Removing one of these pairs also fixes the issue on JUSUF. It does not fix the underlying problem that tags are being reused; fixing that would be a lot more work.