Pool manager reports corrupted CDO size
Observed on miniHPC with the verbs fabric, in a simple MPI+Maestro benchmark with:
- Pool manager (PM)
- One consumer that demands all CDOs (sink all)
- Two producers, each with one CDO
PM broadcasts its info to all components.
MPI_Barrier
The consumer declares, requires, demands, and disposes all CDOs (two in this case).
Producers declare and offer CDOs (one per producer).
MPI_Barrier
Producers withdraw and dispose CDOs.
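For reference, the sequence above can be modeled in a few lines of Python. This is only a sketch of the phase ordering, not the attached benchmark: threads and `threading.Barrier` stand in for MPI ranks and the MPI barrier, a `threading.Event` stands in for a demand that blocks until the CDO is offered, and no Maestro call is made. The function `run_benchmark_model` and every name in it are hypothetical.

```python
import threading


def run_benchmark_model(num_producers=2):
    """Model the benchmark's phase ordering; returns the ordered event log."""
    # One consumer plus the producers take part in each barrier
    # (stand-in for the MPI barrier; the PM itself is not modeled).
    barrier = threading.Barrier(num_producers + 1)
    # offered[i] is set once producer i has offered its CDO
    # (assumption: a demand blocks until the matching offer exists).
    offered = [threading.Event() for _ in range(num_producers)]
    lock = threading.Lock()
    events = []  # entries are (actor, action, cdo_index)

    def log(actor, action, cdo):
        with lock:
            events.append((actor, action, cdo))

    def consumer():
        barrier.wait()  # first barrier
        for cdo in range(num_producers):
            log("consumer", "declare", cdo)
            log("consumer", "require", cdo)
        for cdo in range(num_producers):
            offered[cdo].wait()  # blocks until the CDO has been offered
            log("consumer", "demand", cdo)
            log("consumer", "dispose", cdo)
        barrier.wait()  # second barrier

    def producer(idx):
        name = f"producer{idx}"
        barrier.wait()  # first barrier
        log(name, "declare", idx)
        log(name, "offer", idx)
        offered[idx].set()
        barrier.wait()  # second barrier: all demands are done by now
        log(name, "withdraw", idx)
        log(name, "dispose", idx)

    threads = [threading.Thread(target=consumer)]
    threads += [threading.Thread(target=producer, args=(i,))
                for i in range(num_producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for event in run_benchmark_model():
        print(event)
```

In this model no withdraw can be logged before the last demand, because the consumer only reaches the second barrier after all its demands have returned. If the real benchmark follows the same ordering, an early withdraw would not explain the corrupted size.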
The PM reports a corrupted CDO size to the consumer. This looks like a race condition in the PM, because it does not happen on every run. I also do not think it is caused by offer/withdraw failing to block correctly: there is an MPI barrier between all the offer/require/demand calls and the withdraws, so blocking should not be the issue.
Please find the code and log attached.