ParaStationMPI/5.4.9-1 and running out of device shared memory
When a node is oversubscribed with Linktest tasks under ParaStationMPI/5.4.9-1,
the following errors occur silently, and in random order, partway through Linktest execution:
<PSP:r§TASKID§:shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device>
where §TASKID§ is an 8-digit zero-padded task ID, i.e. the rank. Evidently some device attached to the main node runs out of memory; it is not the main RAM. The error is silent and can cause Linktest to deadlock if no other terminating error occurs afterwards.
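One thing that may be worth checking on an affected node: according to the Linux shmget(2) man page, "No space left on device" (ENOSPC) is returned when the system-wide limit on the number of System V shared memory segments (SHMMNI) or on total shared memory pages (SHMALL) would be exceeded. Whether that is what happens here is an assumption, not something confirmed in this issue. A minimal sketch that compares the limits with the segments currently in use (the helper names are made up for illustration; the /proc paths are the standard Linux ones):

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a /proc-style file, or None if unreadable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def count_segments(path):
    """Count System V shared memory segments: one line each after the header."""
    try:
        lines = Path(path).read_text().splitlines()
    except OSError:
        return None
    return max(len(lines) - 1, 0)


def report(proc_root="/proc"):
    """Collect the SysV shared memory limits and current segment count."""
    return {
        "shmmni": read_int(f"{proc_root}/sys/kernel/shmmni"),        # max number of segments
        "shmall_pages": read_int(f"{proc_root}/sys/kernel/shmall"),  # max total pages
        "segments_in_use": count_segments(f"{proc_root}/sysvipc/shm"),
    }


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

If segments_in_use is close to shmmni while Linktest is running, the segment-count limit would be a plausible culprit; if not, the cause lies elsewhere.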
The admins suggested trying the following two environment variables:
- PSP_ONDEMAND=1: ParaStation environment variable that, if defined, causes buffers to be created only on demand. This increases the initial latency of a connection and does not resolve the issue in this case, as buffers are still required for all connections to be tested.
- PSP_UCP=2: Undocumented ParaStation option for the environment variable PSP_UCP. This seems to solve the issue. According to the documentation: "Default for UCP is 'off', but with a higher minor priority than OPENIB. With PSP_UCP=1 ucp will be used." Note that by 'off' they mean that the variable is undefined; setting the variable to 0 causes the same error as above partway through execution, followed by MPI error messages related to missing protocols.
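For reference, a small sketch of how the PSP_UCP=2 workaround could be applied from a Python wrapper script. The helper names are hypothetical, and the actual Linktest launch line is not given in this issue, so a stand-in command that only echoes the variable is used instead. Whether the variable reaches all ranks depends on how the MPI launcher propagates the environment.

```python
import os
import subprocess
import sys


def psp_env(base=None, ucp="2"):
    """Return a copy of the environment with PSP_UCP set (the original is untouched)."""
    env = dict(os.environ if base is None else base)
    env["PSP_UCP"] = str(ucp)
    return env


def run_with_psp(cmd, ucp="2"):
    """Run cmd (a list of arguments) with PSP_UCP set in its environment."""
    return subprocess.run(
        cmd, env=psp_env(ucp=ucp), capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Stand-in for the real launch command, which is not specified here.
    stand_in = [sys.executable, "-c", "import os; print(os.environ['PSP_UCP'])"]
    print(run_with_psp(stand_in).stdout.strip())
```

In a job script the same effect is achieved by exporting the variable before the launch command.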
Another solution is to use either Intel MPI or Open MPI. Neither seems to have this issue.
Should we make a Wiki page called Help or Common Problems/Challenges, etc.? I do not like either name, since this problem is not our fault.
Edited by Max Holicki