Here we will list fixes to problems that were encountered when building or running LinkTest. These fixes may not necessarily be up to date or even work on your system. We hope they will solve your problem or point you in the right direction.
shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device
When using ParaStation MPI and testing the MPI communication API using LinkTest you may encounter an error message of the following form:
<PSP:r§TASKID§:shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device>
where `§TASKID§` is an 8-digit zero-padded task ID, also known as a rank. If other, terminating error messages occur afterwards LinkTest will terminate; otherwise it will hang.
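To see which ranks reported the error, the message format above can be matched in the captured output. The following is a minimal sketch, assuming the messages appear exactly in the form shown above; `failed_ranks` is a hypothetical helper name, not part of LinkTest or ParaStation MPI.

```shell
# Hypothetical helper: list the ranks that reported the shmget error.
# Reads LinkTest output on stdin and prints one rank per line,
# with the zero padding removed, sorted and de-duplicated.
failed_ranks() {
    grep -o 'PSP:r[0-9]\{8\}:shmget' |
        sed -e 's/^PSP:r0*\([0-9]\)/\1/' -e 's/:shmget$//' |
        sort -n -u
}

# Example with two sample messages in the format described above.
printf '%s\n' \
    '<PSP:r00000012:shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device>' \
    '<PSP:r00000000:shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device>' |
    failed_ranks
```

In practice the input would be the saved standard error of the LinkTest run rather than the sample lines used here.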
This error message, which does not terminate LinkTest by itself, occurs when so many connections are to be tested that no more shared memory can be allocated on the communication device. This is a hardware limitation, not a software limitation. It commonly occurs when oversubscribing or overcommitting nodes with tasks, for example when using twice as many tasks on a node as it has logical cores.
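Whether a planned run oversubscribes a node can be checked before submitting it. The following is a minimal sketch, assuming it runs on the compute node itself; `is_oversubscribed` is a hypothetical helper, and the task count is the "twice as many tasks as logical cores" example from the text, not a recommended value.

```shell
# Hypothetical helper: succeeds if the number of tasks per node ($1)
# exceeds the number of logical cores on the node ($2).
is_oversubscribed() {
    [ "$1" -gt "$2" ]
}

# Number of logical cores currently online on this node.
logical_cores=$(getconf _NPROCESSORS_ONLN)

# Illustrative value only: twice as many tasks as logical cores.
tasks_per_node=$((2 * logical_cores))

if is_oversubscribed "$tasks_per_node" "$logical_cores"; then
    echo "oversubscribed: $tasks_per_node tasks on $logical_cores logical cores"
fi
```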
This error was produced using ParaStation MPI version 5.4.9-1.
At the time of writing four potential solutions exist:
- Test using fewer tasks. Depending on your requirements, however, this may not be possible.
- Use a different MPI implementation, such as Open MPI version 4.1.1.
- Set the `PSP_UCP` environment variable to `2`. This option is undocumented and its exact effect is unknown, so its behavior may change in future ParaStation MPI versions. It causes ParaStation MPI to use UCP in the background, just as setting `PSP_UCP` to `1` does. Setting `PSP_UCP` to `1`, however, makes MPI behave differently from setting it to `2` and still causes the same error. If using this option you may encounter errors similar or identical to the following: `ib_mlx5_dv.c:160 UCX ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory`. In this case you can either try testing with a different communication API or see the next option.
- Upgrade your communication hardware such that it has more memory and can support more connections. Depending on how many connections you want to test this may be enough.
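The `PSP_UCP` option from the list above could be applied in a job script roughly as follows. This is a sketch only: the variable name and value come from the text above, while the way LinkTest is launched is left out on purpose because it depends on your system.

```shell
# Workaround from the list above: select the undocumented PSP_UCP=2 setting.
export PSP_UCP=2

# Confirm the value that child processes, such as the MPI launcher, inherit.
sh -c 'echo "PSP_UCP=$PSP_UCP"'

# Launch LinkTest here with your usual launcher and options.
```

Exporting the variable in the same shell that starts the MPI launcher makes it part of the environment the launcher inherits; whether it is forwarded to every task depends on the launcher and its configuration.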