This page lists fixes for problems encountered when building or running LinkTest. The fixes may be out of date or may not work on your system, but we hope they either solve your problem or point you in the right direction.

[[_TOC_]]

# ParaStation MPI and `shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device`
When you use ParaStation MPI and test the MPI communication API with LinkTest, you may encounter an error message of the following form:

```
<PSP:r§TASKID§:shmget(0, sizeof(shm_com_t), IPC_CREAT | 0777) : No space left on device>
```
where `§TASKID§` is an 8-digit, zero-padded task ID, also known as a rank. If other, terminating error messages follow, LinkTest terminates; otherwise it hangs.

This error occurs when so many connections are to be tested that no more shared memory can be allocated on the communication device. This is a hardware limitation, not a software limitation. It commonly occurs when nodes are oversubscribed or overcommitted with tasks, for example when a node runs twice as many tasks as it has logical cores.
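To make the oversubscription example concrete, the following minimal shell sketch compares a requested number of tasks per node with the number of logical cores. The helper name `is_oversubscribed` and the example numbers are illustrative assumptions, not part of LinkTest. `nproc` (GNU coreutils) is used only as a fallback and reports the processing units available to the current process, which may be fewer than the node's logical cores.

```shell
# Hypothetical helper: exits 0 when the requested tasks per node exceed
# the number of logical cores, and non-zero otherwise.
# $1: tasks per node, $2: logical cores (optional, defaults to `nproc`).
is_oversubscribed() {
    tasks_per_node=$1
    logical_cores=${2:-$(nproc)}
    [ "$tasks_per_node" -gt "$logical_cores" ]
}

# Example numbers only: 256 tasks on a node with 128 logical cores,
# i.e. the "twice as many tasks as logical cores" case from the text.
if is_oversubscribed 256 128; then
    echo "node is oversubscribed; the shmget error becomes more likely"
fi
```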
At the time of writing, four potential solutions exist:

1. Test using fewer tasks. Depending on your requirements, however, this may not be possible.
2. Use a different MPI implementation, such as Open MPI version 4.1.1.
3. Set the `PSP_UCP` environment variable to `2`. This option is undocumented, its exact effect is unknown, and it may change in future ParaStation MPI versions. It causes ParaStation MPI to use UCP in the background, as does setting `PSP_UCP` to `1`. Setting `PSP_UCP` to `1`, however, makes MPI behave differently from setting it to `2` and still produces the same error. With this option you may encounter errors similar or identical to the following: `ib_mlx5_dv.c:160 UCX ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory`. In that case you can either test with a different communication API or see option 4.
4. Upgrade your communication hardware so that it has more memory and can support more connections. Depending on how many connections you want to test, this may be enough.
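As a rough illustration of option 3, the sketch below sets `PSP_UCP` to `2`, runs a launch command, and scans its output for the UCX error quoted above. `LINKTEST_CMD` and `has_ucx_qp_error` are hypothetical names introduced here; replace the placeholder command with your own LinkTest launch line. This is a sketch under those assumptions, not a documented procedure. Because the output is captured, a run that hangs will also block this script.

```shell
# PSP_UCP=2 is the undocumented setting described in option 3.
export PSP_UCP=2

# LINKTEST_CMD is a placeholder (assumption) for your own launch line;
# the default below only prints a reminder and does not run LinkTest.
LINKTEST_CMD=${LINKTEST_CMD:-"echo no LinkTest command configured"}

# Hypothetical helper: reads text on stdin and exits 0 if it contains
# the UCX queue-pair allocation error quoted in option 3.
has_ucx_qp_error() {
    grep -qF 'mlx5dv_devx_obj_create(QP) failed'
}

# Run the command, capturing stdout and stderr so they can be scanned.
output=$($LINKTEST_CMD 2>&1)
printf '%s\n' "$output"

if printf '%s\n' "$output" | has_ucx_qp_error; then
    echo "UCX could not allocate a queue pair; try another communication API or see option 4" >&2
fi
```

Exporting the variable before the launch line, rather than setting it inside the job, is one way to make sure every rank started by the launcher inherits it. Whether your launcher forwards the environment is an assumption you should verify on your system.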