Allocated memory buffers are slow.
We have a problem with the memory allocator. The buffers it allocates are very slow to access, which hurts performance. The problem lies either with the pinned memory-mapped allocator or with the C++ structure itself. I ran some tests which show that copying from one of our buffers is slower than copying from a POSIX memory-aligned buffer:
// buf_multi is a vector of unique pointers to our memory buffers
void** const sendbuf=(void**)malloc(sizeof(void*)*num_msg);
for(auto k=0;k<num_msg;k++){
posix_memalign(&sendbuf[k],align,size_msg);
memset(sendbuf[k],1,size_msg);
}
void** const recvbuf=(void**)malloc(sizeof(void*)*num_msg);
for(auto k=0;k<num_msg;k++) posix_memalign(&recvbuf[k],align,size_msg);
// Our Memory Buffer to POSIX Buffer
tv=MPI_Wtime();
for(auto k=0;k<num_msg;k++)memcpy(recvbuf[k],buf_multi[k]->pointer<void>(),buf_multi[k]->len());
tv=(MPI_Wtime()-tv)/num_msg;
printf("\nBuffer Copy Pinned Time = %f s (RANK=%d, SIZE=%d)\n",tv,rank(),buf_multi[0]->len());fflush(stdout);
// POSIX to POSIX
tv=MPI_Wtime();
for(auto k=0;k<num_msg;k++)memcpy(recvbuf[k],sendbuf[k],size_msg);
tv=(MPI_Wtime()-tv)/num_msg;
printf("\nBuffer Copy POSIX Time = %f s (RANK=%d, SIZE=%d)\n",tv,rank(),size_msg);fflush(stdout);
The results for 1000 messages are:
Buffer Copy Pinned Time = 0.004011 s (RANK=0, SIZE=8388608)
Buffer Copy POSIX Time = 0.001913 s (RANK=0, SIZE=8388608)
We see that our memory-mapped solution is about a factor of 2 slower (4.011 ms versus 1.913 ms per 8 MiB copy). This is also the case when we use our solution as the destination. This might explain why our performance is worse than that of the OMB suite.
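One thing that may be worth ruling out before blaming the allocator: in the test above, recvbuf is never written before the first timed loop, so that loop may also be paying first-touch page faults on the destination, while the POSIX-to-POSIX loop reuses pages that are already mapped. Below is a minimal sketch of a fairer comparison, assuming a Linux/POSIX system. The helpers alloc_touched, free_all and time_copies are hypothetical names made up for this illustration, not part of our code base, and std::chrono stands in for MPI_Wtime so the snippet compiles without MPI.

```cpp
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <chrono>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper: allocate `count` aligned buffers of `size` bytes and
// write every byte, so all pages are mapped before any timing starts.
// (Sketch only: buffers already allocated are leaked if a later one fails.)
std::vector<void*> alloc_touched(std::size_t count, std::size_t align,
                                 std::size_t size, int fill) {
    std::vector<void*> bufs(count, nullptr);
    for (auto& p : bufs) {
        if (posix_memalign(&p, align, size) != 0)
            throw std::runtime_error("posix_memalign failed");
        std::memset(p, fill, size);
    }
    return bufs;
}

// Hypothetical helper: release buffers obtained from alloc_touched.
void free_all(std::vector<void*>& bufs) {
    for (void* p : bufs) free(p);
    bufs.clear();
}

// Hypothetical helper: average seconds per memcpy of `size` bytes.
// One untimed warm-up pass runs first, so page faults are excluded
// from the measurement.
double time_copies(const std::vector<void*>& dst,
                   const std::vector<void*>& src, std::size_t size) {
    const std::size_t n = dst.size() < src.size() ? dst.size() : src.size();
    if (n == 0) return 0.0;

    // Warm-up pass (not timed).
    for (std::size_t k = 0; k < n; ++k) std::memcpy(dst[k], src[k], size);

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t k = 0; k < n; ++k) std::memcpy(dst[k], src[k], size);
    const auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(t1 - t0).count() /
           static_cast<double>(n);
}
```

To compare the two cases, the same time_copies call could be run once with src holding the raw pointers taken from buf_multi[k]->pointer<void>() and once with src holding the POSIX buffers, using the same pre-touched destination buffers each time. If the factor of 2 survives that, the allocator or the structure is the more likely culprit.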
Edited by Max Holicki