PM crashes with GNI desc_id assertion
The simple_pool_manager and a single consumer crash. It seems that the two crashes happen at the same time and are related to each other. They do not happen at startup. Sometimes several consumers manage to complete their jobs before this happens.
The simple_pool_manager crashes with:
libfabric:39035:gni:ep_data:__nic_get_completed_txd():532<warn> [39035:2] CQ error status: SOURCE_SSID:AT_MDD_INV:CPLTN_DREQ
simple_pool_manager: prov/gni/src/gnix_nic.c:342: __desc_lkup_by_id: Assertion '(desc_id >= 0) && (desc_id <= nic->max_tx_desc_id)' failed.
and a core file is created.
The consumer crashes with:
[E:cdo] Multio Maestro Syphon - 11:0 5 t46917990029056 (nid00209 27211) 1082534423210226: mstro_cdo_dispose(cdo.c:1864) CDO 'class:rd,date:20190701,domain:g,expver:xxxx,levelist:6,levtype:ml,number:11,param:75,step:10,stream:enfo,time:1200,type:pf' (id 1450a5ee-9ea4-5403-9bee-6f613a27a400.05ac) state 32, not disposable
and a core file is created.
Here is the stat output for the core files of job-0 (PM) and job-11 (consumer):
mlompar@daint106:/scratch/snx3000/mlompar/kronos-run/output-maestro> stat job-0/core.39035
File: job-0/core.39035
Size: 140339044352 Blocks: 5533104 IO Block: 4194304 regular file
Device: 90201f9ah/2418024346d Inode: 648576201480272819 Links: 1
Access: (0600/-rw-------) Uid: (28160/ mlompar) Gid: (32359/ g129)
Access: 2021-09-21 09:10:14.000000000 +0200
Modify: 2021-09-21 09:11:17.000000000 +0200
Change: 2021-09-21 09:11:17.000000000 +0200
Birth: -
mlompar@daint106:/scratch/snx3000/mlompar/kronos-run/output-maestro> stat job-11/work/core.27211
File: job-11/work/core.27211
Size: 3499388928 Blocks: 3010120 IO Block: 4194304 regular file
Device: 90201f9ah/2418024346d Inode: 648576240453682853 Links: 1
Access: (0600/-rw-------) Uid: (28160/ mlompar) Gid: (32359/ g129)
Access: 2021-09-21 09:10:07.000000000 +0200
Modify: 2021-09-21 09:10:12.000000000 +0200
Change: 2021-09-21 09:10:12.000000000 +0200
Birth: -
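Judging by the Modify timestamps of the two core files (09:10:12 for the consumer, 09:11:17 for the PM), it seems that the consumer crashes before the PM.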
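The timestamp comparison can be scripted as a rough check. This is a minimal sketch, assuming GNU coreutils `stat`; the helper name `first_modified` is hypothetical and not part of maestro-core. The Modify time records when the dump finished writing, and the PM core is much larger than the consumer core, so the result is only an approximate ordering of the crashes.

```shell
# Print whichever of two files has the earlier modification time.
# Assumes GNU coreutils: `stat -c %Y` prints mtime as seconds since the epoch.
first_modified() {
    a_mtime=$(stat -c %Y "$1") || return 1
    b_mtime=$(stat -c %Y "$2") || return 1
    if [ "$a_mtime" -le "$b_mtime" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
}

# Example with the paths from this report (run from the output-maestro directory):
# first_modified job-0/core.39035 job-11/work/core.27211
```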
We are running the new release of maestro-core 519fea86 on Piz Daint.
This is the list of loaded modules:
- module switch PrgEnv-cray PrgEnv-intel
- module load intel
- module load daint-mc
- module load gni-headers
- module load rdma-credentials
- module load craype-hugepages8M
I built maestro-core with the --enable-debug option and set FI_LOG_LEVEL to warn and DRC_DEBUG_LEVEL to DEBUG in order to provide more debug information. MSTRO_LOG_LEVEL was set to 2 for the simple_pool_manager and logging was disabled for the consumers.
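For reproduction, the PM-side debug settings described above might look like the following in a POSIX shell. This is a sketch under assumptions: the variable names and values are the ones given in this report, but where they were exported (job script or interactive shell) is assumed. How logging was disabled for the consumers is not stated here, so it is not shown.

```shell
# Debug settings for the simple_pool_manager run, as described in this report.
# Assumption: they are exported in the environment that launches the PM.
export FI_LOG_LEVEL=warn        # libfabric log level
export DRC_DEBUG_LEVEL=DEBUG    # DRC debug output
export MSTRO_LOG_LEVEL=2        # maestro log level used for the PM

# maestro-core itself was configured with --enable-debug at build time.
```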
Logs from the PM and the consumer: gni_problem.tgz