Ensure PM does not leak memory
A large workflow (11 million CDOs, 20–300 bytes of attributes each) runs OOM on the PM for @lompar1 in a 60 GB job allocation.
We need to do three things:

- Do a start-to-finish memory-leak check of the pool manager in a scenario similar to what @lompar1 does (some events, some transfers) and ensure no leaks accumulate over the run. A few one-time leaks (e.g., on the schema reads) are acceptable.
- Account for the allocations that survive for each CDO that passes through the pool. I believe we have at least one registry entry ("trace") that intentionally keeps unique name/cdoid mappings alive for the whole workflow. But that should be a minimal amount (a few dozen bytes per CDO); anything beyond that is bad. Tracing this is hard: we need a malloc library that shows us each allocation's size and backtrace, so we can count surviving allocations against live/dead CDOs.
- At the end, document the expected memory footprint per CDO passed through a workflow in the README, so users can estimate their requirements.
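To illustrate the kind of per-backtrace accounting the second point asks for: the PM itself is presumably native code and would use a tool such as heaptrack or Valgrind's massif, but Python's `tracemalloc` shows the shape of the analysis in a few lines. The `registry` dict and `process_cdo` function below are hypothetical stand-ins for the PM's per-CDO registry trace, not real PM code; the idea is to group surviving allocations by allocating backtrace and divide by the number of CDOs processed.

```python
import tracemalloc

# Hypothetical stand-in for the PM's registry: each CDO that passes
# through leaves behind one intentional survivor, a unique
# name -> cdoid mapping.
registry = {}

def process_cdo(cdoid):
    # The intentional per-CDO survivor. In a buggy build, additional
    # allocations made here would also survive and show up below.
    registry[f"cdo-{cdoid}"] = cdoid

tracemalloc.start(25)  # record up to 25 frames of backtrace per allocation
n_cdos = 10_000
for i in range(n_cdos):
    process_cdo(i)

snapshot = tracemalloc.take_snapshot()
# Group live allocations by the backtrace that made them, largest first.
stats = snapshot.statistics("traceback")
top = stats[0]
print(f"top surviving site: {top.size} B over {top.count} blocks "
      f"(~{top.size / n_cdos:.1f} B per CDO)")
```

Dividing each surviving site's total size by the number of CDOs processed is what lets us separate the intentional few-dozen-bytes-per-CDO trace from anything that grows faster than it should.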
We could also consider a flag to operate in an aggressively memory-minimal mode later (out of scope for this issue).
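For the README estimate, a quick back-of-the-envelope check against the numbers at the top of this issue: 11 million CDOs in a 60 GB allocation means the PM hits OOM at under 6 KB of memory per CDO, so a documented footprint of a few dozen bytes per CDO leaves ample headroom. The 48-byte trace size below is an assumed placeholder, not a measured value.

```python
GIB = 1024**3
allocation = 60 * GIB           # @lompar1's job allocation
n_cdos = 11_000_000             # CDOs in the failing workflow

# Upper bound: how much PM memory per CDO before the job OOMs.
budget_per_cdo = allocation / n_cdos
print(f"budget: ~{budget_per_cdo:.0f} B of PM memory per CDO")

# If the only intentional survivor is the registry trace entry
# (assumed 48 B here), the total intentional footprint is small:
trace_bytes = 48                # assumption, not a measured value
total_trace = n_cdos * trace_bytes
print(f"intentional survivors: ~{total_trace / GIB:.2f} GiB total")
```

The gap between the ~0.5 GiB of intentional survivors and the 60 GB allocation is what the leak check and the per-allocation accounting above need to explain.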