Maestro Core

    This repository contains the Maestro Core Library, as developed for D3.2. It features the Maestro Core API, used by the example code and an MVP demonstrator.

    Installation

    Please refer to INSTALL.md.

    Examples

    Please use

    make check

    to build and run the test examples. This may take some time.
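    For reference, a typical sequence might look as follows, assuming the autotools-style build described in INSTALL.md (the install prefix is illustrative):

    # configure and build, then build and run the test examples
    ./configure --prefix=$HOME/maestro
    make
    make check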

    Limits

    Maestro-core needs quite a few file descriptors and also needs to lock pages into memory for RDMA purposes. We try to emit a diagnostic message when errors may be due to resource constraints. Still, we recommend

    ulimit -n 1024
    ulimit -l 256

    to allow at least 1024 open file descriptors and 256 kB of lockable memory for RDMA.
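    To inspect the limits of your current shell before running any Maestro component:

    # show current limits (ulimit -l reports kbytes)
    ulimit -n
    ulimit -l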

    Local multithreaded demo (MVP1)

    MVP1 is a local multithreaded demo application. Further reading (D3.2) is available here: https://bscw.zam.kfa-juelich.de/bscw/bscw.cgi/2995531

    The reference version is tagged d3.2-draft on the master branch.

    make check also builds the demo executable demo_mvp_d3_2 in addition to the examples, and runs it. ./run_demo.sh runs the demo on its own; see the commands below.
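    In short:

    # build the examples and demo_mvp_d3_2, then run everything
    make check
    # run the MVP1 demo by itself
    ./run_demo.sh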

    Adaptive Transport demo

    The pool manager interlock demo uses a three-application setup, comprising one pool manager process, and demonstrates the GFS and MIO transports. Further reading (D5.5, to appear on BSCW) and information on how to set up a VM to run Mero can be found here: https://gitlab.version.fz-juelich.de/maestro/maestro-mero-vm

    The reference version is tagged d5.5-review on the master branch.

    The pool manager interlock demonstration ./tests/check_pm_interlock.sh is launched automatically by make check.

    Fabric provider choice / high-performance interconnect usage

    Maestro-core tries hard to insulate the user from the multitude of network provider choices by using libfabric and transparently choosing 'the best' connectivity between components. Unfortunately this functionality is not fully working yet, due to issues in the upstream libfabric code and to incomplete testing of our usage of it.

    The safest (and lowest-performance) connectivity is provided by the sockets provider. You can force its use by setting

    FI_PROVIDER=sockets

    in your environment. It should work on almost any network that supports TCP/IP, including Ethernet, IB, and GNI (Aries).
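    For example, to force the sockets provider for a single run (./my_app is a stand-in for your Maestro application binary):

    # per-invocation
    FI_PROVIDER=sockets ./my_app
    # or for the whole session
    export FI_PROVIDER=sockets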

    Usage of the tcp and tcp;ofi_rxm providers is currently broken; an upstream issue is open.

    On Cray XC systems the GNI (Aries) provider is supported. If you compile with the rdma-credentials and gni-headers modules loaded, the GNI provider should be autoselected if a GNI NIC is found at runtime.

    NOTE that GNI NICs on login nodes typically do not work, due to a limitation of the libfabric/gni driver, so you will have to run your application exclusively on compute nodes, or manually switch the components running on login nodes to the sockets provider.

    The GNI driver can be forced by setting

    FI_PROVIDER=gni
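    Putting the pieces together, a sketch for a Cray XC system (the srun invocation assumes a SLURM-managed machine, and ./my_app again stands in for your Maestro application):

    # build with the GNI-related modules loaded
    module load rdma-credentials gni-headers
    make
    # run on compute nodes only; GNI NICs on login nodes will not work
    FI_PROVIDER=gni srun -N 1 ./my_app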

    If you are using GNI you will implicitly be using Cray libdrc, a mechanism for obtaining network authentication tokens. Maestro-core requests workflow-level tokens that even support running multiple components of a workflow under different user IDs. In some cases the system may run out of tokens, and there is no user-level token inquiry tool available. If GNI startup fails, try running your application with

    DRC_DEBUG_LEVEL=DEBUG

    and look for an error message like

    LIBDRC:CORE:DEBUG        rdmacred.c:658 - finished acquire request, rc=-28 

    If you see this, contact your system admin to clear cached DRC credentials.
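    A minimal way to capture this, again assuming ./my_app stands in for your Maestro application:

    # enable libdrc debug output and scan for the acquire failure;
    # rc=-28 corresponds to the out-of-tokens case described above
    DRC_DEBUG_LEVEL=DEBUG ./my_app 2>&1 | grep 'finished acquire request'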

    Documentation

    Doxygen documentation is available and is built into the ./docs folder.

    Common issues/FAQs

    • If you have many network interfaces or many addresses assigned to an interface (which can happen rather suddenly with IPv6), the libfabric setup of the pool manager may hit 'too many open files' (errno=-24) issues. Check ulimit -n and increase the limit, as in the sketch after this list.

    • If you see clients stuck at JOIN time while everything else looks good, there is a chance that your firewall is intercepting the packets.
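    A quick check for the file-descriptor case:

    # show the current open-file limit, then raise it for this shell
    # (1024 is the minimum recommended in the Limits section above)
    ulimit -n
    ulimit -n 1024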