Mamba: Managed Abstracted Memory Array

A library-based programming model for C, C++ and Fortran built on Managed Abstract Memory Arrays, aiming to deliver simplified and efficient use of diverse memory systems to application developers in a performance-portable way. MAMBA arrays exploit a unified memory interface that abstracts memory from traditional memory devices, accelerators and storage alike. The library aims for good performance portability with an easy-to-use approach that requires minimal code intrusion. See docs/MambaIntroduction-vX.Y.Z.pdf for an extended introduction, with accompanying slide decks.

How to build and run

# run in the top-level source directory;
# if the loop analysis module is required, run autogen.sh instead of autoreconf -i
autoreconf -i
mkdir build;
cd build;
../configure [--prefix=/path/to/install/dir]                   \
             [--enable-discovery[=yes|no|default]]             \
             [--with-fortran]                                  \
             [--with-fortran-ISO-bindings-includedir=/p/a/t/h] \
             [--enable-embedded]                               \
             [--enable-cuda[=yes|no|<arch>]]                   \
             [--enable-hip-rocm[=yes|no]]                      \
             [--enable-opencl[=yes|no]]                        \
             [--with-opencl=/path/to/opencl/install]           \
             [--enable-pmem[=yes|no]]                          \
             [--with-memkind=/path/to/libmemkind/install]      \
             [--with-numa[=/path/to/libnuma/install]]          \
             [--with-loop-analysis]                            \
             [--with-cost-model[=/path/to/costmodel/install]]  \
             [--with-sicm=/path/to/sicm/install]               \
             [--with-umpire=/path/to/umpire/install]           \
             [--with-jemalloc=/path/to/jemalloc/install]       \
             [--with-jemalloc-prefix=<prefix>]
make;
make check-tests;
make check-examples;
make install;    # optional
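For instance, a typical CUDA-enabled build might look like the following sketch (the install prefix and GPU architecture are illustrative; adjust them for your system):

autoreconf -i
mkdir build && cd build
../configure --prefix=$HOME/mamba-install \
             --enable-discovery           \
             --enable-cuda=sm_70          \
             --with-fortran
make && make check-tests && make check-examples
make install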

Configure

autogen.sh

Only required when using --with-loop-analysis. This fetches and updates the Mamba loop analysis dependencies as git submodules. The step is optional if you have already cloned the repository recursively with git clone --recursive; in that case you may use autoreconf -i instead.
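A sketch of the two equivalent ways to prepare the loop analysis dependencies (the repository URL and clone directory are placeholders):

# Option 1: clone recursively, then use autoreconf as usual
git clone --recursive <repository-url>
cd <clone-dir> && autoreconf -i

# Option 2: plain clone, then let autogen.sh fetch the submodules
git clone <repository-url>
cd <clone-dir> && ./autogen.sh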

--prefix

Set the directory prefix for make install

--enable-discovery

Enable discovery mode, where Mamba uses hwloc to analyse the memory topology and construct a set of appropriate memory spaces during initialisation. This requires hwloc >= 2.0 to be installed. The default behaviour is to look for a suitable version of hwloc and enable discovery if one is found; otherwise discovery is disabled and a warning is issued at configure time.

--with-fortran

Build the Fortran Mamba library.

--with-fortran-ISO-bindings-includedir

Specify a non-standard path to the location of ISO_Fortran_binding.h, used for the C/Fortran ISO bindings (required for the Fortran build)

--enable-embedded

Enable embedded support, generating libtool convenience libraries so that the library and its dependencies can easily be imported into your own project.

--enable-cuda

Enable CUDA support in the memory manager. The configure script lists all pkg-config module files whose name contains the sub-string 'cuda' and tests each until one provides the requested support.
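To see which pkg-config modules the configure script will consider on your system, you can list them yourself (output varies by installation):

pkg-config --list-all | grep -i cuda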

--enable-hip-rocm

Enable HIP support for AMD devices (via ROCm) in the memory manager. We use hipconfig to determine appropriate CFLAGS; see the Common Issues section for information on passing additional hipcc flags.
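To inspect the flags that hipconfig reports on your system (output varies with the ROCm installation):

hipconfig --cpp_config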

--enable-opencl

Enable OpenCL support, currently tested on AMD and NVIDIA GPU devices and Xilinx FPGA devices

--with-opencl

Provide a non-standard path to your OpenCL installation

--enable-pmem

Enable persistent memory support, such as Intel Optane non-volatile DIMMs. Requires the memkind library.

--with-memkind

Build with libmemkind support, which allows HBM (e.g. Intel KNL MCDRAM) and persistent memory allocation (e.g. Intel Optane NV-DIMMs). Disabled by default.

--with-numa

Build with libnuma support for numa-aware memory spaces.

--with-loop-analysis

Build with loop analysis features. The loop analysis module depends on external loop analysis libraries; during autogen, the appropriate libraries are downloaded as git submodules. This also introduces a dependency on LLVM; if you have trouble building the loopanalyzer library, refer to the build instructions in the loopanalyzer repository. If you have previously built without this option you will also need to run make clean. To test the support libraries, make check will run tests for all dependencies integrated into the Mamba build system.
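A sketch of a loop-analysis-enabled build, assuming the submodules have not yet been fetched:

./autogen.sh
mkdir build && cd build
../configure --with-loop-analysis
make && make check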

--with-cost-model

Build with cost model library support for automatic tile sizing features.

--with-sicm

Experimental external library support. Allows underlying memory allocation using the LANL/SICM memory manager.

--with-umpire

Experimental external library support. Allows underlying memory allocation using the LLNL/Umpire memory manager.

--with-jemalloc and --with-jemalloc-prefix

Allows underlying memory allocation using the jemalloc malloc implementation. The default prefix for the jemalloc function namespace is je_.
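For example, assuming jemalloc was installed to a custom location and built with its default je_ function prefix (the path is illustrative):

../configure --with-jemalloc=/opt/jemalloc --with-jemalloc-prefix=je_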

Additional options

To change the compiler used, set CC=..., CXX=... and/or FTN=... during configure.

Cray (CCE)

On a Cray system, it is typical to use the compiler wrappers to manage the compilation environment correctly:

./configure CC=cc CXX=CC FTN=ftn ...

GNU

Add -std=gnu11 to get C11 std with gnu extensions, required for posix pthread lock structures.

./configure CFLAGS="-std=gnu11" ...

Configuration variables

Additional compile-time and run-time variables can be set to adjust the default behaviour to better fit your usage.

Compile-time

The following variables can be set at compile time (for example by providing them to CPPFLAGS during the call to configure). To set their value, use the format -D<name>=<value>; an example follows the list below.

  • MMB_LOG_LEVEL: Compile-time max log level cut-off, default MMB_LOG_DEBUG
  • MMB_CONFIG_PROVIDER_DEFAULT: Default memory provider to use to allocate memory when none is requested. Default: MMB_NATIVE.
  • MMB_CONFIG_STRATEGY_DEFAULT: Default memory allocation strategy to use when none is requested. Default: MMB_STRATEGY_NONE.
  • MMB_CONFIG_EXECUTION_CONTEXT_GPU_DEFAULT: Default execution context to use when allocating and copying memory to/from GPUs. Default: MMB_GPU_CUDA.
  • MMB_CONFIG_PROVIDER_DEFAULT_ENV_NAME: Name of the environment variable consulted when setting the default provider. Default: MMB_CONFIG_PROVIDER_DEFAULT.
  • MMB_CONFIG_STRATEGY_DEFAULT_ENV_NAME: Name of the environment variable consulted when setting the default strategy. Default: MMB_CONFIG_STRATEGY_DEFAULT.
  • MMB_CONFIG_INTERFACE_NAME_DEFAULT_ENV_NAME: Name of the environment variable consulted when setting the default interface name. Default: MMB_CONFIG_INTERFACE_NAME_DEFAULT.
  • MMB_CONFIG_EXECUTION_CONTEXT_GPU_DEFAULT_ENV_NAME: Name of the environment variable consulted when setting the default execution context for the GPU. Default: MMB_CONFIG_EXECUTION_CONTEXT_GPU_DEFAULT.
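For example, to lower the compile-time log cut-off, pass the definition through CPPFLAGS at configure time (MMB_LOG_WARN is assumed here to be one of the library's log-level constants):

../configure CPPFLAGS="-DMMB_LOG_LEVEL=MMB_LOG_WARN"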

The following variables can be set in the environment at compile time to modify compilation behaviour, using the format export <name>=<value> or ./configure <name>=<value>

  • MMB_CONFIG_HIPCC_EXTRA_CPPFLAGS: Extra flags to pass to hipcc compiler during compilation of .hip files.

Run-time

The following variables can be set in the environment at run-time to modify some of the compile-time defined behaviours. These variables are read only once, during library initialization; see the example at the end of this section.

  • MMB_CONFIG_PROVIDER_DEFAULT
  • MMB_CONFIG_STRATEGY_DEFAULT
  • MMB_CONFIG_INTERFACE_NAME_DEFAULT
  • MMB_CONFIG_EXECUTION_CONTEXT_GPU_DEFAULT

These variables default to the compile-time values. Their names can be changed at compile time by setting MMB_CONFIG_PROVIDER_DEFAULT_ENV_NAME, MMB_CONFIG_STRATEGY_DEFAULT_ENV_NAME, MMB_CONFIG_INTERFACE_NAME_DEFAULT_ENV_NAME and MMB_CONFIG_EXECUTION_CONTEXT_GPU_DEFAULT_ENV_NAME respectively. For simplicity, MMB_CONFIG_EXECUTION_CONTEXT_GPU_DEFAULT also accepts NONE as a valid choice.

The following variable can modify the log level at run-time, up to the maximum compile-time cut-off, and overrides the log level set through the API.

  • MMB_LOG_LEVEL: Run-time log level setting; it cannot exceed the maximum cut-off defined at compile time.
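For example, before launching an application (the values shown correspond to the documented compile-time defaults, the application name is a placeholder, and the exact accepted value format may depend on the library version):

export MMB_CONFIG_EXECUTION_CONTEXT_GPU_DEFAULT=MMB_GPU_CUDA
export MMB_LOG_LEVEL=MMB_LOG_DEBUG
./my_mamba_application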

Examples

Examples are found in mamba/build/examples/, or /path/to/install/dir/examples. Most examples are provided in both C and Fortran; each is briefly described here with instructions on use.

1d_array_copy

This shows the construction, tiled initialisation, and copy of a 1d mamba array to another 1d mamba array with matching layout and size, with full error checking.

Source file: examples/c/1d_array_copy.c | examples/fortran/1d_array_copy.f90

Usage: ./1d_array_copy | ./1d_array_copy_f

1d_array_copy_wrapped

The same as 1d_array_copy but using arrays constructed from existing user pointers.

Source file: examples/c/1d_array_copy_wrapped.c | examples/fortran/1d_array_copy_wrapped.f90

Usage: ./1d_array_copy_wrapped | ./1d_array_copy_wrapped_f

tile_duplicate

This shows construction of a 1d array, tiling, duplication and merging of tiles.

Source file: examples/c/tile_duplicate.c

Usage: ./tile_duplicate

matrix_multiply

This demonstrates a tiled matrix multiply using 3 mamba arrays constructed on top of pre-initialised (with random or identity values) matrix buffers.

Source file: examples/c/matrix_multiply.c

Usage: (all args optional): ./matrix_multiply -v (for verbose mode) -t N (for tile size NxN) -m N (for matrix size NxN) -i (use identity for matrix B)

matrix_multiply_cuda (C only)

This demonstrates a tiled matrix multiply using multiple mamba arrays constructed on top of pre-initialised (with random or identity values) matrix buffers. This example also presents how to allocate and use memory on different memory devices (DRAM, GPU, HBM, ...), and how to copy from one memory tier to another. It also shows how to use different strategies and/or different memory providers.

This example works the same as the matrix_multiply.c example, except that it requires extra steps to pass the data to the actual kernel (in addition to allocating the data in GPU memory, the tiling information needs to be forwarded as well). The CUDA file only deals with this forwarding (the packing is done in examples/c/matrix_multiply_cuda.c). For now the tiles are not executed in parallel; parallel execution is work in progress.

Source files: examples/c/matrix_multiply_cuda.c, examples/c/matrix_multiply_cuda_ker.cu, examples/c/matrix_multiply_cuda.h

Usage: (all args optional): ./matrix_multiply_cuda -v (for verbose mode) -t N (for tile size NxN) -m N (for matrix size NxN) -i (use identity for matrix B)

loop description (C only)

This example demonstrates the description of a loop using the loop description, followed by PET/ISL based polyhedral analysis of the loop with dependence computation. The loop description, auxiliary analysis information and calculated loop dependencies are output to the terminal.

Source files: examples/c/loop_description.c

Usage: ./loop_description

report_mem_state (C only)

This example shows the output of the function mmb_dump_memory_state, which dumps the current state of the memory system, as retained by the MAMBA Memory Manager, to the FILE * given as a parameter.

Source file: examples/c/report_mem_state.c

Usage: ./report_mem_state

Common Issues

C standard

If you force standard conformance, with e.g. -std=c11, you may also need to pass something like -D_XOPEN_SOURCE=500 to get required POSIX features. Alternatively use -std=gnu11.
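For example, a configure invocation along these lines should work for a strict C11 build with the required POSIX feature macros:

./configure CFLAGS="-std=c11 -D_XOPEN_SOURCE=500" ...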

HIP ROCM Support

If you see the following error:

.../hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

Appropriate HIP ARCH definitions may not have been picked up during compilation. This can, for example, occur when compiling on a login node without GPUs attached. If you have appropriate environment/module resolution for this, use that; otherwise you can forward extra preprocessor arguments to the hipcc compiler during the Mamba build via the following environment variable, which you need to export prior to configuration:

# Valid for AMD MI60; export before configure
export MMB_CONFIG_HIPCC_EXTRA_CPPFLAGS="-D__HIP_ARCH_GFX906__=1 --cuda-gpu-arch=gfx906"

To verify, you can run hipcc --cxxflags and look for something like the above. Setting HIPCC_VERBOSE=7 will additionally provide verbose output from the hipcc compiler.

Furthermore, discovery of AMD GPUs via hwloc is currently not able to find the available memory size, and so memory spaces created automatically during discovery will be of unlimited size (i.e. limited by hip runtime, rather than Mamba).

CUDA

If you see the following error:

no kernel image is available for execution on the device.

You may be using the wrong CUDA architecture for the GPU device available on your node. You can change the architecture by setting it on your configure line with ./configure --enable-cuda=<arch>. The default architecture is sm_60. If this value is too high for your device, you may want to try sm_30.
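For example, to target an older device (the architecture value is illustrative; choose the one matching your GPU):

./configure --enable-cuda=sm_30 ...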

OpenCL/FPGA

The buffer_copy_opencl example (run automatically during make check or make check-examples) will try to build a kernel at run-time; for most FPGA platforms OpenCL does not have access to a compiler, so this will likely fail. To have this example run, you must build a bitstream for your specific FPGA that matches the example kernel in examples/c/buffer_copy_opencl.c, and export the path to this bitstream via the environment variable MMB_CONFIG_BUFFER_COPY_OPENCL_BINARY before running the example.
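For example (the bitstream path and file name are placeholders; the required binary format depends on your FPGA toolchain):

export MMB_CONFIG_BUFFER_COPY_OPENCL_BINARY=/path/to/buffer_copy_opencl.xclbin
make check-examples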