A Load Balancing Library (ALL)

The library aims to provide an easy way to include dynamic domain-based load balancing into particle-based simulation codes. The library is developed in the Simulation Laboratory Molecular Systems of the Juelich Supercomputing Centre at Forschungszentrum Juelich.

It includes, or is going to include, several load-balancing schemes. The following list gives an overview of these schemes with short explanations ([o]: scheme fully included, [w]: implementation ongoing, [p]: implementation planned):

  • [o] Tensor-Product:
    The work on all processes is reduced over the Cartesian planes of the system. This work is then equalized by adjusting the borders of the Cartesian planes.

  • [o] Staggered-grid:
    A 3-step hierarchical approach is applied, where: (i) the work over the Cartesian planes is reduced, before the borders of these planes are adjusted; (ii) in each of the Cartesian planes the work is reduced for each Cartesian column; these columns are then adjusted relative to each other to homogenize the work in each column; (iii) the work between neighboring domains in each column is adjusted. Each adjustment is done locally with the neighboring planes, columns or domains by adjusting the adjacent boundaries.

  • [w] Topological Mesh:
    In contrast to the previous methods, this method adjusts domains not by moving boundaries but by moving vertices, i.e. the corner points of domains. For each vertex a force, based on the differences in work of the neighboring domains, is computed, and the vertex is shifted in such a way as to equalize the work between these neighboring domains.

  • [w] Voronoi Mesh:
    Similar to the topological mesh method, this method computes a force based on work differences. In contrast to the topological mesh method, the force acts on a Voronoi point rather than on a vertex, i.e. on a point defining a Voronoi cell, which describes the domain. Consequently, the number of neighbors is not a conserved quantity, i.e. the topology may change over time. ALL uses the Voro++ library, published by the Lawrence Berkeley National Laboratory, to generate the Voronoi mesh.

  • [o] Histogram-based Staggered-grid: Resulting in the same grid as the staggered-grid scheme, this scheme uses the cumulative work function in each of the three Cartesian directions to generate the grid. Using histograms and the previously defined distribution of process domains in a Cartesian grid, this scheme generates a staggered-grid result in three steps, in which the work is distributed as evenly as the resolution of the underlying histogram allows. In contrast to the previously mentioned schemes, this scheme depends on a global exchange of work between processes.

  • [p] Orthogonal Recursive Bisection: Comparable to the Histogram-based Staggered-grid scheme, this scheme uses cumulative work functions evaluated by histograms to find a new distribution of the workload in a hierarchical manner. In contrast to the Histogram-based Staggered-grid scheme, the subdivision of domains is not based on a Cartesian division of processes, but on the prime factorization of the total number of processes. During each step, the current slice of the system is divided along its longest edge into smaller sub-slices that contain roughly the same workload. The number of sub-slices is then determined by the corresponding prime number of that step, i.e. when trying to establish a distribution for 24 (3 * 2 * 2 * 2) processes, the bisection would use four steps. In the first step the system would be subdivided into 3 subdomains along the longest edge, each containing roughly the same workload. Each of these subdomains is then divided into two sub-subdomains independently. This procedure is repeated for all primes in the prime factorization, as sketched below.
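
    A minimal sketch of how the per-step sub-slice counts could be derived (illustrative
    only, since this scheme is still planned and not yet part of the library):

        #include <algorithm>
        #include <vector>

        // Prime factors of the process count in descending order; each factor
        // gives the number of sub-slices created in the corresponding ORB step.
        std::vector<int> prime_factors(int n)
        {
            std::vector<int> factors;
            for (int p = 2; p * p <= n; ++p)
                while (n % p == 0) { factors.push_back(p); n /= p; }
            if (n > 1) factors.push_back(n);
            std::sort(factors.rbegin(), factors.rend());
            return factors; // e.g. 24 -> {3, 2, 2, 2}: four ORB steps
        }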

Installation & Requirements:

Base requirements:
    C++11
    MPI
    CMake 3.14+

Optional requirements:
    Fortran 2003
    Fortran 2008 (if the more modern Fortran MPI interface is to be used)
    VTK 7.1+ (for VTK-based output of the domains)
    Boost testing utilities
    Doxygen
    Sphinx (with breathe extension)

Installation:

    1.) Either clone the library from 
            https://gitlab.version.fz-juelich.de/SLMS/loadbalancing
        or download it from the same location into a directory on
        your system ($ALL_ROOT_DIR).

    2.) Create a build directory $ALL_BUILD_DIR in $ALL_ROOT_DIR, change into $ALL_BUILD_DIR and call
            cmake $ALL_ROOT_DIR <options>
        which sets up the installation. There are some optional features
        you can enable with the following options:
            -DCMAKE_INSTALL_PREFIX=<$ALL_INSTALL_DIR> [default: depends on system]
                sets the directory $ALL_INSTALL_DIR into which "make install" copies the compiled library
                and examples
            -DCM_ALL_VTK_OUTPUT=ON/OFF [default: OFF]
                enables/disables the VTK based output of the domains (requires VTK 7.1+)
            -DCM_ALL_FORTRAN=ON/OFF [default: OFF]
                compiles the Fortran interface and example
            -DCM_ALL_VORONOI=ON/OFF [default: OFF]
                activates the compilation of the Voronoi-mesh scheme (and the compilation of Voro++)
            -DCM_ALL_USE_F08=ON/OFF [default: OFF]
                activates the Fortran 2008 MPI interface
            -DCM_ALL_TESTING=ON/OFF [default: OFF]
                activates unit tests, which can be run with make test

    3.) Execute make to compile and install the library to the previously set
        directory:
            make
            make install

After "make install" the compiled library and the compiled examples are located in the directory $ALL_INSTALL_DIR.

Usage:

ALL uses C++ template programming to deal with different data types that describe domain
boundaries and domain-based work. In order to encapsulate the data, ALL uses a class in which
the required data and the computed results of a load-balancing routine are stored and from
which they can be accessed. To include the library in an existing code, the following steps
are required:

1.) Create an object of the load-balancing class:
   
        ALL<T,W> ()
        ALL<T,W> ( int dimension, T gamma )  
        ALL<T,W> ( int dimension, std::vector<ALL_Point<T>> vertices, T gamma )

    As mentioned before, the library uses template programming, where T is the data type used
    to describe boundaries between domains and vertices (usually float or double) and W is the
    data type used to describe the work load of a process (usually float or double).
    The first version of the constructor creates a base object of the load-balancing class that
    contains no data, using a three-dimensional system. The second constructor sets the system
    dimension (currently only three-dimensional systems are supported) and the relaxation
    parameter gamma, which controls the convergence of the load-balancing methods. In the third
    version of the constructor a set of vertices describing the local domain is passed directly,
    using the ALL_Point class described below (a minimal sketch follows at the end of this step).

        ALL_Point<T>( const int dimension )
        ALL_Point<T>( const int dimension, const T* values )
        ALL_Point<T>( const std::vector<T>& values )

        T& ALL_Point<T>::operator[]( const int index )

    ALL_Point is a class describing a point in space, where the dimension of the space is given
    by the input parameter. It can be initialized with either an array of data type T or a
    std::vector<T>. In the latter case passing a dimension is not required, as the dimension
    of the point is derived from the length of the std::vector. For the initialization with an
    array the user has to ensure that the passed array is of sufficient length, i.e. of length
    dimension (or longer).
    To update or access an initialized ALL_Point object, the [] operator is overloaded, so the
    n-th component of an ALL_Point object "p" can be updated or accessed as p[n].
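
    A minimal sketch of this step, assuming double precision for both boundaries (T) and
    work (W); the header name ALL.hpp is an assumption, check the installed headers for
    the exact name:

        #include <vector>
        #include "ALL.hpp"

        // lower left front and upper right back corner of the local domain
        ALL_Point<double> lower(std::vector<double>{0.0, 0.0, 0.0});
        ALL_Point<double> upper(std::vector<double>{1.0, 1.0, 1.0});
        std::vector<ALL_Point<double>> vertices{lower, upper};

        // relaxation parameter gamma (illustrative value)
        double gamma = 4.0;

        // three-dimensional system, vertices passed directly
        ALL<double,double> lb(3, vertices, gamma);

        // read or update components via the overloaded [] operator
        double x_min = lower[0];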

2.) Setup basic parameters of the system:
    
    There are three input parameters that are required for the library:

        a) vertices describing the local domain
        b) work load for the local domain
        c) Cartesian MPI communicator on which the program is executed
           (the requirement of a Cartesian communicator is under review)

    These parameters can be set with the following methods:

        void ALL<T,W>::set_vertices(std::vector<ALL_Point<T>>& vertices)
        void ALL<T,W>::set_vertices(const int n, const int dimension, const T*)

        void ALL<T,W>::set_communicator(MPI_Comm comm)

        void ALL<T,W>::set_work(const W work)
   
    The vertices can be set either by using a std::vector of ALL_Point data, from which the number
    of vertices and the dimension of each point can be derived, or by passing an array of data type
    T, which requires that the number of vertices and the dimension of the vertices are passed as
    well. For the MPI communicator, the communicator used by the calling program needs to be
    passed, and for the work, a single value of data type W.
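
    Continuing the sketch from step 1 (a Cartesian communicator is assumed to have been
    created by the calling code, e.g. with MPI_Cart_create):

        #include <mpi.h>

        // Cartesian communicator, e.g. created with MPI_Cart_create;
        // MPI_COMM_WORLD is used here only as a placeholder
        MPI_Comm cart_comm = MPI_COMM_WORLD;

        // local work load, e.g. the local number of particles
        double work = 1000.0;

        lb.set_vertices(vertices);
        lb.set_communicator(cart_comm);
        lb.set_work(work);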

3.) Setting up the chosen load balancing routine

    To set up the required internal data structures a call of the following method is required:

        void ALL<T,W>::setup( short method )

        with "method" being:
        
        ALL_LB_t::TENSOR
        ALL_LB_t::STAGGERED
        ALL_LB_t::HISTOGRAM

    The keyword method selects the load-balancing strategy from the list above. The starting
    point for all methods described below is a given domain structure (e.g. the one initially
    set up by the program). In the case of TENSOR and STAGGERED, the domains need to be
    orthogonal. For these two methods, the procedure to adjust the work load between domains is
    a multi-step approach where, in each step, sets of domains are combined into super-domains,
    the work of which is mutually adjusted. A minimal call is sketched after the method
    overview below.

    Short overview of the methods:

        ALL_LB_t::TENSOR
        
            In order to equalize the load of the individual domains, the assumption is made
            that this can be achieved by equalizing the work in each Cartesian direction, i.e.
            the work of all domains having the same coordinate in a Cartesian direction is
            collected and the width of all these domains in this direction is adjusted by
            comparing this collective work with the collective work of the neighboring domains.
            This is done independently for each Cartesian direction of the system.

            Required number of vertices:
                two, one describing the lower left front point and one describing the 
                upper right back point of the domain

            Advantages:
                - topology of the system is maintained (orthogonal domains, neighbor relations)
                - no updates to the neighbor relations need to be made in the calling code
                - if a code can already deal with orthogonal domains, only small changes are
                  expected when including this strategy
            
            Disadvantages:
                - due to the comparison of collective work loads in Cartesian layers and the
                  restrictions resulting from this construction, the final result might be
                  a sub-optimal domain distribution

        ALL_LB_t::STAGGERED:

            The staggered-grid approach is a hierarchical one. In a first step the work of all
            domains sharing the same Cartesian coordinate with respect to the highest dimension
            (z in three dimensions) is collected. Then, as in the TENSOR strategy, the layer
            width in this dimension is adjusted based on a comparison of the collected work with
            the collected work of the neighboring domains in the same dimension. In a second
            step each of these planes is divided into a set of columns, in which all domains
            share the same Cartesian coordinate in the next lower dimension (y in three
            dimensions). For each of these columns the procedure described above is repeated,
            i.e. the work is collected and the width of the columns adjusted accordingly.
            Finally, in the last step, the work of each domain is compared with its direct
            neighbors and the width in the last dimension (x in three dimensions) is adjusted.
            This leads to a staggered grid that is much better suited to describe inhomogeneous
            work distributions than the TENSOR strategy.

            Required number of vertices:
                two, one describing the lower left front point and one describing the 
                upper right back point of the domain

            Advantages:
                - very good equalization results for the work load
                - maintains orthogonal domains

            Disadvantages:
                - changes topology of the domains and requires adjustment of neighbor relations
                - communication pattern in the calling code might require adjustment to deal
                  with changing neighbors

        ALL_LB_t::HISTOGRAM:

            The histogram-based method differs from the first two methods in that it is a
            global method requiring three distinct steps for a single adjustment. In each of
            these steps the following takes place: a partial histogram of the workload (e.g.
            the number of particles) along one direction is created on each domain; these
            partial histograms are supplied to the method, which computes a global histogram.
            From this global histogram a distribution function is created, which is used to
            compute the optimal (possible) width of the domains in that direction. For the
            second and third steps, the computation of the global histograms and distribution
            functions takes place in subsystems that result from the previous step.
            The result is the optimal distribution of domains achievable with the staggered-grid
            method, at the cost of a global exchange of work. Due to this global adjustment and
            the resulting need for frequent updates, the method is not well suited for highly
            dynamic systems. On the other hand, it is well suited for static problems, e.g.
            grid-based simulations.

            Note: Currently the order of dimensions is: z-y-x.

            Required number of vertices:
                two, one describing the lower left front point and one describing the 
                upper right back point of the domain

            Additional requirements:
                partial histogram created over the workload on the local domain in the direction of the current correction step

            Advantages:
                - supplies an optimal distribution of domains (restricted by the width of the
                  bins used for the histogram)
                - only three steps needed to acquire result

            Disadvantages:
                - expensive in terms of communication, due to global shifts
                - requires the preparation of a histogram and a shift of work between each of
                  the three correction steps
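
    As referenced above, the corresponding call in the running sketch is a single line:

        // prepare internal data structures for the staggered-grid scheme
        lb.setup(ALL_LB_t::STAGGERED);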

4.) Computing new boundaries / vertices

        void ALL<T,W>::balance(short method)
    
    The balance method starts the computation of the new vertices according to the chosen
    method. The chosen method is required to be the same as the one used in the call to the
    setup method. If required, new neighbor relations are computed as well.
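
    Continuing the sketch (the method must match the one passed to setup):

        // compute new vertices with the staggered-grid scheme
        lb.balance(ALL_LB_t::STAGGERED);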

 5.) Accessing the results

        std::vector<ALL_Point<T>>& ALL<T,W>::get_result_vertices()
        
        int ALL<T,W>::get_n_neighbors(short method)
        void ALL<T,W>::get_neighbors(short method, int** neighbors)

        void ALL<T,W>::get_neighbors(short method, std::vector<int>& neighbors)


    In order to access the resulting vertices and neighbors, the methods above can be used. If
    the array version of get_neighbors is used, the address of the internal neighbor array is
    returned in neighbors, which is why an int** is passed.
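
    Continuing the sketch, the results might be retrieved as follows:

        // new vertices of the local domain
        std::vector<ALL_Point<double>>& new_vertices = lb.get_result_vertices();

        // std::vector variant of get_neighbors
        std::vector<int> neighbors;
        lb.get_neighbors(ALL_LB_t::STAGGERED, neighbors);

        // array variant: the address of the internal neighbor array is
        // returned in neighbor_ranks
        int n_neighbors = lb.get_n_neighbors(ALL_LB_t::STAGGERED);
        int* neighbor_ranks = nullptr;
        lb.get_neighbors(ALL_LB_t::STAGGERED, &neighbor_ranks);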

Examples:

The distribution includes two example codes, one written in C++ and one written in Fortran. The
main goal of these examples is to show how the library can be included in a particle code.

ALL_test: 

    MPI C++ code that generates a particle distribution over the domains. At program start the
    domains form an orthogonal decomposition of the cubic 3d system, with each domain having the
    same volume. Depending on the Cartesian coordinates of a domain in this decomposition, a
    number of points is created in that domain. The points are then distributed uniformly over
    the domain. For the number of points a polynomial formula is used to create a non-uniform
    point distribution. The number of points within a domain was chosen as the estimate of the
    work load in this program. There is no particle movement in the current version of the
    example code; therefore particle communication only takes place due to the shift of domain
    borders.

    The program creates three output files in its basic version:

        minmax.dat:
            contains four columns:
                <iteration count> <W_min/W_avg> <W_max/W_avg> <(W_max-W_min)/(W_max+W_min)>
            explanation:
                In order to provide a comparable metric for the success of the load-balancing
                procedure, the minimum and maximum work loads in the system are given relative
                to the average work load of the system. The last column is an indicator of the
                work imbalance in the system; in a perfectly balanced system this value should
                be equal to zero.
        
        stddev.txt:
            contains two columns:
                <iteration count> <std_dev>
            explanation:
                The standard deviation of work load over all domains.

        domain_data.dat:
            contains two larger sets of data:
                <rank> <x_min> <x_max> <y_min> <y_max> <z_min> <z_max> <W>
            explanation:
                There are two blocks of rows of this form: the first block is the starting
                configuration of the domains, which should be a uniform grid of orthogonal
                domains; the second block is the configuration after 200 load-balancing steps.
                Each line gives the MPI rank of the domain, its extension in each Cartesian
                direction and its work load (the number of points).

    If the library is compiled with VTK support, ALL_test also creates VTK-based output of the
    domains in each iteration step. For this to work, a directory named "vtk_outline" needs to
    be created in the directory where the executable is located. The resulting output can be
    visualized with tools like ParaView or VisIt.

ALL_test_f:

    The Fortran example provides a more basic version of the test program ALL_test, as its main
    goal is to show the functionality of the Fortran interface. The code creates a basic uniform
    orthogonal domain decomposition and an inhomogeneous particle distribution over it. Only one
    load-balancing step is executed, and the program prints out the domain distribution of the
    start configuration and of the final configuration.