A Load Balancing Library (ALL)
The library aims to provide an easy way to include dynamic domain-based load balancing
into particle-based simulation codes. The library is developed in the Simulation Laboratory Molecular Systems of the Juelich Supercomputing Centre at Forschungszentrum Juelich.
It includes, or is going to include, several load-balancing schemes. The following list gives an overview of these schemes with short explanations ([o]: scheme fully included, [w]: implementation ongoing, [p]: implementation planned):
- [o] Tensor-Product:
The work on all processes is reduced over the Cartesian
planes in the system. This work is then equalized by
adjusting the borders of the Cartesian planes.
- [o] Staggered-grid:
A 3-step hierarchical approach is applied, where:
(i) work over the Cartesian planes is reduced, before the borders of these
planes are adjusted; (ii) in each of the Cartesian planes the work is
reduced for each Cartesian column. These columns are then adjusted to each
other to homogenize the work in each column; (iii) the work between
neighboring domains in each column is adjusted. Each adjustment is done
locally with the neighboring planes, columns or domains by adjusting the
adjacent boundaries.
- [w] Topological Mesh:
In contrast to the previous methods, this method adjusts
domains not by moving boundaries but by moving vertices, i.e. corner points, of
domains. For each vertex a force, based on the differences
in work of the neighboring domains, is computed and the
vertex is shifted so as to equalize the work between these
neighboring domains.
- [w] Voronoi Mesh:
Similar to the topological mesh method, this method computes a
force based on work differences. In contrast to the topological mesh
method, the force acts on a Voronoi point rather than a vertex, i.e. a
point defining a Voronoi cell, which describes the domain. Consequently,
the number of neighbors is not a conserved quantity, i.e. the topology
may change over time. ALL uses the Voro++ library published by the
Lawrence Berkeley National Laboratory for the generation of the Voronoi mesh.
- [o] Histogram-based Staggered-grid:
Resulting in the same grid as the staggered-grid scheme, this scheme uses
the cumulative work function in each of the three Cartesian directions in
order to generate this grid. Using histograms and the previously defined
distribution of process domains in a Cartesian grid, this scheme generates
in three steps a staggered-grid result in which the work is distributed as
evenly as the resolution of the underlying histogram allows. In contrast to
the previously mentioned schemes, this scheme depends on a global exchange of
work between processes.
- [p] Orthogonal Recursive Bisection:
Comparable to the Histogram-based Staggered-grid scheme, this scheme uses
cumulative work functions evaluated by histograms in order to find a new
distribution of workload in a hierarchical manner. In contrast to the
Histogram-based Staggered-grid scheme, the subdivision of domains is not
based on a Cartesian division of processes, but on the prime factorization
of the total number of processes. During each step, the current slice of the
system is distributed along its longest edge into smaller sub-slices that
roughly contain the same workload. The number of sub-slices is determined
by the corresponding prime number of that step, i.e. when trying to establish
a distribution for 24 (3 * 2 * 2 * 2) processes, the bisection would use four
steps. In the first step the system would be subdivided into 3 subdomains along
the largest edge, each containing roughly the same workload. Each of these subdomains
is then divided into two sub-subdomains independently. This procedure is repeated
for all the prime numbers in the prime factorization (see the sketch after this list).
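As a minimal sketch of how the number and order of subdivision steps can be derived, the prime factorization of the process count could be computed as follows (a hypothetical helper, not part of the library):

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Hypothetical helper: returns the prime factors of the process
    // count in descending order, e.g. 24 -> {3, 2, 2, 2}. Each factor
    // is the number of sub-slices created in one bisection step.
    std::vector<int> prime_factors(int nprocs)
    {
        std::vector<int> factors;
        for (int p = 2; p * p <= nprocs; ++p)
            while (nprocs % p == 0) { factors.push_back(p); nprocs /= p; }
        if (nprocs > 1) factors.push_back(nprocs);
        // largest factor first, matching the 24 = 3 * 2 * 2 * 2 example above
        std::sort(factors.begin(), factors.end(), std::greater<int>());
        return factors;
    }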
Installation & Requirements:
Base requirements:
C++11
MPI
CMake 3.14+
Optional requirements:
Fortran 2003
Fortran 2008 (if the more modern Fortran MPI interface is to be used)
VTK 7.1+ (for VTK-based output of the domains)
Boost testing utilities
Doxygen
Sphinx (with breathe extension)
Installation:
1.) Either clone the library from
https://gitlab.version.fz-juelich.de/SLMS/loadbalancing
or download it from the same location into a directory on
your system ($ALL_ROOT_DIR).
2.) Create a build-directory $ALL_BUILD_DIR in $ALL_ROOT_DIR, change into $ALL_BUILD_DIR and call
cmake $ALL_ROOT_DIR <options>
which sets up the installation. There are some optional features
that can be enabled with the following options (a combined example follows the list):
-DCMAKE_INSTALL_PREFIX=<$ALL_INSTALL_DIR> [default: depends on system]
sets the directory $ALL_INSTALL_DIR, into which "make install" copies the compiled library
and examples
-DCM_ALL_VTK_OUTPUT=ON/OFF [default: OFF]
enables/disables the VTK based output of the domains (requires VTK 7.1+)
-DCM_ALL_FORTRAN=ON/OFF [default: OFF]
compiles the Fortran interface and example
-DCM_ALL_VORONOI=ON/OFF [default: OFF]
activates the compilation of the Voronoi-mesh scheme (and the compilation of Voro++)
-DCM_ALL_USE_F08=ON/OFF [default: OFF]
activates the Fortran 2008 MPI interface
-DCM_ALL_TESTING=ON/OFF [default: OFF]
activates unit tests, which can be run with make test
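For example, a configuration that installs to a custom prefix and enables the Fortran interface, VTK output and the unit tests could look like this (paths are placeholders):
    cmake $ALL_ROOT_DIR -DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR \
          -DCM_ALL_FORTRAN=ON -DCM_ALL_VTK_OUTPUT=ON -DCM_ALL_TESTING=ON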
3.) Execute make to compile and install the library to the previously set
directory:
make
make install
After "make install" the compiled library and the compiled examples are located in the directory $ALL_INSTALL_DIR.
Usage:
ALL uses C++ template programming to deal with different data types that describe domain
boundaries and domain-based work. In order to encapsulate the data, ALL uses a class in which
the required data and the computed results of a load-balancing routine are stored and from
which they can be accessed. To include the library into an existing code, the following steps are needed:
1.) Create an object of the load-balancing class:
ALL<T,W> ()
ALL<T,W> ( int dimension, T gamma )
ALL<T,W> ( int dimension, std::vector<ALL_Point<T>> vertices, T gamma )
As mentioned before, the library uses template programming, where T is the data type used
to describe boundaries between domains and vertices (usually float or double) and W is the
data type used to describe the work load of a process (usually float or double).
The first version of the constructor defines a base object of the load-balancing class that
contains no data, using a three-dimensional system. The second constructor sets up the
system dimension (currently only three-dimensional systems are supported) and the relaxation
parameter gamma, which controls the convergence of the load-balancing methods. In the third
version of the constructor a set of vertices describing the local domain is passed directly, using the
ALL_Point class described below.
ALL_Point<T>( const int dimension )
ALL_Point<T>( const int dimension, const T* values )
ALL_Point<T>( const std::vector<T>& values )
T& ALL_Point<T>::operator[]( const int index )
ALL_Point is a class describing a point in space, where the dimension of the space is given
by the input parameter. It can be initialized by using either an array of data type T or a
std::vector<T>. In the latter case passing a dimension is not required, as the dimension
of the point is derived from the length of the std::vector. For the initialization with an
array the user has to ensure that the passed array is of sufficient length, i.e. of length
dimension (or longer).
To update or access initialized ALL_Point objects, the [] operator is overloaded, and the n-th
component of an ALL_Point object p can be updated or accessed as p[n].
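A minimal sketch combining the constructors above; the header name "ALL.hpp" is an assumption (check your installation), and the gamma value is an arbitrary placeholder:

    #include <vector>
    #include "ALL.hpp"   // assumed header name

    void create_balancer_sketch()
    {
        // two vertices describing an orthogonal domain: the lower left
        // front corner and the upper right back corner
        std::vector<ALL_Point<double>> vertices{
            ALL_Point<double>(std::vector<double>{0.0, 0.0, 0.0}),
            ALL_Point<double>(std::vector<double>{1.0, 1.0, 1.0})
        };

        // T = double (boundaries/vertices), W = double (work load);
        // gamma is the relaxation parameter controlling convergence
        double gamma = 16.0;
        ALL<double,double> lb(3, vertices, gamma);

        // the n-th component of an ALL_Point can be read or updated
        // through the overloaded [] operator
        double z_max = vertices[1][2];
    }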
2.) Setup basic parameters of the system:
There are three input parameters that are required for the library:
a) vertices describing the local domain
b) work load for the local domain
c) Cartesian MPI communicator on which the program is executed
(the requirement of a Cartesian communicator is under review)
These parameters can be set with the following methods:
void ALL<T,W>::set_vertices(std::vector<ALL_Point<T>>& vertices)
void ALL<T,W>::set_vertices(const int n, const int dimension, const T*)
void ALL<T,W>::set_communicator(MPI_Comm comm)
void ALL<T,W>::set_work(const W work)
The vertices can be set by either using a std::vector of ALL_Point data, from which the number
of vertices and the dimension of each point can be derived, or by passing an array of data type
T, which requires that the number of vertices and the dimension of the vertices are also passed.
For the communicator, the MPI communicator used by the calling program needs to be passed;
for the work, a single value of data type W.
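A sketch continuing the example from step 1 (comm_cart is assumed to be a Cartesian communicator created beforehand, e.g. with MPI_Cart_create; the work value is a placeholder for e.g. a local particle count):

    #include <mpi.h>
    // the ALL header from step 1 is assumed to be included as well

    void supply_input_sketch(ALL<double,double>& lb,
                             std::vector<ALL_Point<double>>& vertices,
                             MPI_Comm comm_cart)
    {
        // a) vertices describing the local domain
        lb.set_vertices(vertices);

        // b) work load of the local domain
        double local_work = 1024.0;
        lb.set_work(local_work);

        // c) Cartesian MPI communicator the program runs on
        lb.set_communicator(comm_cart);
    }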
3.) Setting up the chosen load balancing routine
To set up the required internal data structures a call of the following method is required:
void ALL<T,W>::setup( short method )
with "method" being:
ALL_LB_t::TENSOR
ALL_LB_t::STAGGERED
ALL_LB_t::HISTOGRAM
With the keyword method the load-balancing strategy is chosen from the list above. The starting
point for all methods described below is a given domain structure (e.g. the one which is initially
set up by the program). In the case of TENSOR and STAGGERED, the domains need to be orthogonal.
For these two methods, the procedure to adjust the work load between domains is a multi-step
approach where in each step sets of domains are combined into super-domains, the work of which
is mutually adjusted.
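Continuing the sketch from the previous steps, preparing e.g. the staggered-grid scheme is then a single call:

    lb.setup(ALL_LB_t::STAGGERED);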
Short overview of the methods:
ALL_LB_t::TENSOR
In order to equalize the load of individual domains, the assumption is made that this
can be achieved by equalizing the work in each Cartesian direction, i.e. the work
of all domains having the same coordinate in a Cartesian direction is collected and
the width of all these domains in this direction is adjusted by comparing this
collective work with the collective work of the neighboring domains. This is done
independently for each Cartesian direction in the system.
Required number of vertices:
two, one describing the lower left front point and one describing the
upper right back point of the domain
Advantages:
- topology of the system is maintained (orthogonal domains, neighbor relations)
- no updates to the neighbor relations need to be made in the calling code
- if a code is already able to deal with orthogonal domains, only small changes are
expected when including this strategy
Disadvantages:
- due to the comparison of collective work loads in Cartesian layers and the
restrictions resulting from this construction, the final result may be a
sub-optimal domain distribution
ALL_LB_t::STAGGERED:
The staggered-grid approach is a hierarchical one. In a first step the work of all
domains sharing the same Cartesian coordinate with respect to the highest dimension
(z in three dimensions) is collected. Then, as in the TENSOR strategy, the layer width in
this dimension is adjusted based on a comparison of the collected work with the collected
work of the neighboring domains in the same dimension. As a second step each of these planes
is divided into a set of columns, where all domains share the same Cartesian coordinate
in the next lower dimension (y in three dimensions). For each of these columns the
procedure described above is repeated, i.e. work is collected and the width of the
columns adjusted accordingly. Finally, in the last step the work of individual domains
is compared to direct neighbors and the width in the last dimension (x in three
dimensions) is adjusted. This leads to a staggered grid that is much better suited to
describe inhomogeneous work distributions than the TENSOR strategy.
Required number of vertices:
two, one describing the lower left front point and one describing the
upper right back point of the domain
Advantages:
- very good equalization results for the work load
- maintains orthogonal domains
Disadvantages:
- changes topology of the domains and requires adjustment of neighbor relations
- communication pattern in the calling code might require adjustment to deal
with changing neighbors
ALL_LB_t::HISTOGRAM:
The histogram-based method differs from the first two methods in that it is a
global method which requires three distinct steps for a single adjustment. In each of these steps
the following takes place: a partial histogram of the workload, e.g. the number of particles,
is created along one direction on each domain; these are then supplied to the method, where a global histogram
is computed. From this global histogram a distribution function is created, which is used to compute the optimal
(possible) width of the domains in that direction. For the second and third steps the computation of the global
histograms and distribution functions takes place in subsystems resulting from the previous step.
The result is an optimal distribution of domains in the sense of the staggered-grid method, at the cost of a
global exchange of work. This global adjustment, and the frequent updates it implies, makes the method less
well suited to highly dynamic systems. On the other hand the method is well suited for static problems,
e.g. grid-based simulations.
Note: Currently the order of dimensions is: z-y-x.
Required number of vertices:
two, one describing the lower left front point and one describing the
upper right back point of the domain
Additional requirements:
partial histogram created over the workload on the local domain in the direction of the current correction step
Advantages:
- supplies an optimal distribution of domains (restricted by the bin width of the histogram)
- only three steps are needed to acquire the result
Disadvantages:
- expensive in terms of communication, due to global shifts
- requires preparation of a histogram and a shift of work between each of the three correction steps
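How the partial histogram is passed to the library is not shown in this overview; the sketch below only illustrates how such a histogram could be built from the local particle coordinates along one direction (all names and the binning scheme are assumptions):

    #include <vector>

    // count local particles into n_bins bins of width bin_width along
    // one direction (z for the first correction step, see the note
    // above); sys_min is the lower system boundary in that direction
    std::vector<int> partial_histogram(const std::vector<double>& coords,
                                       double sys_min, double bin_width,
                                       int n_bins)
    {
        std::vector<int> histogram(n_bins, 0);
        for (double c : coords)
        {
            int bin = static_cast<int>((c - sys_min) / bin_width);
            if (bin >= 0 && bin < n_bins) ++histogram[bin];
        }
        return histogram;
    }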
4.) Computing new boundaries / vertices
void ALL<T,W>::balance(short method)
The balance method starts the computation of the new vertices, according to the chosen method.
The chosen method is required to be the same as the one used in the call to the setup method. If
required, new neighbor relations are computed as well.
5.) Accessing the results
std::vector<ALL_Point<T>>& ALL<T,W>::get_result_vertices()
int ALL<T,W>::get_n_neighbors(short method)
void ALL<T,W>::get_neighbors(short method, int** neighbors)
void ALL<T,W>::get_neighbors(short method, std::vector<int>& neighbors)
In order to access the resulting vertices and neighbors, the above three methods can be used. If the
array version of get_neighbors is used, the address of the neighbor array is returned in
neighbors, which is why an int** is passed.
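Putting steps 3 to 5 together, one balancing step might look like the following sketch; it uses only the calls listed above, and the method passed to balance must match the one given to setup:

    void balance_sketch(ALL<double,double>& lb)
    {
        // compute new vertices with the scheme chosen in setup()
        lb.balance(ALL_LB_t::STAGGERED);

        // retrieve the updated domain vertices ...
        std::vector<ALL_Point<double>>& new_vertices =
            lb.get_result_vertices();

        // ... and the new neighbor relations (std::vector version)
        std::vector<int> neighbors;
        lb.get_neighbors(ALL_LB_t::STAGGERED, neighbors);
        int n_neighbors = lb.get_n_neighbors(ALL_LB_t::STAGGERED);
    }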
Examples:
Two example codes are included in the distribution, one written in C++ and one written in Fortran. The main
goal of these examples is to show how the library can be included into a particle code.
ALL_test:
MPI C++ code that generates a particle distribution over the domains. At program start the domains form
an orthogonal decomposition of a cubic 3d system, in which all domains have the same volume.
Depending on the Cartesian coordinates of a domain in this decomposition, a number of points is created
on each domain. The points are then distributed uniformly over the domain. For the number of points a
polynomial formula is used, creating a non-uniform point distribution. The number of points within a
domain serves as the estimate of its work load. There is no particle movement included
in the current version of the example code, therefore particle communication only takes place due to the
shift of domain borders.
The program creates three output files in its basic version:
minmax.dat:
contains four columns:
<iteration count> <W_min/W_avg> <W_max/W_avg> <(W_max-W_min)/(W_max+W_min)>
explanation:
In order to give a comparable metric for the success of the load-balancing procedure,
the minimum and maximum work loads in the system are given relative to the average
work load of the system. The last column is an indicator for the inequality of work
in the system; in a perfectly balanced system this value is equal to zero.
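A sketch of how these metrics can be computed across ranks (a hypothetical illustration, not necessarily the code ALL_test actually uses):

    #include <mpi.h>

    // reduce the local work to global min/max/average and form the
    // imbalance metrics written to minmax.dat
    void imbalance_metrics(double local_work, MPI_Comm comm)
    {
        double w_min, w_max, w_sum;
        int nranks;
        MPI_Comm_size(comm, &nranks);
        MPI_Allreduce(&local_work, &w_min, 1, MPI_DOUBLE, MPI_MIN, comm);
        MPI_Allreduce(&local_work, &w_max, 1, MPI_DOUBLE, MPI_MAX, comm);
        MPI_Allreduce(&local_work, &w_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        double w_avg = w_sum / nranks;
        // last column of minmax.dat: 0 for a perfectly balanced system
        double imbalance = (w_max - w_min) / (w_max + w_min);
    }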
stddev.txt:
contains two columns:
<iteration count> <std_dev>
explanation:
The standard deviation of work load over all domains.
domain_data.dat:
contains two larger sets of data:
<rank> <x_min> <x_max> <y_min> <y_max> <z_min> <z_max> <W>
explanation:
There are two such blocks of rows: the first block is the starting configuration
of the domains, which should be a uniform grid of orthogonal domains; the second block is the
configuration after 200 load-balancing steps. Each line gives the MPI rank of the domain, its
extension in each Cartesian direction and its work load (the number of points).
If the library is compiled with VTK support, ALL_test also creates a VTK-based output of the domains in each
iteration step. For this to work, a directory named "vtk_outline" needs to exist in the directory where
the executable is located. The resulting output can be visualized with tools like ParaView or VisIt.
ALL_test_f:
The Fortran example provides a more basic version of the test program ALL_test, as its main goal is to show
the functionality of the Fortran interface. The code creates a basic uniform orthogonal domain decomposition
and creates an inhomogeneous particle distribution over these domains. Only one load-balancing step is executed, and
the program prints out the domain distribution of the start configuration and of the final configuration.