From 25511aa0d9167d72b65dafa66c740edf329335b7 Mon Sep 17 00:00:00 2001
From: Stephan Schulz <stephan.schulz-x2q@rub.de>
Date: Tue, 24 Nov 2020 17:10:38 +0100
Subject: [PATCH] first version of new version of README

---
 README_new.md | 396 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 396 insertions(+)
 create mode 100644 README_new.md

diff --git a/README_new.md b/README_new.md
new file mode 100644
index 0000000..6cfd91a
--- /dev/null
+++ b/README_new.md
@@ -0,0 +1,396 @@
# A Load Balancing Library (ALL)

The library aims to provide an easy way to include dynamic, domain-based
load balancing in particle-based simulation codes. The library is
developed in the Simulation Laboratory Molecular Systems of the Jülich
Supercomputing Centre at Forschungszentrum Jülich.

It includes several load balancing schemes. The following list gives an
overview of these schemes along with short explanations.

 - **Tensor product**: The work on all processes is reduced over the
   Cartesian planes in the system. This work is then equalized by
   adjusting the borders of the Cartesian planes.
 - **Staggered grid**: A 3-step hierarchical approach is applied, where:
   1. the work over the Cartesian planes is reduced, before the borders
      of these planes are adjusted;
   2. in each of the Cartesian planes the work is reduced for each
      Cartesian column. These columns are then adjusted relative to each
      other to homogenize the work in each column;
   3. the work between neighboring domains in each column is adjusted.

   Each adjustment is done locally with the neighboring planes, columns
   or domains by adjusting the adjacent boundaries.
 - **Topological mesh** (WIP): In contrast to the previous methods, this
   method adjusts domains not by moving boundaries but by moving
   vertices, i.e. corner points, of domains. For each vertex a force,
   based on the differences in work of the neighboring domains, is
   computed, and the vertex is shifted so as to equalize the work
   between these neighboring domains.
 - **Voronoi mesh** (WIP): Similar to the topological mesh method, this
   method computes a force based on work differences. In contrast to
   the topological mesh method, the force acts on a Voronoi point rather
   than a vertex, i.e. on a point defining the Voronoi cell which
   describes the domain. Consequently, the number of neighbors is not a
   conserved quantity, i.e. the topology may change over time. ALL uses
   the Voro++ library published by the Lawrence Berkeley National
   Laboratory for the generation of the Voronoi mesh.
 - **Histogram based staggered grid**: Resulting in the same kind of
   grid as the staggered grid scheme, this scheme uses the cumulative
   work function in each of the three Cartesian directions in order to
   generate the grid. Using histograms and the previously defined
   distribution of process domains in a Cartesian grid, this scheme
   generates a staggered-grid result in three steps, in which the work
   is distributed as evenly as the resolution of the underlying
   histogram allows. In contrast to the previously mentioned schemes,
   this scheme depends on a global exchange of work between processes.
 - **Orthogonal recursive bisection** (Planned): Comparable to the
   histogram based staggered grid scheme, this scheme uses cumulative
   work functions evaluated from histograms in order to find a new
   distribution of the workload in a hierarchical manner.
   In contrast to the histogram based staggered grid scheme, the
   subdivision of domains is not based on a Cartesian division of
   processes, but on the prime factorization of the total number of
   processes. During each step, the current slice of the system is
   divided along its longest edge into smaller sub-slices that contain
   roughly the same workload. The number of sub-slices is then
   determined by the corresponding prime number of that step, i.e. when
   trying to establish a distribution for 24 (3 * 2 * 2 * 2) processes,
   the bisection would use four steps. In the first step the system
   would be subdivided into 3 subdomains along the longest edge, each
   containing roughly the same workload. Each of these subdomains is
   then divided into two subdomains independently. This procedure is
   repeated for all the prime numbers in the prime factorization.


## Installation and Requirements

### Requirements
Base requirements:

 - C++11 capable compiler
 - MPI support
 - CMake v. 3.14 or higher

Optional requirements:

 - Fortran 2003 capable compiler (Fortran interface)
 - Fortran 2008 capable compiler (usage of the `mpi_f08` interface)
 - VTK 7.1 or higher (domain output)
 - Boost testing utilities
 - Doxygen and Sphinx with `breathe` (documentation)

### Installation

 1. Clone the library from
    `https://gitlab.version.fz-juelich.de/SLMS/loadbalancing` into
    `$ALL_ROOT_DIR`.
 2. Create the build directory `$ALL_BUILD_DIR` somewhere else.
 3. Call `cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"` to set up the
    installation. Compile time features can be selected from:
    - `-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR` (default: system
      dependent): Install into `$ALL_INSTALL_DIR` after compiling.
    - `-DCM_ALL_VTK_OUTPUT=ON` (default: `OFF`): Enable VTK output
      of domains. Requires VTK 7.1 or higher.
    - `-DCM_ALL_FORTRAN=ON` (default: `OFF`): Enable the Fortran
      interface. Requires a Fortran 2003 capable compiler.
    - `-DCM_ALL_USE_F08=ON` (default: `OFF`): Enable usage of the
      `mpi_f08` module for MPI. Requires a Fortran 2008 capable compiler
      and a compatible MPI installation.
    - `-DCM_ALL_VORONOI=ON` (default: `OFF`): Enable the Voronoi mesh
      method and, consequently, compilation and linkage of Voro++.
    - `-DCM_ALL_TESTING=ON` (default: `OFF`): Turn on unit and feature
      tests. They can be run from the build directory with `ctest`.
      Requires the Boost test utilities.
    - `-DCMAKE_BUILD_TYPE=Debug` (default: `Release`): Enable the
      library's internal debugging features. Using `DebugWithOpt` also
      turns on some optimizations.
    - `-DCM_ALL_DEBUG=ON` (default: `OFF`): Enable self consistency
      checks within the library itself.
    - `-DVTK_DIR=$VTK_DIR` (default: system dependent): Path to an
      explicit VTK installation. Make sure that VTK is compiled with
      the same MPI as the library and, consequently, your code.

    To use a specific compiler and Boost installation use: `CC=gcc
    CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]`.
 4. To build and install the library then run: `cmake --build
    "$ALL_BUILD_DIR"` and `cmake --install "$ALL_BUILD_DIR"`.
    Afterwards, the built examples and library files are placed in
    `$ALL_INSTALL_DIR`.


## Usage
The library is presented as a single load balancing object, `ALL::ALL`,
from `ALL.hpp`. All classes are encapsulated in the `ALL` namespace.
Simple (and not so simple) examples are available in the `/examples` sub
directory. The simple examples are also documented at length.
### ALL object
The ALL object can be constructed with

    ALL::ALL<T,W> (
        const ALL::LB_t method,
        const int dimension,
        const T gamma)

where `T` is the type of the boundaries and `W` the type of the work,
both of which are usually `float` or `double`. The load balancing
method must be one of `ALL::LB_t::TENSOR`, `...STAGGERED`,
`...UNSTRUCTURED`, `...VORONOI` or `...HISTOGRAM`, where the Voronoi
method must be enabled at compile time. There is also a second form
in which the initial domain vertices are given:

    ALL::ALL<T,W> (
        const ALL::LB_t method,
        const int dimension,
        const std::vector<ALL::Point<T>> &initialDomain,
        const T gamma)

### Point object
The domain boundaries are given and retrieved as a vector of
`ALL::Point`s. These can be constructed via

    ALL::Point<T> ( const int dimension )
    ALL::Point<T> ( const int dimension, const T *values )
    ALL::Point<T> ( const std::vector<T> &values )

and accessed via the `[]` operator, like a `std::vector`.

### Set up
A few additional parameters must be set before the domains can be
balanced. Exactly which ones depends on the method used, but in general
the following are necessary:

    void ALL::ALL<T,W>::setVertices ( std::vector<ALL::Point<T>> &vertices )
    void ALL::ALL<T,W>::setCommunicator ( MPI_Comm comm )
    void ALL::ALL<T,W>::setWork ( const W work )

If the given communicator is not a Cartesian communicator, the process
grid parameters must also be set beforehand (!) using

    void ALL::ALL<T,W>::setProcGridParams (
        const std::vector<int> &location,
        const std::vector<int> &size)

with the location of the current process and the number of processes in
each direction.

To trace the current domain more easily, an integer tag, which is used
in the domain output, can be given with

    void ALL::ALL<T,W>::setProcTag ( int tag )

and the minimal domain size in each direction can be set using

    void ALL::ALL<T,W>::setMinDomainSize (
        const std::vector<T> &minSize )

Once these parameters are set, call

    void ALL::ALL<T,W>::setup()

so the internal set up routines can run.

### Balancing
To create the new domain vertices, make sure the current vertices are
set using `setVertices` and the work is set using `setWork`, then call

    void ALL::ALL<T,W>::balance()

and retrieve the new vertices using

    std::vector<ALL::Point<T>> &ALL::ALL<T,W>::getVertices ()

or the new neighbors with

    void ALL::ALL<T,W>::getNeighbors ( std::vector<int> &neighbors )
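Putting these calls together, the following minimal sketch shows one
balancing step with the `TENSOR` method on a Cartesian communicator.
The gamma value, the domain corners and the work value are illustrative
placeholders only, not values prescribed by the library, and error
handling is omitted.

    // Minimal sketch of one balancing step (illustrative values only).
    #include <mpi.h>
    #include <vector>
    #include <ALL.hpp>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        // Create a Cartesian communicator, so no explicit call to
        // setProcGridParams is required.
        int nRanks;
        MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
        int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
        MPI_Dims_create(nRanks, 3, dims);
        MPI_Comm cartComm;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cartComm);

        // Describe the local domain by its lower left front and upper
        // right back corner (the two vertices required by TENSOR).
        std::vector<ALL::Point<double>> vertices = {
            ALL::Point<double>({0.0, 0.0, 0.0}),  // example lower corner
            ALL::Point<double>({1.0, 1.0, 1.0})}; // example upper corner

        const double gamma = 4.0;  // illustrative relaxation parameter
        ALL::ALL<double, double> lb(ALL::LB_t::TENSOR, 3, gamma);
        lb.setVertices(vertices);
        lb.setCommunicator(cartComm);
        lb.setWork(42.0);  // e.g. the local particle count as work
        lb.setup();

        // One balancing step, then retrieve the new domain vertices.
        lb.balance();
        std::vector<ALL::Point<double>> newVertices = lb.getVertices();

        MPI_Finalize();
        return 0;
    }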
### Special considerations for the Fortran interface
The Fortran interface exists in two versions. Either in the form of

    ALL_balance(all_object)

where the ALL object of type `type(ALL_t)` is given as the first
argument to each function, or in the form of more object oriented
procedures bound to the object itself, so that the alternative call
would be

    all_object%balance()

The function names themselves are similar, i.e. instead of camel case
they are all lowercase except for the leading `ALL_`, with underscores
between words. So `setProcGridParams` becomes
`ALL_set_proc_grid_params`. For more usage information please check the
Fortran example programs.

One important difference is the handling of MPI types, which changes
between the MPI Fortran 2008 interface as given by the `mpi_f08` module
and previous interfaces. In previous interfaces all MPI types are
`INTEGER`s, but with the `mpi_f08` module the communicator is of
`type(MPI_Comm)`. To allow using `ALL_set_communicator` directly with
the communicator used in the user's application, build the library with
the Fortran 2008 features enabled; the communicator is then expected to
be of this type.


### Details of the balancing methods
#### Tensor based
In order to equalize the load of the individual domains, the assumption
is made that this can be achieved by equalizing the work in each
Cartesian direction, i.e. the work of all domains sharing the same
coordinate in a Cartesian direction is collected, and the width of all
these domains in this direction is adjusted by comparing this collective
work with the collective work of the neighboring domains. This is done
independently for each Cartesian direction in the system (a schematic
sketch of such a border adjustment follows the lists below).

Required number of vertices:
 - two, one describing the lower left front point and one describing the
   upper right back point of the domain

Advantages:
 - the topology of the system is maintained (orthogonal domains,
   neighbor relations)
 - no update of the neighbor relations needs to be made in the calling
   code
 - if a code can already deal with orthogonal domains, only small
   changes are expected to be needed to include this strategy

Disadvantages:
 - due to the comparison of collective work loads in Cartesian layers
   and the restrictions resulting from this construction, the final
   result might be a sub-optimal domain distribution
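For illustration, the following sketch shows one way such a border
adjustment could look in one dimension: each interior border between two
Cartesian layers is shifted towards the layer with more collective work.
This is a schematic example only, not the update formula implemented in
ALL; the function name and the damping factor are assumptions made for
this example.

    // Schematic 1d border adjustment (not ALL's actual implementation).
    #include <algorithm>
    #include <vector>

    // borders[i] and borders[i+1] enclose layer i; work[i] is the
    // collective work of all domains in layer i.
    void adjustBorders(std::vector<double> &borders,
                       const std::vector<double> &work,
                       double damping /* e.g. 0.25 */) {
        for (std::size_t i = 1; i + 1 < borders.size(); ++i) {
            const double wLeft = work[i - 1];   // layer left of border i
            const double wRight = work[i];      // layer right of border i
            // Relative work imbalance in [-1, 1] between the two layers.
            const double imbalance = (wRight - wLeft) / (wRight + wLeft);
            const double width = std::min(borders[i] - borders[i - 1],
                                          borders[i + 1] - borders[i]);
            // Shift the border into the layer with more work, damped to
            // avoid oscillations between successive balancing steps.
            borders[i] += damping * imbalance * width;
        }
    }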
#### Staggered grid
The staggered grid approach is a hierarchical one. In a first step the
work of all domains sharing the same Cartesian coordinate with respect
to the highest dimension (z in three dimensions) is collected. Then, as
in the TENSOR strategy, the layer width in this dimension is adjusted
based on a comparison of the collected work with the collected work of
the neighboring domains in the same dimension. As a second step each of
these planes is divided into a set of columns, in which all domains
share the same Cartesian coordinate in the next lower dimension (y in
three dimensions). For each of these columns the procedure described
above is repeated, i.e. the work is collected and the width of the
columns adjusted accordingly. Finally, in the last step, the work of the
individual domains is compared to that of their direct neighbors and the
width in the last dimension (x in three dimensions) is adjusted. This
leads to a staggered grid that is much better suited to describe
inhomogeneous work distributions than the TENSOR strategy.

Required number of vertices:
 - two, one describing the lower left front point and one describing the
   upper right back point of the domain

Advantages:
 - very good equalization results for the work load
 - maintains orthogonal domains

Disadvantages:
 - changes the topology of the domains and requires adjustment of
   neighbor relations
 - the communication pattern in the calling code might require
   adjustment to deal with changing neighbors

#### Histogram based staggered grid
The histogram based method differs from the first two methods in that it
is a global method requiring three distinct steps for a single
adjustment. In each of these steps the following takes place: a partial
histogram of the workload, e.g. the number of particles, along one
direction is created on each domain; these are then supplied to the
method, which computes a global histogram. From this global histogram a
distribution function is created, which is used to compute the optimal
(possible) width of the domains in that direction (see the sketch
following this section). For the second and third steps the computation
of the global histograms and distribution functions takes place in
subsystems, namely the results of the previous step. The result is an
optimal distribution of domains according to the staggered grid method,
at the cost of a global exchange of work due to the global adjustment.
This makes the method less suited to highly dynamic systems, which
require frequent updates. On the other hand the method is well suited
for static problems, e.g. grid-based simulations.

Note: Currently the order of dimensions is z-y-x.

Required number of vertices:
 - two, one describing the lower left front point and one describing the
   upper right back point of the domain

Additional requirements:
 - a partial histogram of the workload on the local domain, created in
   the direction of the current correction step

Advantages:
 - supplies an optimal distribution of domains (restricted by the width
   of the bins used for the histogram)
 - only three steps are needed to acquire the result

Disadvantages:
 - expensive in terms of communication, due to global shifts
 - requires the preparation of a histogram and a shift of work between
   each of the three correction steps
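For illustration, the following sketch shows how a cumulative work
function derived from a global histogram can be turned into domain
borders along one direction. It is a generic example of the technique,
not ALL's internal code; the function name and signature are assumptions
made for this example.

    // Compute interior split positions from a global work histogram so
    // that each of nDomains slabs receives roughly the same work. The
    // resolution is limited by the bin width, as noted above.
    #include <vector>

    std::vector<double> splitsFromHistogram(const std::vector<double> &hist,
                                            double binWidth, int nDomains) {
        double total = 0.0;
        for (double h : hist) total += h;  // total work in the system

        std::vector<double> splits;
        const double target = total / nDomains;  // ideal work per domain
        double cumulative = 0.0;                  // cumulative work function
        int cut = 1;
        for (std::size_t b = 0; b < hist.size() && cut < nDomains; ++b) {
            cumulative += hist[b];
            // Place the next border at the end of the first bin where the
            // cumulative work function passes the next target value.
            while (cut < nDomains && cumulative >= cut * target) {
                splits.push_back((b + 1) * binWidth);
                ++cut;
            }
        }
        return splits;
    }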
## Example programs
In the `/examples` sub directory several C++ and Fortran example
programs are provided.

### `ALL_test`
An MPI C++ code that generates a particle distribution over the domains.
At program start the domains form an orthogonal decomposition of the
cubic 3d system, in which all domains have the same volume. Depending on
the Cartesian coordinates of a domain in this decomposition, a number of
points is created on that domain; a polynomial formula is used for the
number of points in order to create a non-uniform point distribution.
The points are then distributed uniformly over each domain. The number
of points within a domain is used as the estimate of its work load.
There is no particle movement included in the current version of the
example code, therefore particle communication only takes place due to
the shift of domain borders.

The program creates three output files in its basic version:

 - `minmax.dat`:
   - **Columns**: `<iteration count> <W_min/W_avg> <W_max/W_avg>
     <(W_max-W_min)/(W_max+W_min)>`
   - **Explanation**: In order to give a comparable metric for the
     success of the load balancing procedure, the relative differences
     of the minimum and maximum work loads in the system with respect to
     the average work load of the system are given. The last column is
     an indicator of the inequality of work in the system; in a
     perfectly balanced system this value should be equal to zero.

 - `stddev.txt`:
   - **Columns**: `<iteration count> <std_dev>`
   - **Explanation**: The standard deviation of the work load over all
     domains.

 - `domain_data.dat`:
   - **Columns**: `<rank> <x_min> <x_max> <y_min> <y_max> <z_min>
     <z_max> <W>`
   - **Explanation**: The file contains two blocks of rows of this form.
     The first block is the starting configuration of the domains, which
     should be a uniform grid of orthogonal domains; the second block is
     the configuration after 200 load balancing steps. Each line gives
     the MPI rank of the domain, its extension in each Cartesian
     direction, and its work load (the number of points).

If the library is compiled with VTK support, `ALL_test` also creates VTK
based output of the domains in each iteration step. This output is
placed in a `vtk_outline` sub directory, which is created if it does not
already exist. The resulting output can be visualized with tools like
ParaView or VisIt.

### `ALL_test_f` and `ALL_test_f_obj`
The Fortran examples provide a more basic version of the test program
`ALL_test`, as their main goal is to show the functionality of the
Fortran interface. The code creates a basic uniform orthogonal domain
decomposition and an inhomogeneous particle distribution over it. Only
one load balancing step is executed, and the program prints out the
domain distribution of the start configuration and of the final
configuration.

### `ALL_Staggered` and `ALL_Staggered_f`
These create a very simple load balancing situation with a fixed load
and domains placed along the z direction. The C++ and Fortran versions
are functionally the same, so the differences between the interfaces can
be checked.

<!-- vim: set tw=72 et ts=4 spell spelllang=en_us ft=markdown: -->