From 25511aa0d9167d72b65dafa66c740edf329335b7 Mon Sep 17 00:00:00 2001
From: Stephan Schulz <stephan.schulz-x2q@rub.de>
Date: Tue, 24 Nov 2020 17:10:38 +0100
Subject: [PATCH] first version of new version of README

---
 README_new.md | 396 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 396 insertions(+)
 create mode 100644 README_new.md

diff --git a/README_new.md b/README_new.md
new file mode 100644
index 0000000..6cfd91a
--- /dev/null
+++ b/README_new.md
@@ -0,0 +1,396 @@
+# A Load Balancing Library (ALL)
+
+The library aims to provide an easy way to include dynamic,
+domain-based load balancing in particle-based simulation codes. The
+library is developed in the Simulation Laboratory Molecular Systems of
+the Jülich Supercomputing Centre at Forschungszentrum Jülich.
+
+It includes several load-balancing schemes. The following list gives an
+overview of these schemes with short explanations.
+
+ - **Tensor product**: The work on all processes is reduced over the
+   Cartesian planes in the system. This work is then equalized by
+   adjusting the borders of the Cartesian planes.
+ - **Staggered grid**: A 3-step hierarchical approach is applied, where:
+   1. the work over the Cartesian planes is reduced, before the borders
+      of these planes are adjusted
+   2. in each of the Cartesian planes the work is reduced for each
+      Cartesian column. These columns are then adjusted relative to
+      each other to homogenize the work in each column
+   3. the work between neighboring domains in each column is adjusted.
+      Each adjustment is done locally with the neighboring planes,
+      columns or domains by adjusting the adjacent boundaries.
+ - **Topological mesh** (WIP): In contrast to the previous methods,
+   this method adjusts domains not by moving boundaries but by moving
+   vertices, i.e. corner points, of domains. For each vertex a force,
+   based on the differences in work of the neighboring domains, is
+   computed, and the vertex is shifted so as to equalize the work
+   between these neighboring domains.
+ - **Voronoi mesh** (WIP): Similar to the topological mesh method, this
+   method computes a force based on work differences. In contrast to
+   the topological mesh method, the force acts on a Voronoi point
+   rather than a vertex, i.e. on the point defining the Voronoi cell
+   that describes the domain. Consequently, the number of neighbors is
+   not a conserved quantity, i.e. the topology may change over time.
+   ALL uses the Voro++ library, published by the Lawrence Berkeley
+   National Laboratory, to generate the Voronoi mesh.
+ - **Histogram based staggered grid**: Resulting in the same grid as
+   the staggered-grid scheme, this scheme uses the cumulative work
+   function in each of the three Cartesian directions to generate the
+   grid. Using histograms and the previously defined distribution of
+   process domains in a Cartesian grid, this scheme generates a
+   staggered-grid result in three steps, in which the work is
+   distributed as evenly as the resolution of the underlying histogram
+   allows. In contrast to the previously mentioned schemes, this scheme
+   depends on a global exchange of work between processes.
+ - **Orthogonal recursive bisection** (Planned): Comparable to the
+   histogram-based staggered-grid scheme, this scheme uses cumulative
+   work functions evaluated from histograms to find a new distribution
+   of the workload in a hierarchical manner. In contrast to the
+   histogram-based staggered-grid scheme, the subdivision of domains is
+   not based on a Cartesian division of processes, but on the prime
+   factorization of the total number of processes. During each step,
+   the current slice of the system is divided along its longest edge
+   into smaller sub-slices that roughly contain the same workload. The
+   number of sub-slices is then determined by the corresponding prime
+   number of that step, i.e. when trying to establish a distribution
+   for 24 (3 * 2 * 2 * 2) processes, the bisection would use four
+   steps. In the first step the system would be subdivided into 3
+   subdomains along the largest edge, each containing roughly the same
+   workload. Each of these subdomains is then divided into two
+   sub-subdomains independently. This procedure is then repeated for
+   all the prime numbers in the prime factorization.
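+
+The number of sub-slices per bisection step can be derived from the
+prime factorization of the process count. As a stand-alone illustration
+(the helper name is hypothetical, and this is not code from the
+library), the slice counts for the example above can be computed as:
+
+```cpp
+#include <algorithm>
+#include <vector>
+
+// Hypothetical helper: decompose the number of processes into prime
+// factors, largest first. Each factor is the number of sub-slices
+// created in the corresponding bisection step.
+std::vector<int> sliceCountsPerStep(int nProcs) {
+    std::vector<int> factors;
+    for (int p = 2; p * p <= nProcs; ++p) {
+        while (nProcs % p == 0) {
+            factors.push_back(p);
+            nProcs /= p;
+        }
+    }
+    if (nProcs > 1) factors.push_back(nProcs);
+    // Largest factor first, matching the example above:
+    // 24 -> 3 * 2 * 2 * 2, i.e. four steps.
+    std::sort(factors.rbegin(), factors.rend());
+    return factors;
+}
+```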
+
+
+## Installation and Requirements
+
+### Requirements
+Base requirements:
+
+ - C++11 capable compiler
+ - MPI support
+ - CMake v. 3.14 or higher
+
+Optional requirements:
+
+ - Fortran 2003 capable compiler (Fortran interface)
+ - Fortran 2008 capable compiler (Usage of the `mpi_f08` interface)
+ - VTK 7.1 or higher (Domain output)
+ - Boost testing utilities
+ - Doxygen and Sphinx with `breathe` (Documentation)
+
+### Installation
+
+ 1. Clone the library from
+    `https://gitlab.version.fz-juelich.de/SLMS/loadbalancing` into
+    `$ALL_ROOT_DIR`.
+ 2. Create the build directory `$ALL_BUILD_DIR` somewhere else.
+ 3. Call `cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"` to set up the
+    installation. Compile time features can be selected from:
+     - `-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR` (default: system
+       dependent): Install into `$ALL_INSTALL_DIR` after compiling.
+     - `-DCM_ALL_VTK_OUTPUT=ON` (default: `OFF`): Enable VTK output
+       of domains. Requires VTK 7.1 or higher.
+     - `-DCM_ALL_FORTRAN=ON` (default: `OFF`): Enable Fortran interface.
+       Requires Fortran 2003 capable compiler.
+     - `-DCM_ALL_USE_F08=ON` (default: `OFF`): Enable usage of `mpi_f08`
+       module for MPI. Requires Fortran 2008 capable compiler and
+       compatible MPI installation.
+     - `-DCM_ALL_VORONOI=ON` (default: `OFF`): Enable the Voronoi mesh
+       method and, subsequently, the compilation and linking of Voro++.
+     - `-DCM_ALL_TESTING=ON` (default: `OFF`): Turns on unit and feature
+       tests. Can be run from the build directory with `ctest`. Requires
+       the Boost test utilities.
+     - `-DCMAKE_BUILD_TYPE=Debug` (default: `Release`): Enable library
+       internal debugging features. Using `DebugWithOpt` also turns on
+       some optimizations.
+     - `-DCM_ALL_DEBUG=ON` (default: `OFF`): Enable self consistency
+       checks within the library itself.
+     - `-DVTK_DIR=$VTK_DIR` (default: system dependent): Path to an
+       explicit VTK installation. Make sure that VTK is compiled with
+       the same MPI as the library and subsequently your code.
+    To use a specific compiler and Boost installation use: `CC=gcc
+    CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]`.
+ 4. To build and install the library, run `cmake --build
+    "$ALL_BUILD_DIR"` and `cmake --install "$ALL_BUILD_DIR"`.
+    Afterwards, the built examples and library files are placed in
+    `$ALL_INSTALL_DIR`.
+
+
+## Usage
+The library provides a single load balancing object, `ALL::ALL`,
+defined in `ALL.hpp`. All classes are encapsulated in the `ALL`
+namespace. Simple (and not so simple) examples are available in the
+`/examples` sub directory. The simple examples are also documented at
+length.
+
+### ALL object
+The ALL object can be constructed with
+
+    ALL::ALL<T,W> (
+                   const ALL::LB_t method,
+                   const int dimension,
+                   const T gamma)
+
+where the type of the boundaries is given by `T` and the type of the
+work by `W`; both are usually `float` or `double`. The load balancing
+method must be one of `ALL::LB_t::TENSOR`, `...STAGGERED`,
+`...UNSTRUCTURED`, `...VORONOI`, or `...HISTOGRAM`, where the Voronoi
+method must be enabled at compile time. There is also a second form
+in which the initial domain vertices are given:
+
+    ALL::ALL<T,W> (
+                   const ALL::LB_t method,
+                   const int dimension,
+                   const std::vector<ALL::Point<T>> &initialDomain,
+                   const T gamma)
+
+### Point object
+The domain boundaries are given and retrieved as a vector of
+`ALL::Point`s. These can be constructed via
+
+    ALL::Point<T> ( const int dimension )
+    ALL::Point<T> ( const int dimension, const T *values )
+    ALL::Point<T> ( const std::vector<T> &values )
+
+and accessed via the `[]` operator, like a `std::vector`.
+
+### Set up
+A few additional parameters must be set before the domains can be
+balanced. Exactly which are dependent on the method used, but in general
+the following are necessary
+
+    void ALL::ALL<T,W>::setVertices ( std::vector<ALL::Point<T>> &vertices )
+    void ALL::ALL<T,W>::setCommunicator ( MPI_Comm comm )
+    void ALL::ALL<T,W>::setWork ( const W work )
+
+If the given communicator is not a Cartesian communicator, the process
+grid parameters must also be set beforehand (!) using
+
+    void ALL::ALL<T,W>::setProcGridParams (
+            const std::vector<int> &location,
+            const std::vector<int> &size)
+
+with the location of the current process and the number of processes in
+each direction.
+
+To trace the current domain more easily, an integer tag can be given,
+which is used in the domain output, with
+
+    void ALL::ALL<T,W>::setProcTag ( int tag )
+
+and the minimal domain size in each direction can be set using
+
+    void ALL::ALL<T,W>::setMinDomainSize (
+            const std::vector<T> &minSize )
+ 
+Once these parameters are set call
+
+    void ALL::ALL<T,W>::setup()
+
+so that the internal setup routines can run.
+
+### Balancing
+To create the new domain vertices make sure the current vertices are set
+using `setVertices` and the work is set using `setWork` then call
+
+    void ALL::ALL<T,W>::balance()
+
+and then retrieve the new vertices using
+
+    std::vector<ALL::Point<T>> &ALL::ALL<T,W>::getVertices ()
+
+or new neighbors with
+
+    void ALL::ALL<T,W>::getNeighbors ( std::vector<int> &neighbors )
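+
+Putting the pieces together, a typical balancing cycle might look as
+follows. This is a sketch based on the signatures above; the value of
+`gamma`, the error handling, the particle migration after balancing,
+and the exact work measure are application specific:
+
+```cpp
+#include <ALL.hpp>
+#include <mpi.h>
+#include <vector>
+
+void rebalance(MPI_Comm cartComm,
+               std::vector<ALL::Point<double>> &vertices,
+               double localWork) {
+    // Tensor method in 3 dimensions; 4.0 is an example value for gamma.
+    ALL::ALL<double, double> balancer(ALL::LB_t::TENSOR, 3, 4.0);
+    balancer.setCommunicator(cartComm);   // Cartesian communicator
+    balancer.setVertices(vertices);       // current domain corners
+    balancer.setWork(localWork);          // e.g. local particle count
+    balancer.setup();                     // internal setup routines
+    balancer.balance();                   // compute new domain borders
+    vertices = balancer.getVertices();    // retrieve the new corners
+    // The calling code must now migrate particles that left the domain.
+}
+```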
+
+### Special considerations for the Fortran interface
+The Fortran interface exists in two versions: either in the form of
+
+    ALL_balance(all_object)
+
+where the ALL object of type `type(ALL_t)` is given as the first
+argument to each function, or as type-bound procedures on the object
+itself, in a more object-oriented style. The alternative call would be
+
+    all_object%balance()
+
+The function names themselves are similar, i.e. instead of camel case
+they are all lowercase with underscores between words, prefixed with
+`ALL_`. So `setProcGridParams` becomes `ALL_set_proc_grid_params`. For
+more usage information, please check the Fortran example programs.
+
+One important difference is the handling of MPI types, which changed
+with the MPI Fortran 2008 interface as provided by the `mpi_f08`
+module. In earlier interfaces all MPI types are `INTEGER`s, but with
+the `mpi_f08` module the communicator is of `type(MPI_Comm)`. To use
+`ALL_set_communicator` directly with the communicator of the user's
+application, build the library with the Fortran 2008 features enabled;
+the communicator argument is then expected to be of `type(MPI_Comm)`.
+
+
+### Details of the balancing methods
+#### Tensor based
+To equalize the load of the individual domains, the assumption is made
+that this can be achieved by equalizing the work in each Cartesian
+direction: the work of all domains sharing the same coordinate in a
+Cartesian direction is collected, and the width of all these domains in
+this direction is adjusted by comparing this collective work with the
+collective work of the neighboring domains. This is done independently
+for each Cartesian direction in the system.
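+
+The per-direction adjustment can be illustrated with a one-dimensional
+sketch. The update rule and the use of `gamma` as a damping factor are
+assumptions for illustration only; the library's actual formula may
+differ:
+
+```cpp
+// Shift the border between two neighboring layers so that, assuming a
+// uniform work density inside each layer, both would carry equal work.
+// gamma > 1 damps the correction to avoid oscillations.
+double shiftBorder(double border, double widthLeft, double widthRight,
+                   double workLeft, double workRight, double gamma) {
+    if (workLeft <= 0.0 || workRight <= 0.0) return border;
+    double density = workLeft / widthLeft;     // work per unit width
+    double targetWidth = 0.5 * (workLeft + workRight) / density;
+    double target = (border - widthLeft) + targetWidth;
+    return border + (target - border) / gamma; // damped move
+}
+```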
+
+Required number of vertices:
+ - two, one describing the lower left front point and one describing the
+   upper right back point of the domain
+
+Advantages:
+ - topology of the system is maintained (orthogonal domains, neighbor
+   relations)
+ - no update for the neighbor relations need to be made in the calling
+   code
+ - if a code was able to deal with orthogonal domains, only small
+   changes are expected to include this strategy
+
+Disadvantages:
+ - due to the comparison of collective work loads in cartesian layers
+   and restrictions resulting from this construction, the final result
+   might lead to a sub-optimal domain distribution
+
+#### Staggered grid
+The staggered grid approach is a hierarchical one. In a first step, the
+work of all domains sharing the same Cartesian coordinate with respect
+to the highest dimension (z in three dimensions) is collected. Then, as
+in the TENSOR strategy, the layer width in this dimension is adjusted
+based on a comparison of the collected work with the collected work of
+the neighboring domains in the same dimension. As a second step, each
+of these planes is divided into a set of columns, in which all domains
+share the same Cartesian coordinate in the next lower dimension (y in
+three dimensions). For each of these columns the procedure described
+above is repeated, i.e. the work is collected and the width of the
+columns adjusted accordingly. Finally, in the last step, the work of
+individual domains is compared to that of direct neighbors and the
+width in the last dimension (x in three dimensions) adjusted. This
+leads to a staggered grid that is much better suited to describe
+inhomogeneous work distributions than the TENSOR strategy.
+
+Required number of vertices:
+ - two, one describing the lower left front point and one describing the
+   upper right back point of the domain
+
+Advantages:
+ - very good equalization results for the work load
+ - maintains orthogonal domains
+
+Disadvantages:
+ - changes topology of the domains and requires adjustment of neighbor
+   relations
+ - communication pattern in the calling code might require adjustment to
+   deal with changing neighbors
+
+#### Histogram based staggered grid
+
+The histogram-based method differs from the first two methods in that
+it is a global method requiring three distinct steps for a single
+adjustment. In each of these steps the following takes place: a partial
+histogram of the workload, e.g. the number of particles, is created
+along one direction on each domain; these are supplied to the method,
+which computes a global histogram. From this global histogram a
+distribution function is created, which is used to compute the optimal
+(possible) width of the domains in that direction. For the second and
+third steps, the computation of the global histograms and distribution
+functions takes place in the subsystems resulting from the previous
+step. The result is the optimal distribution of domains achievable with
+the staggered-grid method, at the cost of a global exchange of work due
+to the global adjustment. This makes the method less well suited to
+highly dynamic systems, which would need frequent updates. On the other
+hand, the method is well suited for static problems, e.g. grid-based
+simulations.
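+
+The core idea of turning a global histogram into domain borders along
+one direction can be sketched as follows (illustrative stand-alone
+code, not the library's implementation):
+
+```cpp
+#include <vector>
+
+// Given a global work histogram along one direction, return the bin
+// indices after which to cut so that each of nDomains slices receives
+// roughly the same work (limited by the bin resolution).
+std::vector<int> cutBins(const std::vector<double> &hist, int nDomains) {
+    double total = 0.0;
+    for (double h : hist) total += h;
+    std::vector<int> cuts;          // nDomains - 1 interior cuts
+    double cumulative = 0.0;
+    int next = 1;                   // next target: next*total/nDomains
+    for (std::size_t i = 0; i < hist.size() && next < nDomains; ++i) {
+        cumulative += hist[i];
+        while (next < nDomains && cumulative >= next * total / nDomains) {
+            cuts.push_back((int)i + 1); // cut after bin i
+            ++next;
+        }
+    }
+    return cuts;
+}
+```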
+
+Note: Currently the order of dimensions is: z-y-x.
+
+Required number of vertices:
+ - two, one describing the lower left front point and one describing the
+   upper right back point of the domain
+
+Additional requirements:
+ - partial histogram created over the workload on the local domain in
+   the direction of the current correction step
+
+Advantages:
+ - supplies an optimal distribution of domains (restricted by width of
+   bins used for the histogram)
+ - only three steps needed to acquire result
+
+Disadvantages:
+ - expensive in cost of communication, due to global shifts
+ - requires preparation of histogram and shift of work between each of
+   the three correction steps
+
+
+## Example programs
+In the `/examples` sub directory several C++ and Fortran example
+programs are placed.
+
+### `ALL_test`
+MPI C++ code that generates a particle distribution over the domains.
+At program start the domains form an orthogonal decomposition of the
+cubic 3d system, with each domain having the same volume. Depending on
+the Cartesian coordinates of a domain in this decomposition, a number
+of points is created on that domain; a polynomial formula is used for
+the number of points to create a non-uniform point distribution. The
+points are distributed uniformly within each domain. The number of
+points within a domain serves as the estimate of its work load. There
+is no particle movement included in the current version of the example
+code; therefore, particle communication only takes place due to the
+shift of domain borders.
+
+The program creates three output files in its basic version:
+
+ - `minmax.dat`:
+    - **Columns**: `<iteration count> <W_min/W_avg> <W_max/W_avg>
+      <(W_max-W_min)/(W_max+W_min)>`
+    - **Explanation**: To give a comparable metric for the success of
+      the load balancing procedure, the relative difference of the
+      minimum and maximum work loads in the system with respect to the
+      average work load of the system is given. The last column gives
+      an indicator of the inequality of work in the system; in a
+      perfectly balanced system this value should be equal to zero.
+    
+ - `stddev.txt`:
+    - **Columns**: `<iteration count> <std_dev>`
+    - **Explanation**: The standard deviation of work load over all
+      domains.
+
+ - `domain_data.dat`:
+    - **Columns**: `<rank> <x_min> <x_max> <y_min> <y_max> <z_min>
+      <z_max> <W>`
+    - **Explanation**: The file contains two blocks of rows in this
+      format: the first block is the starting configuration of the
+      domains, which should be a uniform grid of orthogonal domains;
+      the second block is the configuration after 200 load balancing
+      steps. Each line gives the MPI rank of the domain, its extent in
+      each Cartesian direction, and its work load (the number of
+      points).
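+
+For reference, the indicator in the last column of `minmax.dat` can be
+computed as follows (stand-alone illustration):
+
+```cpp
+// Imbalance indicator: (W_max - W_min) / (W_max + W_min).
+// 0 for a perfectly balanced system, approaching 1 for extreme
+// imbalance (W_min near zero).
+double imbalance(double wMin, double wMax) {
+    return (wMax - wMin) / (wMax + wMin);
+}
+```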
+
+If the library is compiled with VTK support, `ALL_test` also creates a VTK
+based output of the domains in each iteration step. This output will be
+placed in a `vtk_outline` sub directory, which is created if it does not
+already exist. The resulting output can be visualized with tools like
+ParaView or VisIt.
+
+### `ALL_test_f` and `ALL_test_f_obj`
+These Fortran examples provide a more basic version of the test program
+`ALL_test`, as their main goal is to show the functionality of the
+Fortran interface. The code creates a basic uniform orthogonal domain
+decomposition and an inhomogeneous particle distribution over these
+domains. Only one load balancing step is executed, and the program
+prints the domain distribution of the start configuration and of the
+final configuration.
+
+### `ALL_Staggered` and `ALL_Staggered_f`
+These create a very simple load balancing situation with a fixed load
+and domains placed along the z direction. The C++ and Fortran versions
+are functionally identical, so the differences between the interfaces
+can be checked.
+
+<!-- vim: set tw=72 et ts=4 spell spelllang=en_us ft=markdown: -->
-- 
GitLab