Commit 65c196b6 authored by Stephan Schulz

move information from README to sphinx documentation

parent c7b0a5bd
1 merge request: !8 Refactor
Pipeline #54077 failed
README.md

load balancing into particle based simulation codes. The library is
developed in the Simulation Laboratory Molecular Systems of the Jülich
Supercomputing Centre at Forschungszentrum Jülich.

Only a brief summary is given here; more information, such as a
detailed API description, examples and further details on the load
balancing methods, can be found in the
[official documentation](docs/index.rst).

## Installation and Requirements
Optional requirements:

   `$ALL_ROOT_DIR`.
2. Create the build directory `$ALL_BUILD_DIR` some place else.
3. Call `cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"` to set up the
   installation. To use a specific compiler and Boost installation
   use: `CC=gcc CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]`.
4. To build and install the library then run: `cmake --build
   "$ALL_BUILD_DIR"` and `cmake --install "$ALL_BUILD_DIR"`.
   Afterwards, the built examples and library files are placed in
   `$ALL_INSTALL_DIR`.
<!-- vim: set tw=72 et ts=4 spell spelllang=en_us ft=markdown: -->
docs/Examples.rst

.. _examples:

Examples
========

In the ``/examples`` sub directory several C++ and Fortran example
programs are placed.

``ALL_test``
------------
MPI C++ code that generates a particle distribution over the domains. At
program start the domains form an orthogonal decomposition of the cubic
3d system, with every domain having the same volume. Depending on the
cartesian coordinates of a domain in this decomposition, a number of
points is created on that domain; a polynomial formula is used for the
number of points, creating a non-uniform point distribution. Within each
domain, the points are distributed uniformly. The number of points
within a domain serves as the estimate of the work load in this program.
There is no particle movement included in the current version of the
example code; therefore particle communication only takes place due to
the shift of domain borders.
The program creates three output files in its basic version:
``minmax.dat``
  Columns
    ``<iteration count> <W_min/W_avg> <W_max/W_avg>
    <(W_max-W_min)/(W_max+W_min)>``
  Explanation
    To give a comparable metric for the success of the load balancing
    procedure, the minimum and maximum work loads in the system are
    given relative to the average work load of the system. The last
    column is an indicator for the inequality of work in the system; in
    a perfectly balanced system this value should be equal to zero (a
    numerical example follows the list).
``stddev.txt``
  Columns
    ``<iteration count> <std_dev>``
  Explanation
    The standard deviation of the work load over all domains.
``domain_data.dat``
  Columns
    ``<rank> <x_min> <x_max> <y_min> <y_max> <z_min> <z_max> <W>``
  Explanation
    There are two blocks of rows: the first block is the starting
    configuration of the domains, which should be a uniform grid of
    orthogonal domains; the second block is the configuration after 200
    load balancing steps. Each line gives the MPI rank of the domain,
    its extent in each cartesian direction, and its work load (the
    number of points).
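As a numerical example for ``minmax.dat``: with W_min = 8, W_max = 12
and W_avg = 10, a line would read ``<iteration> 0.8 1.2 0.2``, since
(12 - 8)/(12 + 8) = 0.2.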
If the library is compiled with VTK support, ``ALL_test`` also creates a
VTK based output of the domains in each iteration step. This output will
be placed in a ``vtk_outline`` sub directory, which is created if it does
not already exist. The resulting output can be visualized with tools like
ParaView or VisIt.
``ALL_test_f`` and ``ALL_test_f_obj``
-------------------------------------
These Fortran examples provide a more basic version of the test program
``ALL_test``; their main goal is to show the functionality of the
Fortran interface. The code creates a basic uniform orthogonal domain
decomposition and an inhomogeneous particle distribution over it. Only
one load balancing step is executed, and the program prints out the
domain distribution of the start configuration and of the final
configuration.
``ALL_Staggered`` and ``ALL_Staggered_f``
-----------------------------------------
These create a very simple load balancing situation with fixed load and
domains placed along the z direction. The C++ and Fortran versions are
functionally the same, so the differences between the interfaces can be
checked.
.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
docs/Install.rst

.. _install:

Installation
============
Requirements
------------
Base requirements
- C++11 capable compiler
- MPI support
- CMake v.3.14 or higher
Optional requirements
- Fortran 2003 capable compiler (Fortran interface)
- Fortran 2008 capable compiler (Usage of the ``mpi_f08`` interface)
- VTK 7.1 or higher (Domain output)
- Boost testing utilities
- Doxygen and Sphinx with ``breathe`` (Documentation)
Installation
------------
1. Clone the library from
``https://gitlab.version.fz-juelich.de/SLMS/loadbalancing`` into
``$ALL_ROOT_DIR``.
2. Create the build directory ``$ALL_BUILD_DIR`` some place else.
3. Call ``cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"`` to set up the
   installation. Compile time features (listed below) should be set
   here. To use a specific compiler and Boost installation use: ``CC=gcc
   CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]``.
4. To build and install the library then run: ``cmake --build
"$ALL_BUILD_DIR"`` and ``cmake --install "$ALL_BUILD_DIR"``. Afterwards,
the built examples and library files are placed in ``$ALL_INSTALL_DIR``.
Compile time features
*********************
``-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR`` (default: system dependent)
  Install into ``$ALL_INSTALL_DIR`` after compiling.
``-DCM_ALL_VTK_OUTPUT=ON`` (default: ``OFF``)
  Enable VTK output of domains. Requires VTK 7.1 or higher.
``-DCM_ALL_FORTRAN=ON`` (default: ``OFF``)
  Enable the Fortran interface. Requires a Fortran 2003 capable compiler.
``-DCM_ALL_USE_F08=ON`` (default: ``OFF``)
  Enable usage of the ``mpi_f08`` module for MPI. Requires a Fortran
  2008 capable compiler and a compatible MPI installation.
``-DCM_ALL_FORTRAN_ERROR_ABORT=ON`` (default: ``OFF``)
  Abort execution on any error when using the Fortran interface instead
  of setting the error number and leaving error handling to the user.
``-DCM_ALL_VORONOI=ON`` (default: ``OFF``)
  Enable the Voronoi mesh method and subsequently the compilation and
  linkage of Voro++.
``-DCM_ALL_TESTING=ON`` (default: ``OFF``)
  Turn on unit and feature tests. The tests can be run from the build
  directory with ``ctest``. Requires the Boost test utilities.
``-DCMAKE_BUILD_TYPE=Debug`` (default: ``Release``)
  Enable library internal debugging features. Using ``DebugWithOpt``
  also turns on some optimizations.
``-DCM_ALL_DEBUG=ON`` (default: ``OFF``)
  Enable self consistency checks within the library itself.
``-DVTK_DIR=$VTK_DIR`` (default: system dependent)
  Path to an explicit VTK installation. Make sure that VTK is compiled
  with the same MPI as the library and subsequently your code.
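As a sketch, a configuration that installs into a custom prefix with the
Fortran interface and the tests enabled would combine these options as
``cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"
-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR -DCM_ALL_FORTRAN=ON
-DCM_ALL_TESTING=ON``, followed by the build and install commands from
step 4.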
.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
docs/Usage.rst

.. _usage:

Usage
=====
The library is presented as a single load balancing object ``ALL::ALL``
from ``ALL.hpp``. All classes are encapsulated in the ``ALL`` namespace.
Simple (and not so simple) examples are available in the ``/examples``
sub directory. The simple examples are also documented at length.

Errors are treated as exceptions: anything thrown is a (sub)class of
``ALL::CustomException``. Likely candidates that may throw exceptions
are ``balance``, ``setup`` and ``printVTKoutlines``.
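A minimal sketch of this error handling, assuming a fully set up
balancer ``all`` and that ``ALL::CustomException`` exposes its
description via a ``what()`` member (as is conventional for exception
types), could look like:

.. code-block:: c++

   #include <iostream>
   #include <mpi.h>

   try {
       all.balance();  // may throw a subclass of ALL::CustomException
   } catch (const ALL::CustomException &e) {
       // assumption: what() returns a description of the error
       std::cerr << "ALL error: " << e.what() << std::endl;
       MPI_Abort(MPI_COMM_WORLD, 1);  // fail consistently on all ranks
   }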
ALL object
----------
The ALL object can be constructed with

.. code-block:: c++

   ALL::ALL<T,W> (
       const ALL::LB_t method,
       const int dimension,
       const T gamma)
where ``T`` is the type of the boundaries and ``W`` the type of the
work, both usually ``float`` or ``double``. The load balancing method
must be one of ``ALL::LB_t::TENSOR``, ``...STAGGERED``,
``...FORCEBASED``, ``...VORONOI`` or ``...HISTOGRAM``, where the Voronoi
method must be enabled at compile time. There is also a second form
where the initial domain vertices are already provided:
.. code-block:: c++

   ALL::ALL<T,W> (
       const ALL::LB_t method,
       const int dimension,
       const std::vector<ALL::Point<T>> &initialDomain,
       const T gamma)
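For illustration, a sketch of constructing a three-dimensional balancer
with ``double`` boundaries and work using the staggered grid method (the
``gamma`` value here is only an arbitrary placeholder):

.. code-block:: c++

   #include <ALL.hpp>

   // 3d staggered-grid balancer; T = W = double,
   // gamma = 4.0 is a placeholder value for this sketch
   ALL::ALL<double, double> balancer(ALL::LB_t::STAGGERED, 3, 4.0);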
Point object
------------
The domain boundaries are given and retrieved as a vector of
``ALL::Point`` objects. These can be constructed via

.. code-block:: c++

   ALL::Point<T> ( const int dimension )
   ALL::Point<T> ( const int dimension, const T *values )
   ALL::Point<T> ( const std::vector<T> &values )

and their elements can be accessed through the ``[]`` operator like a
``std::vector``.
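A short sketch of building the two vertices of an orthogonal unit-cube
domain (lower left front and upper right back corner) and accessing an
element:

.. code-block:: c++

   #include <vector>

   std::vector<ALL::Point<double>> vertices = {
       ALL::Point<double>(std::vector<double>{0.0, 0.0, 0.0}),
       ALL::Point<double>(std::vector<double>{1.0, 1.0, 1.0})};
   double zMax = vertices[1][2];  // element access like a std::vector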
Set up
------
A few additional parameters must be set before the domains can be
balanced. Exactly which ones need to be set depends on the method used,
but in general the following are necessary:

.. code-block:: c++

   void ALL::ALL<T,W>::setVertices ( std::vector<ALL::Point<T>> &vertices )
   void ALL::ALL<T,W>::setCommunicator ( MPI_Comm comm )
   void ALL::ALL<T,W>::setWork ( const W work )
If the given communicator is not a cartesian communicator, the process
grid parameters must also be set beforehand (!) using

.. code-block:: c++

   void ALL::ALL<T,W>::setProcGridParams (
       const std::vector<int> &location,
       const std::vector<int> &size)
with the location of the current process and the number of processes in
each direction.

To better trace the current domain, an integer tag, which is used in the
domain output, can be provided with

.. code-block:: c++

   void ALL::ALL<T,W>::setProcTag ( int tag )
and an observed minimal domain size in each direction can be set using

.. code-block:: c++

   void ALL::ALL<T,W>::setMinDomainSize (
       const std::vector<T> &minSize )

Once these parameters are set, call

.. code-block:: c++

   void ALL::ALL<T,W>::setup()

so that the internal set up routines are run.
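Continuing the sketches above, a complete set up might read as follows;
``localWork``, ``location`` and ``size`` are placeholders assumed to be
provided by the calling code, and the communicator is assumed not to be
cartesian:

.. code-block:: c++

   ALL::ALL<double, double> balancer(ALL::LB_t::STAGGERED, 3, 4.0);

   balancer.setVertices(vertices);             // two corner points, see above
   balancer.setCommunicator(MPI_COMM_WORLD);
   balancer.setWork(localWork);                // e.g. local particle count
   balancer.setProcGridParams(location, size); // needed: non-cartesian comm
   balancer.setup();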
Balancing
---------
To create the new domain vertices, make sure the current vertices are
set using ``setVertices`` and the work is set using ``setWork``, then
call

.. code-block:: c++

   void ALL::ALL<T,W>::balance()

The new vertices can then be retrieved using

.. code-block:: c++

   std::vector<ALL::Point<T>> &ALL::ALL<T,W>::getVertices ()

and the new neighbors with

.. code-block:: c++

   std::vector<int> &ALL::ALL<T,W>::getNeighbors ()
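Combined with the set up sketch above, one load balancing step inside a
simulation loop could then be sketched as:

.. code-block:: c++

   balancer.setVertices(vertices);  // current domain corners
   balancer.setWork(localWork);     // current local work
   balancer.balance();

   vertices = balancer.getVertices();                    // new corners
   std::vector<int> neighbors = balancer.getNeighbors(); // new neighbor ranks
   // the calling code now migrates particles that left the local domain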
Special considerations for the Fortran interface
------------------------------------------------
The Fortran interface exists in two versions. The first is of the form

.. code-block:: fortran

   ALL_balance(all_object)

where the ALL object of ``type(ALL_t)`` is given as the first argument
to each function. The second is more object oriented, with the functions
bound to the object itself, so the alternative call would be

.. code-block:: fortran

   all_object%balance()
The function names themselves are similar: instead of camel case they
are all lowercase, except the leading ``ALL_``, with underscores between
words. So ``setProcGridParams`` becomes ``ALL_set_proc_grid_parms``. For
more usage info please check the Fortran example programs.
One important difference is the handling of MPI types, which differs
between the MPI Fortran 2008 interface as given by the ``mpi_f08``
module and previous interfaces. In previous interfaces all MPI types are
``INTEGER``\ s, but with the ``mpi_f08`` module the communicator is of
``type(MPI_Comm)``. To allow using ``ALL_set_communicator`` directly
with the communicator used in the user's application, build the library
with the Fortran 2008 features enabled; this communicator type is then
expected.
Retrieving information from the balancer is also different: most getters
return (a reference to) an object, whereas the Fortran subroutines set
the values of their arguments. As an example

.. code-block:: c++

   int work = all.getWork();

becomes

.. code-block:: fortran

   integer(c_int) :: work
   call ALL_get_work(work) !or
   !call all%get_work(work)
Since there is no exception handling in Fortran, the error handling is
borrowed from libc: there are additional functions that retrieve and
reset an error number, as well as an error message describing the error.
These must be queried and handled by the user. An example of this usage
is shown in the ``ALL_Staggered_f`` example when printing the VTK
outlines.
.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
docs/index.rst

Welcome to the documentation of ALL

   :caption: Contents:
   :includehidden:

   Install.rst
   Usage.rst
   Examples.rst
   api/index.rst
   methods/index.rst

The full (developer) documentation of the source code as provided by
doxygen can be found `here <../doxygen/index.html>`_.

Indices and tables
==================
docs/methods/Histogram.rst

.. _histogramDetails:

Histogram method
================
The histogram-based method differs from the first two methods in that it
is a global method, requiring three distinct steps for a single
adjustment. In each of these steps the following takes place: a partial
histogram of the workload, e.g. the number of particles, is created
along one direction on each domain; these are supplied to the method,
which computes a global histogram. From this global histogram a
distribution function is created, which is used to compute the optimal
(possible) width of the domains in that direction. For the second and
third steps the computation of the global histograms and distribution
functions takes place in the subsystems resulting from the previous
step. The result is the best distribution of domains achievable with the
staggered-grid method, at the cost of a global exchange of work. Due to
this global adjustment and the need for frequent updates, the method is
not well suited to highly dynamic systems. On the other hand the method
is well suited for static problems, e.g. grid-based simulations.
Note: Currently the order of dimensions is: z-y-x.
Required number of vertices
- two, one describing the lower left front point and one describing the
upper right back point of the domain
Additional requirements
- partial histogram created over the workload on the local domain in
the direction of the current correction step
Advantages
- supplies an optimal distribution of domains (restricted by width of
bins used for the histogram)
- only three steps needed to acquire result
Disadvantages
- expensive in cost of communication, due to global shifts
- requires preparation of histogram and shift of work between each of
the three correction steps
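The partial histogram mentioned above is not computed by the library
itself. A sketch of how the calling code might build one over particle
positions along the current correction direction (here z), assuming a
histogram layout of ``zMin``, ``binWidth`` and ``nBins`` agreed on by
all processes, with one particle counting as one unit of work:

.. code-block:: c++

   #include <cmath>
   #include <vector>

   // Hypothetical helper: count particles per bin along one direction.
   std::vector<double> partialHistogram(const std::vector<double> &zCoords,
                                        double zMin, double binWidth,
                                        int nBins) {
       std::vector<double> hist(nBins, 0.0);
       for (double z : zCoords) {
           int bin = static_cast<int>(std::floor((z - zMin) / binWidth));
           if (bin >= 0 && bin < nBins)
               hist[bin] += 1.0;  // one particle = one unit of work
       }
       return hist;
   }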
.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
docs/methods/Staggered.rst

.. _staggeredDetails:

Staggered method
================
The staggered grid approach is a hierarchical one. In a first step the
work of all domains sharing the same cartesian coordinate with respect
to the highest dimension (z in three dimensions) is collected. Then, as
in the TENSOR strategy, the layer width in this dimension is adjusted
based on a comparison of the collected work with the collected work of
the neighboring domains in the same dimension. In a second step each of
these planes is divided into a set of columns, where all domains share
the same cartesian coordinate in the next lower dimension (y in three
dimensions). For each of these columns the procedure described before is
repeated, i.e. the work is collected and the width of the columns
adjusted accordingly. Finally, in the last step, the work of individual
domains is compared to direct neighbors and the width in the last
dimension (x in three dimensions) is adjusted. This leads to a staggered
grid that is much better suited to describe inhomogeneous work
distributions than the TENSOR strategy.
Required number of vertices
- two, one describing the lower left front point and one describing the
upper right back point of the domain
Advantages
- very good equalization results for the work load
- maintains orthogonal domains
Disadvantages
- changes topology of the domains and requires adjustment of neighbor
relations
- communication pattern in the calling code might require adjustment to
deal with changing neighbors
.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
docs/methods/Tensor.rst

.. _tensorDetails:

Tensor method
=============
In order to equalize the load of individual domains, the assumption is
made that this can be achieved by equalizing the work in each cartesian
direction: the work of all domains having the same coordinate in a
cartesian direction is collected, and the width of all these domains in
this direction is adjusted by comparing this collective work with the
collective work of the neighboring domains. This is done independently
for each cartesian direction in the system.
Required number of vertices
- two, one describing the lower left front point and one describing the
upper right back point of the domain
Advantages

- topology of the system is maintained (orthogonal domains, neighbor
  relations)
- no update of the neighbor relations is required in the calling code
- if a code can already deal with orthogonal domains, only small
  changes are expected to include this strategy

Disadvantages

- due to the comparison of collective work loads in cartesian layers
  and the restrictions resulting from this construction, the final
  result might be a sub-optimal domain distribution
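As an illustration of the plane-wise collection of work described above
(purely a sketch, not part of the ALL API), the collective work of all
domains sharing a z coordinate could be obtained via a communicator
split; ``zCoord`` and ``localWork`` are placeholders for the process'
cartesian z coordinate and its local work:

.. code-block:: c++

   #include <mpi.h>

   // Sketch: sum the work of all domains in the same z layer.
   double layerWork(MPI_Comm comm, int zCoord, double localWork) {
       int rank;
       MPI_Comm_rank(comm, &rank);
       MPI_Comm layer;
       MPI_Comm_split(comm, zCoord, rank, &layer);  // one comm per z layer
       double total = 0.0;
       MPI_Allreduce(&localWork, &total, 1, MPI_DOUBLE, MPI_SUM, layer);
       MPI_Comm_free(&layer);
       return total;
   }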
.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
docs/methods/index.rst

The Loadbalancing Methods
=========================

.. toctree::
   :hidden:

   Tensor.rst
   Staggered.rst
   Histogram.rst

A short overview of the methods is given below; more details can be
found on the respective pages.
:ref:`Tensor product<tensorDetails>`
  The work on all processes is reduced over the cartesian planes in the
  system. This work is then equalized by adjusting the borders of the
  cartesian planes.
:ref:`Staggered grid<staggeredDetails>`
  A 3-step hierarchical approach is applied, where:

  1. work over the cartesian planes is reduced, before the borders of
     these planes are adjusted
  2. in each of the cartesian planes the work is reduced for each
     cartesian column. These columns are then adjusted relative to each
     other to homogenize the work in each column
  3. the work between neighboring domains in each column is adjusted.

  Each adjustment is done locally with the neighboring planes, columns
  or domains by adjusting the adjacent boundaries.
:ref:`Topological mesh<topologicalDetails>` (WIP)
  In contrast to the previous methods, this method adjusts domains not
  by moving boundaries but by moving vertices, i.e. corner points, of
  domains. For each vertex a force, based on the differences in work of
  the neighboring domains, is computed, and the vertex is shifted so as
  to equalize the work between these neighboring domains.
:ref:`Voronoi mesh<voronoiDetails>` (WIP)
  Similar to the topological mesh method, this method computes a force
  based on work differences. In contrast to the topological mesh method,
  the force acts on a Voronoi point rather than a vertex, i.e. on a
  point defining a Voronoi cell, which describes the domain.
  Consequently, the number of neighbors is not a conserved quantity,
  i.e. the topology may change over time. ALL uses the Voro++ library
  published by the Lawrence Berkeley National Laboratory for the
  generation of the Voronoi mesh.
:ref:`Histogram based staggered grid<histogramDetails>`
  Resulting in the same kind of grid as the staggered grid scheme, this
  scheme uses the cumulative work function in each of the three
  cartesian directions in order to generate the grid. Using histograms
  and the previously defined distribution of process domains in a
  cartesian grid, this scheme generates in three steps a staggered-grid
  result, in which the work is distributed as evenly as the resolution
  of the underlying histogram allows. In contrast to the previously
  mentioned schemes this scheme depends on a global exchange of work
  between processes.
:ref:`Orthogonal recursive bisection<orthogonalDetails>` (Planned)
  Comparable to the histogram based staggered grid scheme, this scheme
  uses cumulative work functions evaluated by histograms in order to
  find a new distribution of workload in a hierarchical manner. In
  contrast to the histogram based staggered grid scheme, the subdivision
  of domains is not based on a cartesian division of processes, but on
  the prime factorization of the total number of processes. During each
  step, the current slice of the system is divided along its longest
  edge into smaller sub-slices that roughly contain the same workload.
  The number of sub-slices is determined by the corresponding prime
  number of that step, e.g. when trying to establish a distribution for
  24 (3 * 2 * 2 * 2) processes, the bisection would use four steps. In
  the first step the system would be subdivided into 3 sub domains along
  the largest edge, each containing roughly the same workload. Each of
  these sub domains is then independently divided into two smaller sub
  domains. This procedure is then repeated for all the prime numbers in
  the prime factorization.
.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us: