diff --git a/README.md b/README.md
index 002c9b12a67799397641df13ecde08e9737527e4..459a8d95423ba793af5fd95af014900338aa9ced 100644
--- a/README.md
+++ b/README.md
@@ -1,354 +1,44 @@
-A Load Balancing Library (ALL)
-
-The library aims to provide an easy way to include dynamic domain-based load balancing
-into particle based simulation codes. The library is developed in the Simulation Laboratory Molecular Systems of the Juelich Supercomputing Centre at Forschungszentrum Juelich. 
-
-It includes or is going to include several load-balancing schemes. The following list gives an overview of these schemes with short explanations ([o]: scheme fully included, [w]: implementation ongoing, [p]: implementation planned)
-
-- [o] Tensor-Product:      
-The work on all processes is reduced over the cartesian 
-planes in the system. This work is then equalized by 
-adjusting the borders of the cartesian planes.
-                 
-- [o] Staggered-grid:     
-A 3-step hierarchical approach is applied, where:
-(i) work over the cartesian planes is reduced, before the borders of these 
-planes are adjusted; (ii) in each of the cartesian planes the work is 
-reduced for each cartesian column. These columns are then adjusted to each
-other to homogenize the work in each column; (iii) the work between 
-neighboring domains in each column is adjusted. Each adjustment is done
-locally with the neighboring planes, columns or domains by adjusting the 
-adjacent boundaries.
-                    
-- [w] Topological Mesh:   
-In contrast to the previous methods this method adjusts 
-domains not by moving boundaries but vertices, i.e. corner points, of
-domains. For each vertex a force, based on the differences
-in work of the neighboring domains, is computed and the
-vertex is shifted in a way to equalize the work between these
-neighboring domains.
-                    
-- [w] Voronoi Mesh:       
-Similar to the topological mesh method, this method computes a
-force, based on work differences. In contrast to the topological mesh
-method, the force acts on a Voronoi point rather than a vertex, i.e. a 
-point defining a Voronoi cell, which describes the domain. Consequently,
-the number of neighbors is not a conserved quantity, i.e. the topology
-may change over time. ALL uses the Voro++ library published by the
-Lawrence Berkeley National Laboratory for the generation of the Voronoi mesh.
-
-- [o] Histogram-based Staggered-grid:
-Resulting in the same grid as the staggered-grid scheme, this scheme uses
-the cumulative work function in each of the three cartesian directions in
-order to generate this grid. Using histograms and the previously defined
-distribution of process domains in a cartesian grid, this scheme generates
-in three steps a staggered-grid result, in which the work is distributed as
-evenly as the resolution of the underlying histogram allows. In contrast to
-the previously mentioned schemes this scheme depends on a global exchange of
-work between processes.
-
-- [p] Orthogonal Recursive Bisection
-Comparable to the Histogram-based Staggered-grid scheme, this scheme uses
-cumulative work functions evaluated by histograms in order to find a new
-distribution of workload in a hierarchical manner. In contrast to the 
-Histogram-based Staggered-grid scheme, the subdivision of domains is not
-based on a cartesian division of processes, but on the prime factorization
-of the total number of processes. During each step, the current slice of the
-system is distributed into smaller sub-slices along the longest edge that 
-roughly contain the same workload. The number of sub-slices is then determined
-by the corresponding prime number of that step, i.e. when trying to establish
-a distribution for 24 (3 * 2 * 2 * 2) processes, the bisection would use four
-steps. In the first step the system would be subdivided into 3 subdomains along
-the largest edge, each containing roughly the same workload. Each of these subdomains
-is then divided into two sub-subdomains independently. This procedure is then repeated
-for all the prime numbers in the prime factorisation.
-
-
-Installation & Requirements:
-
-    Base requirements:
-        C++11
-        MPI
-        CMake 3.14+
-
-    Optional requirements:
-        Fortran 2003
-        Fortran 2008 (if the more modern Fortran MPI interface is to be used)
-        VTK 7.1+ (for VTK-based output of the domains)
-        Boost testing utilities
-        Doxygen
-        Sphinx (with breathe extension)
-
-    Installation:
-
-        1.) Either clone the library from 
-                https://gitlab.version.fz-juelich.de/SLMS/loadbalancing
-            or download it from the same location into a directory on
-            your system ($ALL_ROOT_DIR).
-
-        2.) Create a build-directory $ALL_BUILD_DIR in $ALL_ROOT_DIR, change into $ALL_BUILD_DIR and call
-                cmake $ALL_ROOT_DIR <options>
-            which sets up the installation. There are some optional features,
-            you can set up with the following options:
-                -DCMAKE_INSTALL_PREFIX=<$ALL_INSTALL_DIR> [default: depends on system]
-                    sets the directory $ALL_INSTALL_DIR, into which `make install` copies the compiled library
-                    and examples
-                -DCM_ALL_VTK_OUTPUT=ON/OFF [default: OFF]
-                    enables/disables the VTK based output of the domains (requires VTK 7.1+)
-                -DCM_ALL_FORTRAN=ON/OFF [default: OFF]
-                    compiles the Fortran interface and example
-                -DCM_ALL_VORONOI=ON/OFF [default: OFF]
-                    activates the compilation of the Voronoi-mesh scheme (and the compilation of Voro++)
-                -DCM_ALL_USE_F08=ON/OFF [default: OFF]
-                    activates the Fortran 2008 MPI interface
-                -DCM_ALL_TESTING=ON/OFF [default: OFF]
-                    activates unit tests, which can be run with make test
-
-        3.) Execute make to compile and install the library to the previously set
-            directory:
-                make
-                make install
-
-    After "make install" the compiled library and the compiled examples are located in the directory $ALL_INSTALL_DIR.
-
-Usage:
-    
-    ALL uses C++ template programming to deal with different data types that describe domain
-boundaries and domain based work. In order to encapsulate the data, ALL uses a class in which
-    required data and the computed results of a load-balancing routine are saved and can be
-    accessed from. To include the library to an existing code, you need to do the following steps:
-
-    1.) Create an object of the load-balancing class:
-       
-            ALL<T,W> ()
-            ALL<T,W> ( int dimension, T gamma )  
-            ALL<T,W> ( int dimension, std::vector<ALL_Point<T>> vertices, T gamma )
-
-        As mentioned before the library uses template programming, where T is the data type used
-        to describe boundaries between domains and vertices (usually float or double) and W is the
-        data type used to describe the work-load of a process (usually float or double).
-        The first version of the constructor defines a base object of the load-balancing class that
-        contains no data, using a three-dimensional system. The second constructor sets up the
-        system dimension (currently only three-dimensional systems are supported) and the relaxation
-        parameter gamma, which controls the convergence of the load-balancing methods. In the third described version 
-        of the ALL<T,W> constructor a set of vertices describing the local domain is already passed, using the
-        ALL_Point Class, described below.
-
-            ALL_Point<T>( const int dimension )
-            ALL_Point<T>( const int dimension, const T* values )
-            ALL_Point<T>( const std::vector<T>& values )
-
-            T& ALL_Point<T>::operator[]( const int index )
-
-        ALL_Point is a class describing a point in space, where the dimension of the space is given
-        by the input parameter. It can be initialized by either using an array of datatype T or a
-        std::vector<T>. In the latter case the passing of a dimension is not required and the dimension
-        of the point is derived from the length of the std::vector. For the initialization with an
-        array the user has to check that the passed array is of sufficient length, i.e. of length
-        dimension (or longer).
-        To update or access initialized ALL_Point objects, the [] operator is overloaded and the n-th
-component of an ALL_Point object `p` can be updated or accessed by p[n].
-
-    2.) Setup basic parameters of the system:
-        
-        There are three input parameters that are required for the library:
-
-            a) vertices describing the local domain
-            b) work load for the local domain
-            c) cartesian MPI communicator on which the program is executed
-               (requirement of cartesian communicator is under review)
-
-        These parameters can be set with the following methods:
-
-            void ALL<T,W>::set_vertices(std::vector<ALL_Point<T>>& vertices)
-            void ALL<T,W>::set_vertices(const int n, const int dimension, const T*)
-
-            void ALL<T,W>::set_communicator(MPI_Comm comm)
-
-            void ALL<T,W>::set_work(const W work)
-       
-        The vertices can be set by either using a std::vector of ALL_Point data, from which the number
-        of vertices and the dimension of each point can be derived, or by passing an array of data type
-        T, which requires that the number of vertices and the dimension of the vertices are also passed.
-        For the MPI communicator the MPI communicator used by the calling program needs to be passed and
-        for the work a single value of data type W needs to be passed.
-
-    3.) Setting up the chosen load balancing routine
-
-        To set up the required internal data structures a call of the following method is required:
-
-            void ALL<T,W>::setup( short method )
-
-            with `method` being one of:
-            
-            ALL_LB_t::TENSOR
-            ALL_LB_t::STAGGERED
-            ALL_LB_t::HISTOGRAM
-
-        With the keyword method the load balancing strategy is chosen, given by the list above. The starting 
-        point for all methods described below is a given domain structure (e.g. the one which is initially 
-        set up by the program). In the case of TENSOR and STAGGERED, the domains need to be orthogonal. 
-        For these two methods, the procedure to adjust the work load between domains is a multi-step 
-        approach where in each step, sets of domains are combined into super-domains, the work of which 
-        is mutually adjusted. 
-
-        Short overview about the methods:
-
-            ALL_LB_t::TENSOR
-            
-                In order to equalize the load of individual domains, the assumption is made that this
-                can be achieved by equalizing the work in each cartesian direction, i.e. the work
-                of all domains having the same coordinate in a cartesian direction is collected and
-                the width of all these domains in this direction is adjusted by comparing this
-                collective work with the collective work of the neighboring domains. This is done
-                independently for each cartesian direction in the system.
-
-                Required number of vertices:
-                    two, one describing the lower left front point and one describing the 
-                    upper right back point of the domain
-
-                Advantages:
-                    - topology of the system is maintained (orthogonal domains, neighbor relations)
-                    - no update for the neighbor relations need to be made in the calling code
-                    - if a code was able to deal with orthogonal domains, only small changes are 
-                      expected to include this strategy
-                
-                Disadvantages:
-                    - due to the comparison of collective work loads in cartesian layers and 
-                      restrictions resulting from this construction, the final result might 
-                      lead to a sub-optimal domain distribution
-
-            ALL_LB_t::STAGGERED:
-
-                The staggered grid approach is a hierarchical one. In a first step the work of all
-                domains sharing the same cartesian coordinate with respect to the highest dimension
-                (z in three dimensions) is collected. Then, like in the TENSOR strategy the layer width in
-                this dimension is adjusted based on comparison of the collected work with the collected
-                work of the neighboring domains in the same dimension. As a second step each of these planes
-                is divided into a set of columns, where all domains share the same cartesian coordinate
-                in the next lower dimension (y in three dimensions). For each of these columns the
-                before described procedure is repeated, i.e. work collected and the width of the
-                columns adjusted accordingly. Finally, in the last step the work of individual domains
-                is compared to direct neighbors and the width in the last dimension (x in three
-                dimensions) adjusted. This leads to a staggered grid that is much better suited to
-                describe inhomogeneous work distributions than the TENSOR strategy.
-
-                Required number of vertices:
-                    two, one describing the lower left front point and one describing the 
-                    upper right back point of the domain
-
-                Advantages:
-                    - very good equalization results for the work load
-                    - maintains orthogonal domains
-
-                Disadvantages:
-                    - changes topology of the domains and requires adjustment of neighbor relations
-                    - communication pattern in the calling code might require adjustment to deal
-                      with changing neighbors
-
-            ALL_LB_t::HISTOGRAM:
-
-                The histogram-based method differs from the first two methods in that it is a
-                global method requiring three distinct steps for a single adjustment. In each of these steps
-                the following takes place: a partial histogram needs to be created over the workload, e.g. number of particles,
-                along one direction on each domain, then these are supplied to the method, where a global histogram
-                is computed. With this global histogram a distribution function is created. This is used to compute the optimal
-                (possible) width of domains in that direction. For the second and third steps the computation of the global
-                histograms and distribution functions take place in subsystems, being the results of the previous step. 
-                The result is the optimal distribution of domains achievable with the Staggered-grid method, at the cost of
-                a global exchange of work. This global adjustment makes the method ill suited to highly dynamic
-                systems, which need frequent updates. On the other hand the method is well suited for static problems,
-                e.g. grid-based simulations.
-
-                Note: Currently the order of dimensions is: z-y-x.
-
-                Required number of vertices:
-                    two, one describing the lower left front point and one describing the 
-                    upper right back point of the domain
-
-                Additional requirements:
-                    partial histogram created over the workload on the local domain in the direction of the current correction step
-
-                Advantages:
-                    - supplies an optimal distribution of domains (restricted by width of bins used for the histogram)
-                    - only three steps needed to acquire result
-
-                Disadvantages:
-                    - expensive in cost of communication, due to global shifts
-                    - requires preparation of histogram and shift of work between each of the three correction steps
-
-    4.) Computing new boundaries / vertices
-
-            void ALL<T,W>::balance(short method)
-        
-        The balance method starts the computation of the new vertices, according to the chosen method.
-        The chosen method must be the same as the one used in the call to the setup method. If
-        required, new neighbor relations are computed as well.
-
-     5.) Accessing the results
-
-            std::vector<ALL_Point<T>>& ALL<T,W>::get_result_vertices()
-            
-            int ALL<T,W>::get_n_neighbors(short method)
-            void ALL<T,W>::get_neighbors(short method, int** neighbors)
-
-            void ALL<T,W>::get_neighbors(short method, std::vector<int>& neighbors)
-
-
-        In order to access the resulting vertices and neighbors, the above three methods can be used. If the
-        array version of get_neighbors is used, the address to the neighbor array is returned in
-        neighbors, therefore an int** is passed.
-
-Examples:
-
-    In the distribution two example codes are included, one written in C++, one written in Fortran. The main
-    goal of these examples is to show how the library can be included into a particle code.
-
-    ALL_test: 
-
-        MPI C++ code that generates a particle distribution over the domains. At program start the domains form
-        an orthogonal decomposition of the cubic 3d system. Each domain has the same volume as each other domain.
-        Depending on the cartesian coordinates of the domain in this decomposition, a number of points is created
-        on each domain. The points are then distributed uniformly over the domain. For the number of points a
-        polynomial formula is used to create a non-uniform point distribution. As an estimate of the work load
-        in this program the number of points within a domain was chosen. There is no particle movement included
-        in the current version of the example code, therefore particle communication only takes place due to the
-        shift of domain borders.
-
-        The program creates three output files in its basic version:
-
-            minmax.dat:
-                contains four columns:
-                    <iteration count> <W_min/W_avg> <W_max/W_avg> <(W_max-W_min)/(W_max+W_min)>
-                explanation:
-                    In order to give a comparable metric for the success of the load balancing procedure,
-                    the relative difference of the minimum and maximum work loads in the system in relation
-                    to the average work load of the system is given. The last column gives an indicator for
-                    the inequality of work in the system; in a perfectly balanced system, this value should be
-                    equal to zero.
-            
-            stddev.txt:
-                contains two columns:
-                    <iteration count> <std_dev>
-                explanation:
-                    The standard deviation of work load over all domains.
-
-            domain_data.dat:
-                contains two larger sets of data:
-                    <rank> <x_min> <x_max> <y_min> <y_max> <z_min> <z_max> <W>
-                explanation:
-                    There are two blocks of rows in this format: the first block is the starting configuration
-                    of the domains, which should be a uniform grid of orthogonal domains. The second block is the
-                    configuration after 200 load balancing steps. In each line the MPI rank of the domain and the
-                    extension in each cartesian direction is given, as well as the work load (the number of points).
-
-        If the library is compiled with VTK support, ALL_test also creates a VTK based output of the domains in each
-        iteration step. In order to work, a directory named "vtk_outline" needs to be created in the directory where
-        the executable is located. The resulting output can be visualized with tools like ParaView or VisIt.
-
-   ALL_test_f: 
-
-        The Fortran example provides a more basic version of the test program ALL_test, as its main goal is to show
-        the functionality of the Fortran interface. The code creates a basic uniform orthogonal domain decomposition
-        and creates an inhomogeneous particle distribution over these domains. Only one load balancing step is executed and
-        the program prints out the domain distribution of the start configuration and of the final configuration.
+# A Load Balancing Library (ALL)
+
+The library aims to provide an easy way to include dynamic domain-based
+load balancing into particle based simulation codes. The library is
+developed in the Simulation Laboratory Molecular Systems of the Jülich
+Supercomputing Centre at Forschungszentrum Jülich. 
+
+Only a brief summary is given here; more information, such as a
+detailed API description, examples, and further details on the load
+balancing methods, can be found in the
+[official documentation](docs/index.rst).
+
+## Installation and Requirements
+
+### Requirements
+Base requirements:
+
+ - C++11 capable compiler
+ - MPI support
+ - CMake v. 3.14 or higher
+
+Optional requirements:
+
+ - Fortran 2003 capable compiler (Fortran interface)
+ - Fortran 2008 capable compiler (Usage of the `mpi_f08` interface)
+ - VTK 7.1 or higher (Domain output)
+ - Boost testing utilities
+ - Doxygen and Sphinx with `breathe` (Documentation)
+
+### Installation
+
+ 1. Clone the library from
+    `https://gitlab.version.fz-juelich.de/SLMS/loadbalancing` into
+    `$ALL_ROOT_DIR`.
+ 2. Create the build directory `$ALL_BUILD_DIR` some place else.
+ 3. Call `cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"` to set up the
+    installation. To use a specific compiler and Boost installation,
+    use `CC=gcc CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]`.
+ 4. To build and install the library, run `cmake --build
+    "$ALL_BUILD_DIR"` and then `cmake --install "$ALL_BUILD_DIR"`.
+    Afterwards, the built examples and library files are placed in
+    `$ALL_INSTALL_DIR`.
+
+<!-- vim: set tw=72 et ts=4 spell spelllang=en_us ft=markdown: -->
diff --git a/README_new.md b/README_new.md
deleted file mode 100644
index c634055fa0b33e4ee2dc38a4fc6c8132fc3d866f..0000000000000000000000000000000000000000
--- a/README_new.md
+++ /dev/null
@@ -1,422 +0,0 @@
-# A Load Balancing Library (ALL)
-
-The library aims to provide an easy way to include dynamic domain-based
-load balancing into particle based simulation codes. The library is
-developed in the Simulation Laboratory Molecular Systems of the Jülich
-Supercomputing Centre at Forschungszentrum Jülich. 
-
-It includes several load balancing schemes. The following list gives an
-overview of these schemes with short explanations.
-
- - **Tensor product**: The work on all processes is reduced over the
-   cartesian planes in the system. This work is then equalized by
-   adjusting the borders of the cartesian planes.
- - **Staggered grid**: A 3-step hierarchical approach is applied, where:
-   1. work over the cartesian planes is reduced, before the borders of
-      these planes are adjusted
-   2. in each of the cartesian planes the work is reduced for each
-      cartesian column. These columns are then adjusted to each other to
-      homogenize the work in each column
-   3. the work between neighboring domains in each column is adjusted.
-      Each adjustment is done locally with the neighboring planes,
-      columns or domains by adjusting the adjacent boundaries.
- - **Topological mesh** (WIP): In contrast to the previous methods this
-   method adjusts domains not by moving boundaries but vertices, i.e.
-   corner points, of domains. For each vertex a force, based on the
-   differences in work of the neighboring domains, is computed and the
-   vertex is shifted in a way to equalize the work between these
-   neighboring domains.
- - **Voronoi mesh** (WIP): Similar to the topological mesh method, this
-   method computes a force, based on work differences. In contrast to
-   the topological mesh method, the force acts on a Voronoi point rather
-   than a vertex, i.e. a point defining a Voronoi cell, which describes
-   the domain. Consequently, the number of neighbors is not a conserved
-   quantity, i.e. the topology may change over time. ALL uses the Voro++
-   library published by the Lawrence Berkeley National Laboratory for the
-   generation of the Voronoi mesh.
- - **Histogram based staggered grid**: Resulting in the same grid as
-   the staggered grid scheme, this scheme uses the cumulative work
-   function in each of the three cartesian directions in order to
-   generate this grid. Using histograms and the previously defined
-   distribution of process domains in a cartesian grid, this scheme
-   generates in three steps a staggered-grid result, in which the work
-   is distributed as evenly as the resolution of the underlying
-   histogram allows. In contrast to the previously mentioned schemes
-   this scheme depends on a global exchange of work between processes.
- - **Orthogonal recursive bisection** (Planned): Comparable to the
-   histogram based staggered grid scheme, this scheme uses cumulative
-   work functions evaluated by histograms in order to find a new
-   distribution of workload in a hierarchical manner. In contrast to the
-   histogram based staggered grid scheme, the subdivision of domains is
-   not based on a cartesian division of processes, but on the prime
-   factorization of the total number of processes. During each step, the
-   current slice of the system is distributed into smaller sub-slices
-   along the longest edge that roughly contain the same workload. The
-   number of sub-slices is then determined by the corresponding prime
-   number of that step, i.e. when trying to establish a distribution for
-   24 (3 * 2 * 2 * 2) processes, the bisection would use four steps. In
-   the first step the system would be subdivided into 3 sub domains
-   along the largest edge, each containing roughly the same workload.
-   Each of these sub domains then is divided into two sub sub domains
-   independently. This procedure is then repeated for all the prime
-   numbers in the prime factorization.
-
-
-## Installation and Requirements
-
-### Requirements
-Base requirements:
-
- - C++11 capable compiler
- - MPI support
- - CMake v. 3.14 or higher
-
-Optional requirements:
-
- - Fortran 2003 capable compiler (Fortran interface)
- - Fortran 2008 capable compiler (Usage of the `mpi_f08` interface)
- - VTK 7.1 or higher (Domain output)
- - Boost testing utilities
- - Doxygen and Sphinx with `breathe` (Documentation)
-
-### Installation
-
- 1. Clone the library from
-    `https://gitlab.version.fz-juelich.de/SLMS/loadbalancing` into
-    `$ALL_ROOT_DIR`.
- 2. Create the build directory `$ALL_BUILD_DIR` some place else.
- 3. Call `cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"` to set up the
-    installation. Compile time features can be selected from:
-     - `-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR` (default: system
-       dependent): Install into `$ALL_INSTALL_DIR` after compiling.
-     - `-DCM_ALL_VTK_OUTPUT=ON` (default: `OFF`): Enable VTK output
-       of domains. Requires VTK 7.1 or higher.
-     - `-DCM_ALL_FORTRAN=ON` (default: `OFF`): Enable Fortran interface.
-       Requires Fortran 2003 capable compiler.
-     - `-DCM_ALL_USE_F08=ON` (default: `OFF`): Enable usage of `mpi_f08`
-       module for MPI. Requires Fortran 2008 capable compiler and
-       compatible MPI installation.
-     - `-DCM_ALL_FORTRAN_ERROR_ABORT=ON` (default: `OFF`): Abort
-       execution on any error when using the Fortran interface instead
-       of setting the error number and leaving error handling to the
-       user.
-     - `-DCM_ALL_VORONOI=ON` (default: `OFF`): Enable Voronoi mesh
-       method and subsequently compilation and linkage of Voro++.
-     - `-DCM_ALL_TESTING=ON` (default: `OFF`): Turns on unit and feature
-       tests. Can be run from the build directory with `ctest`. Requires
-       the Boost test utilities.
-     - `-DCMAKE_BUILD_TYPE=Debug` (default: `Release`): Enable library
-       internal debugging features. Using `DebugWithOpt` also turns on
-       some optimizations.
-     - `-DCM_ALL_DEBUG=ON` (default: `OFF`): Enable self consistency
-       checks within the library itself.
-     - `-DVTK_DIR=$VTK_DIR` (default: system dependent): Path to an
-       explicit VTK installation. Make sure that VTK is compiled with
-       the same MPI as the library and subsequently your code.
-    To use a specific compiler and Boost installation use: `CC=gcc
-    CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]`.
- 4. To build and install the library, run `cmake --build
-    "$ALL_BUILD_DIR"` and then `cmake --install "$ALL_BUILD_DIR"`.
-    Afterwards, the built examples and library files are placed in
-    `$ALL_INSTALL_DIR`.
-
-
-## Usage
-The library is presented as a single load balancing object `ALL::ALL`
-from `ALL.hpp`. All classes are encapsulated in the `ALL` name space.
-Simple (and not so simple) examples are available in the `/examples`
-subdirectory. The simple examples are also documented at length.
-
-Errors are treated as exceptions thrown as a (sub)class of
-`ALL::CustomException`. Likely candidates that may throw are
-`balance`, `setup`, and `printVTKoutlines`.
-
-### ALL object
-The ALL object can be constructed with
-
-    ALL::ALL<T,W> (
-                   const ALL::LB_t method,
-                   const int dimension,
-                   const T gamma)
-
-Here `T` is the type of the boundaries and `W` the type of the work,
-both usually `float` or `double`. The load balancing
-method must be one of `ALL::LB_t::TENSOR`, `...STAGGERED`,
-`...FORCEBASED`, `...VORONOI`, `...HISTOGRAM`; the Voronoi
-method must be enabled at compile time. There is also a second form
-where the initial domain vertices are already provided:
-
-    ALL::ALL<T,W> (
-                   const ALL::LB_t method,
-                   const int dimension,
-                   const std::vector<ALL::Point<T>> &initialDomain,
-                   const T gamma)
-
-### Point object
-The domain boundaries are given and retrieved as a vector of
-`ALL::Point`s. These can be constructed via
-
-    ALL::Point<T> ( const int dimension )
-    ALL::Point<T> ( const int dimension, const T *values )
-    ALL::Point<T> ( const std::vector<T> &values )
-
-and its elements can be accessed through the `[]` operator like a `std::vector`.
-
-### Set up
-A few additional parameters must be set before the domains can be
-balanced. Which ones need to be set depends on the method used,
-but in general the following are necessary:
-
-    void ALL::ALL<T,W>::setVertices ( std::vector<ALL::Point<T>> &vertices )
-    void ALL::ALL<T,W>::setCommunicator ( MPI_Comm comm )
-    void ALL::ALL<T,W>::setWork ( const W work )
-
-If the given communicator is not a cartesian communicator the process
-grid parameters must also be set beforehand (!) using
-
-    void ALL::ALL<T,W>::setProcGridParams (
-            const std::vector<int> &location,
-            const std::vector<int> &size)
-
-with the location of the current process and number of processes in each
-direction.
-
-To better trace the current domain, an integer tag can be provided, which is
-used in the domain output, with
-
-    void ALL::ALL<T,W>::setProcTag ( int tag )
-
-and an observed minimal domain size in each direction can be set using
-
-    void ALL::ALL<T,W>::setMinDomainSize (
-            const std::vector<T> &minSize )
- 
-Once these parameters are set call
-
-    void ALL::ALL<T,W>::setup()
-
-so internal set up routines are run.
-
-### Balancing
-To create the new domain vertices make sure the current vertices are set
-using `setVertices` and the work is set using `setWork` then call
-
-    void ALL::ALL<T,W>::balance()
-
-and then the new vertices can be retrieved by using
-
-    std::vector<ALL::Point<T>> &ALL::ALL<T,W>::getVertices ()
-
-or new neighbors with
-
-    std::vector<int> &ALL::ALL<T,W>::getNeighbors ()
-
-### Special considerations for the Fortran interface
-The Fortran interface exists in two versions. Either in the form of
-
-    ALL_balance(all_object)
-
-where the all object of type `type(ALL_t)` is given as first argument to
-each function and as more object oriented functions to the object
-itself. So the alternative call would be
-
-    all_object%balance()
-
-The function names themselves are similar, i.e. instead of camel case
-they are all lowercase except the first `ALL_` with underscores between
-words. So `setProcGridParams` becomes `ALL_set_proc_grid_parms`. For
-more usage info please check the Fortran example programs.
-
-One important difference is the handling of MPI types, which change
-between the MPI Fortran 2008 interface as given by the `mpi_f08` module
-and previous interfaces. In previous interfaces all MPI types are
-`INTEGER`s, but using the `mpi_f08` module the communicator is of
-`type(MPI_Comm)`. To allow using `ALL_set_communicator` directly with
-the communicator used in the user's application, build the library with
-enabled Fortran 2008 features and this communicator type is expected.
-
-Retrieving information from the balancer is also different, since most
-getter return (a reference to) an object itself. The Fortran subroutines
-set the values of its arguments. As an example
-
-    int work = all.getWork();
-
-becomes
-
-    integer(c_int) :: work
-    call ALL_get_work(work) !or
-    !call all%get_work(work)
-
-Since there is no exception handling in Fortran, the error handling from
-the libc is borrowed. So there are additional function that retrieve and
-reset an error number as well as an additional error message describing
-the error. These must be queried and handled by the user. An example of
-this usage is shown in the `ALL_Staggered_f` example when printing the
-VTK outlines.
-
-### Details of the balancing methods
-#### Tensor based
-In order to equalize the load of individual domains, the assumption is
-made that this can be achieved by equalizing the work in each cartesian
-direction, i.e. the work of all domains having the same coordinate in a
-cartesian direction is collected and the width of all these domains in
-this direction is adjusted by comparing this collective work with the
-collective work of the neighboring domains. This is done independently
-for each cartesian direction in the system.
-
-Required number of vertices:
- - two, one describing the lower left front point and one describing the
-   upper right back point of the domain
-
-Advantages:
- - topology of the system is maintained (orthogonal domains, neighbor
-   relations)
- - no update for the neighbor relations is required in the calling
-   code
- - if a code was able to deal with orthogonal domains, only small
-   changes are expected to include this strategy
-
-Disadvantages:
- - due to the comparison of collective work loads in cartesian layers
-   and restrictions resulting from this construction, the final result
-   might lead to a sub-optimal domain distribution
-
-#### Staggered grid
-The staggered grid approach is a hierarchical one. In a first step the
-work of all domains sharing the same cartesian coordinate with respect
-to the highest dimension (z in three dimensions) is collected. Then,
-like in the TENSOR strategy the layer width in this dimension is
-adjusted based on comparison of the collected work with the collected
-work of the neighboring domains in the same dimension. As a second step
-each of these planes is divided into a set of columns, where all domains
-share the same cartesian coordinate in the next lower dimension (y in
-three dimensions). For each of these columns the before described
-procedure is repeated, i.e. work collected and the width of the columns
-adjusted accordingly. Finally, in the last step the work of individual
-domains is compared to direct neighbors and the width in the last
-dimension (x in three dimensions) adjusted. This leads to a staggered
-grid, that is much better suited to describe inhomogeneous work
-distributions than the TENSOR strategy.
-
-Required number of vertices:
- - two, one describing the lower left front point and one describing the
-   upper right back point of the domain
-
-Advantages:
- - very good equalization results for the work load
- - maintains orthogonal domains
-
-Disadvantages:
- - changes topology of the domains and requires adjustment of neighbor
-   relations
- - communication pattern in the calling code might require adjustment to
-   deal with changing neighbors
-
-#### Histogram based staggered grid
-
-The histogram-based method works differently than the first two methods
-in such a way, that it is a global method that requires three distinct
-steps for a single adjustment. In each of these steps the following
-takes place: a partial histogram needs to be created over the workload,
-e.g. number of particles, along one direction on each domain, then these
-are supplied to the method, where a global histogram is computed. With
-this global histogram a distribution function is created. This is used
-to compute the optimal (possible) width of domains in that direction.
-For the second and third steps the computation of the global histograms
-and distribution functions take place in subsystems, being the results
-of the previous step.  The result is the most optimal distribution of
-domains according to the Staggered-grid method, at the cost of global
-exchange of work, due to the global adjustment, which makes the method
-not well suited to highly dynamic systems, due to the need of frequent
-updates. On the other hand the method is well suited for static
-problems, e.g. grid-based simulations.
-
-Note: Currently the order of dimensions is: z-y-x.
-
-Required number of vertices:
- - two, one describing the lower left front point and one describing the
-   upper right back point of the domain
-
-Additional requirements:
- - partial histogram created over the workload on the local domain in
-   the direction of the current correction step
-
-Advantages:
- - supplies an optimal distribution of domains (restricted by width of
-   bins used for the histogram)
- - only three steps needed to acquire result
-
-Disadvantages:
- - expensive in cost of communication, due to global shifts
- - requires preparation of histogram and shift of work between each of
-   the three correction steps
-
-
-## Example programs
-In the `/examples` sub directory several C++ and Fortran example
-programs are placed.
-
-### `ALL_test`
-MPI C++ code that generates a particle distribution over the domains. At
-program start the domains form an orthogonal decomposition of the cubic
-3d system. Each domain has the same volume as each other domain.
-Depending on the cartesian coordinates of the domain in this
-decomposition, a number of points is created on each domain. The points
-are then distributed uniformly over the domain. For the number of points
-a polynomial formula is used to create a non-uniform point distribution.
-As an estimation of the work load in this program the number of points
-within a domain was chosen. There is no particle movement included in
-the current version of the example code, therefore particle
-communication only takes place due to the shift of domain borders.
-
-The program creates three output files in its basic version:
-
- - `minmax.dat`:
-    - **Columns**: `<iteration count> <W_min/W_avg> <W_max_W_avg>
-      <(W_max-W_min)/(W_max+W_min)>`
-    - **Explanation**: In order to try to give a comparable metric for
-      the success of the load balancing procedure the relatative
-      difference of the minimum and maximum work loads in the system in
-      relation to the average work load of the system are given. The
-      last column gives an indicator for the inequality of work in the
-      system, in a perfectly balanced system, this value should be equal
-      to zero.
-    
- - `stddev.txt`:
-    - **Columns**: `<iteration count> <std_dev>`
-    - **Explanation**: The standard deviation of work load over all
-      domains.
-
- - `domain_data.dat`:
-    - **Columns**: `<rank> <x_min> <x_max> <y_min> <y_max> <z_min>
-      <z_max> <W>`
-    - **Explanation**: There are two blocks of rows, looking like this,
-      the first block is the starting configuration of the domains,
-      which should be a uniform grid of orthogonal domains. The second
-      block is the configuration after 200 load balancing steps. In each
-      line the MPI rank of the domain and the extension in each
-      cartesian direction is given, as well as the work load (the number
-      of points).
-
-If the library is compiled with VTK support, `ALL_test` also creates a VTK
-based output of the domains in each iteration step. This output will be
-placed in a `vtk_outline` sub directory, which is created if it does not
-already exist. The resulting output can be visualized with tools like
-ParaView or VisIt.
-
-### `ALL_test_f` and `ALL_test_f_obj`
-The Fortran example provides a more basic version of the test program
-`ALL_test`, as its main goal is to show the functionality of the Fortran
-interface. The code creates a basic uniform orthogonal domain
-decomposition and creates an inhomogeneous particle distribution over
-these. Only one load balancing step is executed and the program prints
-out the domain distribution of the start configuration and of the final
-configuration.
-
-### `ALL_Staggered` and `ALL_Staggered_f`
-These create a very simple load balancing situation with fixed load and
-domains placed along the z direction. The C++ and Fortran versions are
-functionally the same, so the differences between the interfaces can be
-checked.
-
-<!-- vim: set tw=72 et ts=4 spell spelllang=en_us ft=markdown: -->
diff --git a/docs/Examples.rst b/docs/Examples.rst
new file mode 100644
index 0000000000000000000000000000000000000000..84c5e396cdf83e7bd40c866dc433ff1466e5d4a9
--- /dev/null
+++ b/docs/Examples.rst
@@ -0,0 +1,76 @@
+.. _examples:
+
+Examples
+========
+Several C++ and Fortran example programs are provided in the
+``/examples`` sub directory.
+
+``ALL_test``
+------------
+MPI C++ code that generates a particle distribution over the domains. At
+program start the domains form an orthogonal decomposition of the cubic
+3d system, with all domains having the same volume. Depending on the
+cartesian coordinates of a domain in this decomposition, a number of
+points is created on it; a polynomial formula is used for the number of
+points, so that the resulting point distribution is non-uniform. Within
+each domain the points are distributed uniformly. The number of points
+within a domain serves as the estimate of its work load. There is no
+particle movement in the current version of the example code; particle
+communication therefore only takes place due to the shift of domain
+borders.
+
+The program creates three output files in its basic version:
+
+``minmax.dat``
+  Columns
+    ``<iteration count> <W_min/W_avg> <W_max/W_avg>
+    <(W_max-W_min)/(W_max+W_min)>``
+  Explanation
+    To give a comparable metric for the success of the load balancing
+    procedure, the minimum and maximum work loads in the system are given
+    relative to the average work load of the system. The last column is
+    an indicator for the inequality of work in the system; in a perfectly
+    balanced system this value should be equal to zero.
+    
+``stddev.txt``
+  Columns
+    ``<iteration count> <std_dev>``
+  Explanation
+    The standard deviation of work load over all domains.
+
+``domain_data.dat``
+  Columns
+    ``<rank> <x_min> <x_max> <y_min> <y_max> <z_min> <z_max> <W>``
+  Explanation
+    There are two blocks of rows of this form: the first block is the
+    starting configuration of the domains, which should be a uniform
+    grid of orthogonal domains; the second block is the configuration
+    after 200 load balancing steps. Each line gives the MPI rank of the
+    domain, its extent in each cartesian direction, and its work load
+    (the number of points).
+
+If the library is compiled with VTK support, ``ALL_test`` also creates a
+VTK based output of the domains in each iteration step. This output will
+be placed in a ``vtk_outline`` sub directory, which is created if it does
+not already exist. The resulting output can be visualized with tools like
+ParaView or VisIt.
+
+``ALL_test_f`` and ``ALL_test_f_obj``
+-------------------------------------
+The Fortran example is a more basic version of the test program
+``ALL_test``, as its main goal is to show the functionality of the
+Fortran interface. The code creates a basic uniform orthogonal domain
+decomposition and an inhomogeneous particle distribution over it. Only
+one load balancing step is executed, and the program prints out the
+domain distribution of the start configuration and of the final
+configuration.
+
+``ALL_Staggered`` and ``ALL_Staggered_f``
+-----------------------------------------
+These create a very simple load balancing situation with fixed load and
+domains placed along the z direction. The C++ and Fortran versions are
+functionally the same, so the differences between the interfaces can be
+checked.
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/Install.rst b/docs/Install.rst
new file mode 100644
index 0000000000000000000000000000000000000000..af003718a2b9d4637decfc4d3c56918842291cff
--- /dev/null
+++ b/docs/Install.rst
@@ -0,0 +1,65 @@
+.. _install:
+
+Installation
+============
+
+Requirements
+------------
+
+Base requirements
+ - C++11 capable compiler
+ - MPI support
+ - CMake v.3.14 or higher
+
+Optional requirements
+ - Fortran 2003 capable compiler (Fortran interface)
+ - Fortran 2008 capable compiler (Usage of the ``mpi_f08`` interface)
+ - VTK 7.1 or higher (Domain output)
+ - Boost testing utilities
+ - Doxygen and Sphinx with ``breathe`` (Documentation)
+
+Installation
+------------
+
+1. Clone the library from
+   ``https://gitlab.version.fz-juelich.de/SLMS/loadbalancing`` into
+   ``$ALL_ROOT_DIR``.
+2. Create the build directory ``$ALL_BUILD_DIR`` some place else.
+3. Call ``cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"`` to set up the
+   installation. Other compile time features should be set here. To use a
+   specific compiler and Boost installation use: ``CC=gcc CXX=g++
+   BOOST_ROOT=$BOOST_DIR cmake [...]``.
+4. To build and install the library then run: ``cmake --build
+   "$ALL_BUILD_DIR"`` and ``cmake --install "$ALL_BUILD_DIR"``. Afterwards,
+   the built examples and library files are placed in ``$ALL_INSTALL_DIR``.
+
+Compile time features
+*********************
+``-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR`` (default: system dependent)
+  Install into ``$ALL_INSTALL_DIR`` after compiling.
+``-DCM_ALL_VTK_OUTPUT=ON`` (default: ``OFF``)
+  Enable VTK output of domains. Requires VTK 7.1 or higher.
+``-DCM_ALL_FORTRAN=ON`` (default: ``OFF``)
+  Enable Fortran interface.  Requires Fortran 2003 capable compiler.
+``-DCM_ALL_USE_F08=ON`` (default: ``OFF``)
+  Enable usage of ``mpi_f08`` module for MPI. Requires Fortran 2008
+  capable compiler and compatible MPI installation.
+``-DCM_ALL_FORTRAN_ERROR_ABORT=ON`` (default: ``OFF``)
+  Abort execution on any error when using the Fortran interface instead of
+  setting the error number and leaving error handling to the user.
+``-DCM_ALL_VORONOI=ON`` (default: ``OFF``)
+  Enable the Voronoi mesh method, and with it the compilation and
+  linkage of Voro++.
+``-DCM_ALL_TESTING=ON`` (default: ``OFF``)
+  Turns on unit and feature tests. Can be run from the build directory
+  with ``ctest``. Requires the Boost test utilities.
+``-DCMAKE_BUILD_TYPE=Debug`` (default: ``Release``)
+  Enable library internal debugging features. Using ``DebugWithOpt`` also
+  turns on some optimizations.
+``-DCM_ALL_DEBUG=ON`` (default: ``OFF``)
+  Enable self consistency checks within the library itself.
+``-DVTK_DIR=$VTK_DIR`` (default: system dependent)
+  Path to an explicit VTK installation. Make sure that VTK is compiled
+  with the same MPI as the library and subsequently your code.
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/Usage.rst b/docs/Usage.rst
new file mode 100644
index 0000000000000000000000000000000000000000..40fbae98b34e3addd7d42db0723333b17ef9c952
--- /dev/null
+++ b/docs/Usage.rst
@@ -0,0 +1,174 @@
+.. _usage:
+
+Usage
+=====
+
+The library is presented as a single load balancing object ``ALL::ALL``
+from ``ALL.hpp``. All classes are encapsulated in the ``ALL`` namespace.
+Simple (and not so simple) examples are available in the ``/examples``
+sub directory. The simple examples are also documented at length.
+
+Errors are reported as exceptions, all of which are (sub)classes of
+``ALL::CustomException``. Likely candidates to throw exceptions are
+``balance``, ``setup`` and ``printVTKoutlines``.
+
+ALL object
+----------
+The ALL object can be constructed with
+
+.. code-block:: c++
+
+    ALL::ALL<T,W> (
+                   const ALL::LB_t method,
+                   const int dimension,
+                   const T gamma)
+
+Here the type of the boundaries is given by ``T`` and the type of the
+work by ``W``; both are usually ``float`` or ``double``. The load
+balancing method must be one of ``ALL::LB_t::TENSOR``, ``...STAGGERED``,
+``...FORCEBASED``, ``...VORONOI`` or ``...HISTOGRAM``, where the Voronoi
+method must be enabled at compile time. There is also a second form
+where the initial domain vertices are already provided:
+
+.. code-block:: c++
+
+    ALL::ALL<T,W> (
+                   const ALL::LB_t method,
+                   const int dimension,
+                   const std::vector<ALL::Point<T>> &initialDomain,
+                   const T gamma)
+
+Point object
+------------
+The domain boundaries are given and retrieved as a vector of
+``ALL::Point`` objects. These can be constructed via
+
+.. code-block:: c++
+
+    ALL::Point<T> ( const int dimension )
+    ALL::Point<T> ( const int dimension, const T *values )
+    ALL::Point<T> ( const std::vector<T> &values )
+
+and their elements can be accessed through the ``[]`` operator, like a
+``std::vector``.
+
+Set up
+------
+A few additional parameters must be set before the domains can be
+balanced. Exactly which ones depends on the method used, but in general
+the following are necessary:
+
+.. code-block:: c++
+
+    void ALL::ALL<T,W>::setVertices ( std::vector<ALL::Point<T>> &vertices )
+    void ALL::ALL<T,W>::setCommunicator ( MPI_Comm comm )
+    void ALL::ALL<T,W>::setWork ( const W work )
+
+If the given communicator is not a cartesian communicator, the process
+grid parameters must also be set beforehand using
+
+.. code-block:: c++
+
+    void ALL::ALL<T,W>::setProcGridParams (
+            const std::vector<int> &location,
+            const std::vector<int> &size)
+
+with the location of the current process and number of processes in each
+direction.
+
+To make the current domain easier to trace, an integer tag can be
+provided, which is used in the domain output:
+
+.. code-block:: c++
+
+    void ALL::ALL<T,W>::setProcTag ( int tag )
+
+and an observed minimal domain size in each direction can be set using
+
+.. code-block:: c++
+
+    void ALL::ALL<T,W>::setMinDomainSize (
+            const std::vector<T> &minSize )
+
+Once these parameters are set, call
+
+.. code-block:: c++
+
+    void ALL::ALL<T,W>::setup()
+
+so that the internal set up routines are run.
+
+Balancing
+---------
+To create the new domain vertices, make sure the current vertices are
+set using ``setVertices`` and the work is set using ``setWork``, then
+call
+
+.. code-block:: c++
+
+    void ALL::ALL<T,W>::balance()
+
+and then the new vertices can be retrieved by using
+
+.. code-block:: c++
+
+    std::vector<ALL::Point<T>> &ALL::ALL<T,W>::getVertices ()
+
+or new neighbors with
+
+.. code-block:: c++
+
+    std::vector<int> &ALL::ALL<T,W>::getNeighbors ()
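
Putting the set up and balancing calls together, a single load balancing step might look like the following sketch. This is illustrative only: it assumes a 3-dimensional run, a cartesian communicator, the ``TENSOR`` method, ``T = W = double`` and an example value for ``gamma``; error handling via ``ALL::CustomException`` is omitted.

```cpp
#include <ALL.hpp>
#include <mpi.h>
#include <vector>

// Sketch of one balancing step (illustrative only).
std::vector<ALL::Point<double>>
balanceOnce(MPI_Comm cartComm,
            std::vector<ALL::Point<double>> vertices,
            double localWork) {
  // 4.0 is just an example value for gamma.
  ALL::ALL<double, double> balancer(ALL::LB_t::TENSOR, 3, 4.0);
  balancer.setCommunicator(cartComm); // cartesian: no setProcGridParams
  balancer.setVertices(vertices);
  balancer.setWork(localWork);
  balancer.setup();
  balancer.balance();
  return balancer.getVertices();      // the new domain boundaries
}
```

In a real code, the returned vertices replace the current ones and the particles crossing the shifted borders are exchanged with the neighbors obtained from ``getNeighbors``.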
+
+Special considerations for the Fortran interface
+------------------------------------------------
+The Fortran interface exists in two versions: a procedural one, in the
+form of
+
+.. code-block:: fortran
+
+    ALL_balance(all_object)
+
+where the ALL object of type ``type(ALL_t)`` is given as the first
+argument to each function, and an object-oriented one with procedures
+bound to the object itself. In that case the equivalent call would be
+
+.. code-block:: fortran
+
+    all_object%balance()
+
+The function names themselves are similar: instead of camel case they
+are all lowercase, except for the leading ``ALL_``, with underscores
+between words. So ``setProcGridParams`` becomes
+``ALL_set_proc_grid_parms``. For more usage information please check the
+Fortran example programs.
+
+One important difference is the handling of MPI types, which changes
+between the MPI Fortran 2008 interface as given by the ``mpi_f08``
+module and previous interfaces. In previous interfaces all MPI types are
+``INTEGER`` s, but with the ``mpi_f08`` module the communicator is of
+``type(MPI_Comm)``. To allow using ``ALL_set_communicator`` directly
+with the communicator used in the user's application, build the library
+with the Fortran 2008 features enabled; that communicator type is then
+expected.
+
+Retrieving information from the balancer is also different, since most
+getters return (a reference to) an object, while the Fortran subroutines
+set the values of their arguments. As an example,
+
+.. code-block:: c++
+
+    int work = all.getWork();
+
+becomes
+
+.. code-block:: fortran
+
+    integer(c_int) :: work
+    call ALL_get_work(work) !or
+    !call all%get_work(work)
+
+Since there is no exception handling in Fortran, the error handling
+scheme of the libc is borrowed: there are additional functions that
+retrieve and reset an error number, as well as an error message
+describing the error. These must be queried and handled by the user. An
+example of this usage is shown in the ``ALL_Staggered_f`` example when
+printing the VTK outlines.
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/conf.py.in b/docs/conf.py.in
index b537c02233d548f85b92e884206ace2144a89be0..98baf97df9e654d8af32f0432aeefe9be688f2d5 100644
--- a/docs/conf.py.in
+++ b/docs/conf.py.in
@@ -13,7 +13,7 @@
 # import os
 # import sys
 # sys.path.insert(0, os.path.abspath('.'))
-from recommonmark.parser import CommonMarkParser
+#from recommonmark.parser import CommonMarkParser
 
 
 # -- Project information -----------------------------------------------------
diff --git a/docs/index.rst b/docs/index.rst
index 63436a5a5e3fdd5333bd90203228af252b5c829c..da41150a5f47aadb57e2bdd6ef582eb91bd22984 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -11,9 +11,15 @@ Welcome to the documentation of ALL
   :caption: Contents:
   :includehidden:
 
+  Install.rst
+  Usage.rst
+  Examples.rst
   api/index.rst
+  methods/index.rst
 
-todo(s.schulz): Link to doxygen documentation.
+
+The full (developer) documentation of the source code as provided by
+doxygen can be found `here <../doxygen/index.html>`_.
 
 Indices and tables
 ==================
diff --git a/docs/methods/Histogram.rst b/docs/methods/Histogram.rst
new file mode 100644
index 0000000000000000000000000000000000000000..9616c6fa3672fc093cd8f3079905f024e9cffb08
--- /dev/null
+++ b/docs/methods/Histogram.rst
@@ -0,0 +1,43 @@
+.. _histogramDetails:
+
+Histogram method
+================
+
+The histogram-based method differs from the first two methods in that
+it is a global method which requires three distinct steps for a single
+adjustment. In each of these steps a partial histogram of the workload,
+e.g. the number of particles, is created along one direction on each
+domain. These partial histograms are supplied to the method, which
+computes a global histogram from them. From this global histogram a
+distribution function is created, which is used to compute the optimal
+(possible) width of the domains in that direction. In the second and
+third steps the global histograms and distribution functions are
+computed within the subsystems resulting from the previous step. The
+result is the optimal distribution of domains achievable with the
+Staggered-grid method, at the cost of a global exchange of work due to
+the global adjustment. This makes the method not well suited to highly
+dynamic systems, which would need frequent updates; on the other hand,
+the method is well suited for static problems, e.g. grid-based
+simulations.
+
+Note: Currently the order of dimensions is: z-y-x.
+
+Required number of vertices
+ - two, one describing the lower left front point and one describing the
+   upper right back point of the domain
+
+Additional requirements
+ - partial histogram created over the workload on the local domain in
+   the direction of the current correction step
+
+Advantages
+ - supplies an optimal distribution of domains (restricted by width of
+   bins used for the histogram)
+ - only three steps needed to acquire result
+
+Disadvantages
+ - expensive in cost of communication, due to global shifts
+ - requires preparation of histogram and shift of work between each of
+   the three correction steps
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/methods/Staggered.rst b/docs/methods/Staggered.rst
new file mode 100644
index 0000000000000000000000000000000000000000..69b7968f53f5bd79bc8fdc89cce13db4112eee95
--- /dev/null
+++ b/docs/methods/Staggered.rst
@@ -0,0 +1,36 @@
+.. _staggeredDetails:
+
+Staggered method
+================
+
+The staggered grid approach is hierarchical. In a first step the work
+of all domains sharing the same cartesian coordinate with respect to the
+highest dimension (z in three dimensions) is collected. Then, as in the
+TENSOR strategy, the layer width in this dimension is adjusted based on
+a comparison of the collected work with the collected work of the
+neighboring domains in the same dimension. In a second step each of
+these planes is divided into a set of columns, in which all domains
+share the same cartesian coordinate in the next lower dimension (y in
+three dimensions). For each of these columns the procedure described
+above is repeated, i.e. the work is collected and the width of the
+columns adjusted accordingly. Finally, in the last step, the work of
+individual domains is compared to that of their direct neighbors and the
+width in the last dimension (x in three dimensions) adjusted. This leads
+to a staggered grid that is much better suited to describing
+inhomogeneous work distributions than the TENSOR strategy.
+
+Required number of vertices
+ - two, one describing the lower left front point and one describing the
+   upper right back point of the domain
+
+Advantages
+ - very good equalization results for the work load
+ - maintains orthogonal domains
+
+Disadvantages
+ - changes topology of the domains and requires adjustment of neighbor
+   relations
+ - communication pattern in the calling code might require adjustment to
+   deal with changing neighbors
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/methods/Tensor.rst b/docs/methods/Tensor.rst
new file mode 100644
index 0000000000000000000000000000000000000000..25bee32a7a7c7dc0536f6d28740ab6807017376d
--- /dev/null
+++ b/docs/methods/Tensor.rst
@@ -0,0 +1,31 @@
+.. _tensorDetails:
+
+Tensor method
+=============
+
+In order to equalize the load of the individual domains, the assumption
+is made that this can be achieved by equalizing the work in each
+cartesian direction: the work of all domains having the same coordinate
+in a cartesian direction is collected, and the width of all these
+domains in this direction is adjusted by comparing this collective work
+with the collective work of the neighboring domains. This is done
+independently for each cartesian direction in the system.
+
+Required number of vertices
+ - two, one describing the lower left front point and one describing the
+   upper right back point of the domain
+
+Advantages
+ - topology of the system is maintained (orthogonal domains, neighbor
+   relations)
+ - no update for the neighbor relations is required in the calling
+   code
+ - if a code is already able to deal with orthogonal domains, only
+   small changes are expected to be needed to include this strategy
+
+Disadvantages
+ - due to the comparison of collective work loads in cartesian layers
+   and the restrictions resulting from this construction, the final
+   result might be a sub-optimal domain distribution
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/methods/index.rst b/docs/methods/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..098bafab1893bb49794a7f0bcc88b15416f940fd
--- /dev/null
+++ b/docs/methods/index.rst
@@ -0,0 +1,74 @@
+The Load Balancing Methods
+==========================
+
+.. toctree::
+  :hidden:
+  
+  Tensor.rst
+  Staggered.rst
+  Histogram.rst
+
+A short overview of the methods is given below; more details can be
+found on their respective pages.
+
+:ref:`Tensor product<tensorDetails>`
+  The work on all processes is reduced over the cartesian planes in the
+  system. This work is then equalized by adjusting the borders of the
+  cartesian planes.
+
+:ref:`Staggered grid<staggeredDetails>`
+  A 3-step hierarchical approach is applied, where:
+
+  1. work over the cartesian planes is reduced, before the borders of
+     these planes are adjusted
+  2. in each of the cartesian planes the work is reduced for each
+     cartesian column. These columns are then adjusted to each other to
+     homogenize the work in each column
+  3. the work between neighboring domains in each column is adjusted.
+     Each adjustment is done locally with the neighboring planes,
+     columns or domains by adjusting the adjacent boundaries.
+
+:ref:`Topological mesh<topologicalDetails>` (WIP)
+  In contrast to the previous methods, this method adjusts domains not
+  by moving boundaries but by moving vertices, i.e. the corner points of
+  domains. For each vertex a force, based on the differences in work of
+  the neighboring domains, is computed, and the vertex is shifted so as
+  to equalize the work between these neighboring domains.
+
+:ref:`Voronoi mesh<voronoiDetails>` (WIP)
+  Similar to the topological mesh method, this method computes a force,
+  based on work differences. In contrast to the topological mesh method,
+  the force acts on a Voronoi point rather than a vertex, i.e. a point
+  defining a Voronoi cell, which describes the domain. Consequently, the
+  number of neighbors is not a conserved quantity, i.e. the topology may
+  change over time. ALL uses the Voro++ library published by the Lawrence
+  Berkeley National Laboratory for the generation of the Voronoi mesh.
+
+:ref:`Histogram based staggered grid<histogramDetails>`
+  Resulting in the same grid as the staggered grid scheme, this scheme
+  uses the cumulative work function in each of the three cartesian
+  directions in order to generate this grid. Using histograms and the
+  previously defined distribution of process domains in a cartesian
+  grid, this scheme generates a staggered-grid result in three steps, in
+  which the work is distributed as evenly as the resolution of the
+  underlying histogram allows. In contrast to the previously mentioned
+  schemes, this scheme depends on a global exchange of work between
+  processes.
+
+:ref:`Orthogonal recursive bisection<orthogonalDetails>` (Planned)
+  Comparable to the histogram based staggered grid scheme, this scheme
+  uses cumulative work functions evaluated by histograms in order to find
+  a new distribution of workload in a hierarchical manner. In contrast to
+  the histogram based staggered grid scheme, the subdivision of domains is
+  not based on a cartesian division of processes, but on the prime
+  factorization of the total number of processes. During each step the
+  current slice of the system is divided into smaller sub-slices along
+  its longest edge, each containing roughly the same workload. The
+  number of sub-slices is then determined by the corresponding prime
+  number of that step, i.e. when trying to establish a distribution for
+  24 (3 * 2 * 2 * 2) processes, the bisection would use four steps. In
+  the first step the system would be subdivided into 3 sub domains along
+  the largest edge, each containing roughly the same workload. Each of
+  these sub domains is then divided into two smaller sub domains
+  independently. This procedure is then repeated for all the prime
+  numbers in the prime factorization.
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us: