diff --git a/README.md b/README.md index c634055fa0b33e4ee2dc38a4fc6c8132fc3d866f..459a8d95423ba793af5fd95af014900338aa9ced 100644 --- a/README.md +++ b/README.md @@ -5,62 +5,10 @@ load balancing into particle based simulation codes. The library is developed in the Simulation Laboratory Molecular Systems of the Jülich Supercomputing Centre at Forschungszentrum Jülich. -It includes several load balancing schemes. The following list gives an -overview about these schemes and short explanations. - - - **Tensor product**: The work on all processes is reduced over the - cartesian planes in the systems. This work is then equalized by - adjusting the borders of the cartesian planes. - - **Staggered grid**: A 3-step hierarchical approach is applied, where: - 1. work over the cartesian planes is reduced, before the borders of - these planes are adjusted - 2. in each of the cartesian planes the work is reduced for each - cartesian column. These columns are then adjusted to each other to - homogenize the work in each column - 3. the work between neighboring domains in each column is adjusted. - Each adjustment is done locally with the neighboring planes, - columns or domains by adjusting the adjacent boundaries. - - **Topological mesh** (WIP): In contrast to the previous methods this - method adjusts domains not by moving boundaries but vertices, i.e. - corner points, of domains. For each vertex a force, based on the - differences in work of the neighboring domains, is computed and the - vertex is shifted in a way to equalize the work between these - neighboring domains. - - **Voronoi mesh** (WIP): Similar to the topological mesh method, this - method computes a force, based on work differences. In contrast to - the topological mesh method, the force acts on a Voronoi point rather - than a vertex, i.e. a point defining a Voronoi cell, which describes - the domain. Consequently, the number of neighbors is not a conserved - quantity, i.e. the topology may change over time. 
ALL uses the Voro++ - library published by the Lawrence Berkeley National Laboratory for the - generation of the Voronoi mesh. - - **Histogram based staggered grid**: Resulting in the same grid, as - the staggered grid scheme, this scheme uses the cumulative work - function in each of the three cartesian directions in order to - generate this grid. Using histograms and the previously defined - distribution of process domains in a cartesian grid, this scheme - generates in three steps a staggered-grid result, in which the work - is distributed as evenly as the resolution of the underlying - histogram allows. In contrast to the previously mentioned schemes - this scheme depends on a global exchange of work between processes. - - **Orthogonal recursive bisection** (Planned): Comparable to the - histogram based staggered grid scheme, this scheme uses cumulative - work functions evaluated by histograms in order to find a new - distribution of workload in a hierarchical manner. In contrast to the - histogram based staggered grid scheme, the subdivision of domains is - not based on a cartesian division of processes, but on the prime - factorization of the total number of processes. During each step, the - current slice of the system is distributed into smaller sub-slices - along the longest edge, that roughly contain the same workload. The - number of sub-slices in then determined by the corresponding prime - number of that step, i.e. when trying to establish a distribution for - 24 (3 * 2 * 2 * 2) processes, the bisection would use four steps. In - the first step the system would be subdivided into 3 sub domains - along the largest edge, each containing roughly the same workload. - Each of these sub domains then is divided into two sub sub domains - independently. This procedure is then repeated for all the prime - numbers in the prime factorization. 
- +Only a brief summary is given here; more information, such as a detailed +API description, examples and further details regarding the load balancing +methods, can be found in the [official documentation](docs/index.rst). ## Installation and Requirements @@ -86,337 +34,11 @@ Optional requirements: `$ALL_ROOT_DIR`. 2. Create the build directory `$ALL_BUILD_DIR` some place else. 3. Call `cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"` to set up the - installation. Compile time features can be selected from: - - `-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR` (default: system - dependant): Install into `$ALL_INSTALL_DIR` after compiling. - - `-DCM_ALL_VTK_OUTPUT=ON` (default: `OFF`): Enable VTK output - of domains. Requires VTK 7.1 or higher. - - `-DCM_ALL_FORTRAN=ON` (default: `OFF`): Enable Fortran interface. - Requires Fortran 2003 capable compiler. - - `-DCM_ALL_USE_F08=ON` (default: `OFF`): Enable usage of `mpi_f08` - module for MPI. Requires Fortran 2008 capable compiler and - compatible MPI installation. - - `-DCM_ALL_FORTRAN_ERROR_ABORT=ON` (default: `OFF): Abort - execution on any error when using the Fortran interface instead - of setting the error number and leaving error handling to the - user. - - `-DCM_ALL_VORONOI=ON` (default: `OFF`): Enable Voronoi mesh - method and subsequently compilation and linkage of Voro++. - - `-DCM_ALL_TESTING=ON` (default: `OFF`): Turns on unit and feature - tests. Can be run from the build directory with `ctest`. Requires - the Boost test utilities. - - `-DCMAKE_BUILD_TYPE=Debug` (default: `Release`): Enable library - internal debugging features. Using `DebugWithOpt` also turns on - some optimizations. - - `-DCM_ALL_DEBUG=ON` (default: `OFF`): Enable self consistency - checks within the library itself. - - `-DVTK_DIR=$VTK_DIR` (default: system dependant): Path to an - explicit VTK installation. Make sure that VTK is compiled with - the same MPI as the library and subsequently your code.
- To use a specific compiler and Boost installation use: `CC=gcc - CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]`. + installation. To use a specific compiler and Boost installation + use: `CC=gcc CXX=g++ BOOST_ROOT=$BOOST_DIR cmake [...]`. 4. To build and install the library then run: `cmake --build "$ALL_BUILD_DIR"` and `cmake --install "$ALL_BUILD_DIR"`. Afterwards, the built examples and library files are placed in `$ALL_INSTALL_DIR`. - -## Usage -The library is presented as a single load balancing object `ALL::ALL` -from `ALL.hpp`. All classes are encapsulated in the `ALL` name space. -Simple (and not so simple) examples are available in the `/examples` sub -directory. The simple examples are also documented at length. - -Errors are treated as exceptions that are thrown of a (sub) class of -`ALL::CustomException`. Likely candidates that may throw exceptions - are `balance`,`setup`, `printVTKoutlines`. - -### ALL object -The ALL object can be constructed with - - ALL::ALL<T,W> ( - const ALL::LB_t method, - const int dimension, - const T gamma) - -Where the type of the boundaries is given by `T` and the type of the -work as `W`, which are usually float or double. The load balancing -method must be one of `ALL::LB_t::TENSOR`, `...STAGGERED`, -`...FORCEBASED`, `...VORONOI`, `...HISTOGRAM`. Where the Voronoi -method must be enabled at compile time. There is also a second form -where the initial domain vertices are already provided: - - ALL::ALL<T,W> ( - const ALL::LB_t method, - const int dimension, - const std::vector<ALL::Point<T>> &initialDomain, - const T gamma) - -### Point object -The domain boundaries are given and retrieved as a vector of -`ALL::Point`s. 
These can be constructed via - - ALL::Point<T> ( const int dimension ) - ALL::Point<T> ( const int dimension, const T *values ) - ALL::Point<T> ( const std::vector<T> &values ) - -and its elements can be accessed through the `[]` operator like a `std::vector` - -### Set up -A few additional parameters must be set before the domains can be -balanced. Which exactly need to be set, is dependent on the method used, -but in general the following are necessary - - void ALL::ALL<T,W>::setVertices ( std::vector<ALL::Point<T>> &vertices ) - void ALL::ALL<T,W>::setCommunicator ( MPI_Comm comm ) - void ALL::ALL<T,W>::setWork ( const W work ) - -If the given communicator is not a cartesian communicator the process -grid parameters must also be set beforehand (!) using - - void ALL::ALL<T,W>::setProcGridParams ( - const std::vector<int> &location, - const std::vector<int> &size) - -with the location of the current process and number of processes in each -direction. - -To trace the current domain better an integer tag can be provided, which is -used in the domain output, with - - void ALL::ALL<T,W>::setProcTag ( int tag ) - -and an observed minimal domain size in each direction can be set using - - void ALL::ALL<T,W>::setMinDomainSize ( - const std::vector<T> &minSize ) - -Once these parameters are set call - - void ALL::ALL<T,W>::setup() - -so internal set up routines are run. - -### Balancing -To create the new domain vertices make sure the current vertices are set -using `setVertices` and the work is set using `setWork` then call - - void ALL::ALL<T,W>::balance() - -and then the new vertices can be retrieved by using - - std::vector<ALL::Point<T>> &ALL::ALL<T,W>::getVertices () - -or new neighbors with - - std::vector<int> &ALL::ALL<T,W>::getNeighbors () - -### Special considerations for the Fortran interface -The Fortran interface exists in two versions. 
Either in the form of - - ALL_balance(all_object) - -where the all object of type `type(ALL_t)` is given as first argument to -each function and as more object oriented functions to the object -itself. So the alternative call would be - - all_object%balance() - -The function names themselves are similar, i.e. instead of camel case -they are all lowercase except the first `ALL_` with underscores between -words. So `setProcGridParams` becomes `ALL_set_proc_grid_parms`. For -more usage info please check the Fortran example programs. - -One important difference is the handling of MPI types, which change -between the MPI Fortran 2008 interface as given by the `mpi_f08` module -and previous interfaces. In previous interfaces all MPI types are -`INTEGER`s, but using the `mpi_f08` module the communicator is of -`type(MPI_Comm)`. To allow using `ALL_set_communicator` directly with -the communicator used in the user's application, build the library with -enabled Fortran 2008 features and this communicator type is expected. - -Retrieving information from the balancer is also different, since most -getter return (a reference to) an object itself. The Fortran subroutines -set the values of its arguments. As an example - - int work = all.getWork(); - -becomes - - integer(c_int) :: work - call ALL_get_work(work) !or - !call all%get_work(work) - -Since there is no exception handling in Fortran, the error handling from -the libc is borrowed. So there are additional function that retrieve and -reset an error number as well as an additional error message describing -the error. These must be queried and handled by the user. An example of -this usage is shown in the `ALL_Staggered_f` example when printing the -VTK outlines. - -### Details of the balancing methods -#### Tensor based -In order to equalize the load of individual domains, the assumption is -made that this can be achieved by equalizing the work in each cartesian -direction, i.e. 
the work of all domains having the same coordinate in a -cartesian direction is collected and the width of all these domains in -this direction is adjusted by comparing this collective work with the -collective work of the neighboring domains. This is done independently -for each cartesian direction in the system. - -Required number of vertices: - - two, one describing the lower left front point and one describing the - upper right back point of the domain - -Advantages: - - topology of the system is maintained (orthogonal domains, neighbor - relations) - - no update for the neighbor relations is required in the calling - code - - if a code was able to deal with orthogonal domains, only small - changes are expected to include this strategy - -Disadvantages: - - due to the comparison of collective work loads in cartesian layers - and restrictions resulting from this construction, the final result - might lead to a sub-optimal domain distribution - -#### Staggered grid -The staggered grid approach is a hierarchical one. In a first step the -work of all domains sharing the same cartesian coordinate with respect -to the highest dimension (z in three dimensions) is collected. Then, -like in the TENSOR strategy the layer width in this dimension is -adjusted based on comparison of the collected work with the collected -work of the neighboring domains in the same dimension. As a second step -each of these planes is divided into a set of columns, where all domains -share the same cartesian coordinate in the next lower dimension (y in -three dimensions). For each of these columns the before described -procedure is repeated, i.e. work collected and the width of the columns -adjusted accordingly. Finally, in the last step the work of individual -domains is compared to direct neighbors and the width in the last -dimension (x in three dimensions) adjusted. 
This leads to a staggered -grid, that is much better suited to describe inhomogeneous work -distributions than the TENSOR strategy. - -Required number of vertices: - - two, one describing the lower left front point and one describing the - upper right back point of the domain - -Advantages: - - very good equalization results for the work load - - maintains orthogonal domains - -Disadvantages: - - changes topology of the domains and requires adjustment of neighbor - relations - - communication pattern in the calling code might require adjustment to - deal with changing neighbors - -#### Histogram based staggered grid - -The histogram-based method works differently than the first two methods -in such a way, that it is a global method that requires three distinct -steps for a single adjustment. In each of these steps the following -takes place: a partial histogram needs to be created over the workload, -e.g. number of particles, along one direction on each domain, then these -are supplied to the method, where a global histogram is computed. With -this global histogram a distribution function is created. This is used -to compute the optimal (possible) width of domains in that direction. -For the second and third steps the computation of the global histograms -and distribution functions take place in subsystems, being the results -of the previous step. The result is the most optimal distribution of -domains according to the Staggered-grid method, at the cost of global -exchange of work, due to the global adjustment, which makes the method -not well suited to highly dynamic systems, due to the need of frequent -updates. On the other hand the method is well suited for static -problems, e.g. grid-based simulations. - -Note: Currently the order of dimensions is: z-y-x. 
- -Required number of vertices: - - two, one describing the lower left front point and one describing the - upper right back point of the domain - -Additional requirements: - - partial histogram created over the workload on the local domain in - the direction of the current correction step - -Advantages: - - supplies an optimal distribution of domains (restricted by width of - bins used for the histogram) - - only three steps needed to acquire result - -Disadvantages: - - expensive in cost of communication, due to global shifts - - requires preparation of histogram and shift of work between each of - the three correction steps - - -## Example programs -In the `/examples` sub directory several C++ and Fortran example -programs are placed. - -### `ALL_test` -MPI C++ code that generates a particle distribution over the domains. At -program start the domains form an orthogonal decomposition of the cubic -3d system. Each domain has the same volume as each other domain. -Depending on the cartesian coordinates of the domain in this -decomposition, a number of points is created on each domain. The points -are then distributed uniformly over the domain. For the number of points -a polynomial formula is used to create a non-uniform point distribution. -As an estimation of the work load in this program the number of points -within a domain was chosen. There is no particle movement included in -the current version of the example code, therefore particle -communication only takes place due to the shift of domain borders. - -The program creates three output files in its basic version: - - - `minmax.dat`: - - **Columns**: `<iteration count> <W_min/W_avg> <W_max_W_avg> - <(W_max-W_min)/(W_max+W_min)>` - - **Explanation**: In order to try to give a comparable metric for - the success of the load balancing procedure the relatative - difference of the minimum and maximum work loads in the system in - relation to the average work load of the system are given. 
The - last column gives an indicator for the inequality of work in the - system, in a perfectly balanced system, this value should be equal - to zero. - - - `stddev.txt`: - - **Columns**: `<iteration count> <std_dev>` - - **Explanation**: The standard deviation of work load over all - domains. - - - `domain_data.dat`: - - **Columns**: `<rank> <x_min> <x_max> <y_min> <y_max> <z_min> - <z_max> <W>` - - **Explanation**: There are two blocks of rows, looking like this, - the first block is the starting configuration of the domains, - which should be a uniform grid of orthogonal domains. The second - block is the configuration after 200 load balancing steps. In each - line the MPI rank of the domain and the extension in each - cartesian direction is given, as well as the work load (the number - of points). - -If the library is compiled with VTK support, `ALL_test` also creates a VTK -based output of the domains in each iteration step. This output will be -placed in a `vtk_outline` sub directory, which is created if it does not -already exist. The resulting output can be visualized with tools like -ParaView or VisIt. - -### `ALL_test_f` and `ALL_test_f_obj` -The Fortran example provides a more basic version of the test program -`ALL_test`, as its main goal is to show the functionality of the Fortran -interface. The code creates a basic uniform orthogonal domain -decomposition and creates an inhomogeneous particle distribution over -these. Only one load balancing step is executed and the program prints -out the domain distribution of the start configuration and of the final -configuration. - -### `ALL_Staggered` and `ALL_Staggered_f` -These create a very simple load balancing situation with fixed load and -domains placed along the z direction. The C++ and Fortran versions are -functionally the same, so the differences between the interfaces can be -checked. 
- <!-- vim: set tw=72 et ts=4 spell spelllang=en_us ft=markdown: --> diff --git a/docs/Examples.rst b/docs/Examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..84c5e396cdf83e7bd40c866dc433ff1466e5d4a9 --- /dev/null +++ b/docs/Examples.rst @@ -0,0 +1,76 @@ +.. _examples: + +Examples +======== +In the ``/examples`` sub directory several C++ and Fortran example +programs are placed. + +``ALL_test`` +------------ +MPI C++ code that generates a particle distribution over the domains. At +program start the domains form an orthogonal decomposition of the cubic 3d +system. All domains have the same volume. Depending on the cartesian +coordinates of the domain in this decomposition, a number of points is +created on each domain. The points are then distributed uniformly over the +domain. For the number of points a polynomial formula is used to create a +non-uniform point distribution. The number of points within a domain +serves as the estimate of the work load in this program. There is no +particle movement included in the current version of the example code; +particle communication therefore only takes place due to the shift of +domain borders. + +The program creates three output files in its basic version: + +``minmax.dat`` + Columns + ``<iteration count> <W_min/W_avg> <W_max/W_avg> + <(W_max-W_min)/(W_max+W_min)>`` + Explanation + To give a comparable metric for the success of the load balancing + procedure, the minimum and maximum work loads in the system are given + relative to the average work load of the system. The last column is an + indicator of the work imbalance in the system; in a perfectly balanced + system this value should be zero. + +``stddev.txt`` + Columns + ``<iteration count> <std_dev>`` + Explanation + The standard deviation of the work load over all domains.
+ +``domain_data.dat`` + Columns + ``<rank> <x_min> <x_max> <y_min> <y_max> <z_min> <z_max> <W>`` + Explanation + The file contains two blocks of rows: the first block is the starting + configuration of the domains, which should be a uniform grid of + orthogonal domains. The second block is the configuration after 200 + load balancing steps. In each line the MPI rank of the domain and the + extension in each cartesian direction are given, as well as the work + load (the number of points). + +If the library is compiled with VTK support, ``ALL_test`` also creates a +VTK based output of the domains in each iteration step. This output will +be placed in a ``vtk_outline`` sub directory, which is created if it does +not already exist. The resulting output can be visualized with tools like +ParaView or VisIt. + +``ALL_test_f`` and ``ALL_test_f_obj`` +------------------------------------- +The Fortran example provides a more basic version of the test program +``ALL_test``, as its main goal is to show the functionality of the Fortran +interface. The code creates a basic uniform orthogonal domain +decomposition and an inhomogeneous particle distribution over it. Only one +load balancing step is executed and the program prints out the domain +distribution of the start configuration and of the final configuration. + +``ALL_Staggered`` and ``ALL_Staggered_f`` +----------------------------------------- +These create a very simple load balancing situation with fixed load and +domains placed along the z direction. The C++ and Fortran versions are +functionally the same, so the differences between the interfaces can be +checked. + +.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us: diff --git a/docs/Install.rst b/docs/Install.rst new file mode 100644 index 0000000000000000000000000000000000000000..af003718a2b9d4637decfc4d3c56918842291cff --- /dev/null +++ b/docs/Install.rst @@ -0,0 +1,65 @@ +.. 
_install: + +Installation +============ + +Requirements +------------ + +Base requirements + - C++11 capable compiler + - MPI support + - CMake v.3.14 or higher + +Optional requirements + - Fortran 2003 capable compiler (Fortran interface) + - Fortran 2008 capable compiler (Usage of the ``mpi_f08`` interface) + - VTK 7.1 or higher (Domain output) + - Boost testing utilities + - Doxygen and Sphinx with ``breathe`` (Documentation) + +Installation +------------ + +1. Clone the library from + ``https://gitlab.version.fz-juelich.de/SLMS/loadbalancing`` into + ``$ALL_ROOT_DIR``. +2. Create the build directory ``$ALL_BUILD_DIR`` some place else. +3. Call ``cmake -S "$ALL_ROOT_DIR" -B "$ALL_BUILD_DIR"`` to set up the + installation. Other compile time features should be set here. To use a + specific compiler and Boost installation use: ``CC=gcc CXX=g++ + BOOST_ROOT=$BOOST_DIR cmake [...]``. +4. To build and install the library then run: ``cmake --build + "$ALL_BUILD_DIR"`` and ``cmake --install "$ALL_BUILD_DIR"``. Afterwards, + the built examples and library files are placed in ``$ALL_INSTALL_DIR``. + +Compile time features +********************* +``-DCMAKE_INSTALL_PREFIX=$ALL_INSTALL_DIR`` (default: system dependent) + Install into ``$ALL_INSTALL_DIR`` after compiling. +``-DCM_ALL_VTK_OUTPUT=ON`` (default: ``OFF``) + Enable VTK output of domains. Requires VTK 7.1 or higher. +``-DCM_ALL_FORTRAN=ON`` (default: ``OFF``) + Enable Fortran interface. Requires Fortran 2003 capable compiler. +``-DCM_ALL_USE_F08=ON`` (default: ``OFF``) + Enable usage of ``mpi_f08`` module for MPI. Requires Fortran 2008 + capable compiler and compatible MPI installation. +``-DCM_ALL_FORTRAN_ERROR_ABORT=ON`` (default: ``OFF``) + Abort execution on any error when using the Fortran interface instead of + setting the error number and leaving error handling to the user. +``-DCM_ALL_VORONOI=ON`` (default: ``OFF``) + Enable Voronoi mesh method and subsequently compilation and linkage of + Voro++. 
+``-DCM_ALL_TESTING=ON`` (default: ``OFF``) + Turns on unit and feature tests. Can be run from the build directory + with ``ctest``. Requires the Boost test utilities. +``-DCMAKE_BUILD_TYPE=Debug`` (default: ``Release``) + Enable library internal debugging features. Using ``DebugWithOpt`` also + turns on some optimizations. +``-DCM_ALL_DEBUG=ON`` (default: ``OFF``) + Enable self consistency checks within the library itself. +``-DVTK_DIR=$VTK_DIR`` (default: system dependent) + Path to an explicit VTK installation. Make sure that VTK is compiled + with the same MPI as the library and subsequently your code. + +.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us: diff --git a/docs/Usage.rst b/docs/Usage.rst new file mode 100644 index 0000000000000000000000000000000000000000..40fbae98b34e3addd7d42db0723333b17ef9c952 --- /dev/null +++ b/docs/Usage.rst @@ -0,0 +1,174 @@ +.. _usage: + +Usage +===== + +The library is presented as a single load balancing object ``ALL::ALL`` +from ``ALL.hpp``. All classes are encapsulated in the ``ALL`` name space. +Simple (and not so simple) examples are available in the ``/examples`` sub +directory. The simple examples are also documented at length. + +Errors are reported as exceptions, which are thrown as a (sub)class of +``ALL::CustomException``. Likely candidates that may throw exceptions are +``balance``, ``setup`` and ``printVTKoutlines``. + +ALL object +---------- +The ALL object can be constructed with + +.. code-block:: c++ + + ALL::ALL<T,W> ( + const ALL::LB_t method, + const int dimension, + const T gamma) + +where ``T`` is the type of the boundaries and ``W`` the type of the work, +both usually ``float`` or ``double``. The load balancing method must be +one of ``ALL::LB_t::TENSOR``, ``...STAGGERED``, ``...FORCEBASED``, +``...VORONOI``, ``...HISTOGRAM``; the Voronoi method must be enabled at +compile time. There is also a second form where the initial domain +vertices are already provided: + +.. 
code-block:: c++ + + ALL::ALL<T,W> ( + const ALL::LB_t method, + const int dimension, + const std::vector<ALL::Point<T>> &initialDomain, + const T gamma) + +Point object +------------ +The domain boundaries are given and retrieved as a vector of +``ALL::Point`` s. These can be constructed via + +.. code-block:: c++ + + ALL::Point<T> ( const int dimension ) + ALL::Point<T> ( const int dimension, const T *values ) + ALL::Point<T> ( const std::vector<T> &values ) + +and their elements can be accessed through the ``[]`` operator like those +of a ``std::vector``. + +Set up +------ +A few additional parameters must be set before the domains can be +balanced. Exactly which parameters need to be set depends on the method +used, but in general the following are necessary + +.. code-block:: c++ + + void ALL::ALL<T,W>::setVertices ( std::vector<ALL::Point<T>> &vertices ) + void ALL::ALL<T,W>::setCommunicator ( MPI_Comm comm ) + void ALL::ALL<T,W>::setWork ( const W work ) + +If the given communicator is not a cartesian communicator, the process +grid parameters must also be set beforehand using + +.. code-block:: c++ + + void ALL::ALL<T,W>::setProcGridParams ( + const std::vector<int> &location, + const std::vector<int> &size) + +with the location of the current process and the number of processes in +each direction. + +To better trace the current domain, an integer tag can be provided, which +is used in the domain output, with + +.. code-block:: c++ + + void ALL::ALL<T,W>::setProcTag ( int tag ) + +and an observed minimal domain size in each direction can be set using + +.. code-block:: c++ + + void ALL::ALL<T,W>::setMinDomainSize ( + const std::vector<T> &minSize ) + +Once these parameters are set call + +.. code-block:: c++ + + void ALL::ALL<T,W>::setup() + +so internal set up routines are run. + +Balancing +--------- +To create the new domain vertices make sure the current vertices are set +using ``setVertices`` and the work is set using ``setWork``, then call + +.. 
code-block:: c++ + + void ALL::ALL<T,W>::balance() + +and then the new vertices can be retrieved by using + +.. code-block:: c++ + + std::vector<ALL::Point<T>> &ALL::ALL<T,W>::getVertices () + +or new neighbors with + +.. code-block:: c++ + + std::vector<int> &ALL::ALL<T,W>::getNeighbors () + +Special considerations for the Fortran interface +------------------------------------------------ +The Fortran interface exists in two versions: either in the form of + +.. code-block:: fortran + + ALL_balance(all_object) + +where the ALL object of type ``type(ALL_t)`` is given as the first +argument to each function, or in a more object oriented form where the +functions are bound to the object itself. The alternative call would then +be + +.. code-block:: fortran + + all_object%balance() + +The function names themselves are similar, i.e. instead of camel case they +are all lowercase except the leading ``ALL_``, with underscores between +words. So ``setProcGridParams`` becomes ``ALL_set_proc_grid_parms``. For +more usage info please check the Fortran example programs. + +One important difference is the handling of MPI types, which change +between the MPI Fortran 2008 interface as given by the ``mpi_f08`` module +and previous interfaces. In previous interfaces all MPI types are +``INTEGER`` s, but using the ``mpi_f08`` module the communicator is of +``type(MPI_Comm)``. To allow using ``ALL_set_communicator`` directly with +the communicator used in the user's application, build the library with +Fortran 2008 features enabled; this communicator type is then expected. + +Retrieving information from the balancer is also different, since most +getters return (a reference to) an object. The Fortran subroutines instead +set the values of their arguments. As an example + +.. code-block:: c++ + + int work = all.getWork(); + +becomes + +.. 
code-block:: fortran + + integer(c_int) :: work + call ALL_get_work(work) !or + !call all%get_work(work) + +Since there is no exception handling in Fortran, the error handling from +the libc is borrowed. There are therefore additional functions that +retrieve and reset an error number as well as an error message describing +the error. These must be queried and handled by the user. An example of +this usage is shown in the ``ALL_Staggered_f`` example when printing the +VTK outlines. + +.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us: diff --git a/docs/index.rst b/docs/index.rst index 63436a5a5e3fdd5333bd90203228af252b5c829c..da41150a5f47aadb57e2bdd6ef582eb91bd22984 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -11,9 +11,15 @@ Welcome to the documentation of ALL :caption: Contents: :includehidden: + Install.rst + Usage.rst + Examples.rst api/index.rst + methods/index.rst -todo(s.schulz): Link to doxygen documentation. + +The full (developer) documentation of the source code as provided by +Doxygen can be found `here <../doxygen/index.html>`_. Indices and tables ================== diff --git a/docs/methods/Histogram.rst b/docs/methods/Histogram.rst new file mode 100644 index 0000000000000000000000000000000000000000..9616c6fa3672fc093cd8f3079905f024e9cffb08 --- /dev/null +++ b/docs/methods/Histogram.rst @@ -0,0 +1,43 @@ +.. _histogramDetails: + +Histogram method +================ + +The histogram-based method differs from the first two methods in that it +is a global method, requiring three distinct steps for a single +adjustment. In each of these steps a partial histogram of the workload, +e.g. the number of particles, is created along one direction on each +domain. These partial histograms are then supplied to the method, which +computes a global histogram. With this global histogram a distribution +function is created. This is used to compute the optimal (possible) width +of domains in that direction.
+For the second and third steps the computation of the global histograms
+and distribution functions takes place in subsystems that result from
+the previous step. The result is an optimal distribution of domains
+according to the staggered-grid method, at the cost of a global
+exchange of work. Due to this global adjustment and the need for
+frequent updates, the method is not well suited to highly dynamic
+systems. On the other hand, the method is well suited for static
+problems, e.g. grid-based simulations.
+
+Note: Currently the order of dimensions is: z-y-x.
+
+Required number of vertices
+  - two, one describing the lower left front point and one describing
+    the upper right back point of the domain
+
+Additional requirements
+  - partial histogram created over the workload on the local domain in
+    the direction of the current correction step
+
+Advantages
+  - supplies an optimal distribution of domains (restricted by the
+    width of the bins used for the histogram)
+  - only three steps are needed to acquire the result
+
+Disadvantages
+  - expensive in terms of communication, due to global shifts
+  - requires preparation of a histogram and a shift of work between
+    each of the three correction steps
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/methods/Staggered.rst b/docs/methods/Staggered.rst
new file mode 100644
index 0000000000000000000000000000000000000000..69b7968f53f5bd79bc8fdc89cce13db4112eee95
--- /dev/null
+++ b/docs/methods/Staggered.rst
@@ -0,0 +1,36 @@
+.. _staggeredDetails:
+
+Staggered method
+================
+
+The staggered grid approach is a hierarchical one. In a first step the
+work of all domains sharing the same cartesian coordinate with respect
+to the highest dimension (z in three dimensions) is collected. Then, as
+in the TENSOR strategy, the layer width in this dimension is adjusted
+by comparing the collected work with the collected work of the
+neighboring domains in the same dimension. As a second step each of
+these planes is divided into a set of columns, where all domains share
+the same cartesian coordinate in the next lower dimension (y in three
+dimensions). For each of these columns the procedure described above is
+repeated, i.e. the work is collected and the width of the columns
+adjusted accordingly. Finally, in the last step the work of individual
+domains is compared to that of direct neighbors and the width in the
+last dimension (x in three dimensions) adjusted. This leads to a
+staggered grid that is much better suited to describe inhomogeneous
+work distributions than the TENSOR strategy.
+
+Required number of vertices
+  - two, one describing the lower left front point and one describing
+    the upper right back point of the domain
+
+Advantages
+  - very good equalization results for the work load
+  - maintains orthogonal domains
+
+Disadvantages
+  - changes the topology of the domains and requires adjustment of
+    neighbor relations
+  - the communication pattern in the calling code might require
+    adjustment to deal with changing neighbors
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/methods/Tensor.rst b/docs/methods/Tensor.rst
new file mode 100644
index 0000000000000000000000000000000000000000..25bee32a7a7c7dc0536f6d28740ab6807017376d
--- /dev/null
+++ b/docs/methods/Tensor.rst
@@ -0,0 +1,31 @@
+.. _tensorDetails:
+
+Tensor method
+=============
+
+In order to equalize the load of individual domains, the assumption is
+made that this can be achieved by equalizing the work in each cartesian
+direction: the work of all domains having the same coordinate in a
+cartesian direction is collected, and the width of all these domains in
+this direction is adjusted by comparing this collective work with the
+collective work of the neighboring domains. This is done independently
+for each cartesian direction in the system.
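A single border correction of this kind can be sketched as follows. This is a hedged illustration of the general idea of shifting a shared border towards the more loaded side; the function name, signature, and damping factor are assumptions for this sketch, not the ALL implementation:

```cpp
#include <algorithm>

// Minimal sketch of one correction step: shift the shared border of
// two neighboring cartesian layers towards the layer with more work.
// gamma < 1 damps the shift so the border cannot leave either layer.
// Names and the damping scheme are illustrative, not the ALL API.
double adjustBorder(double border, double widthLeft, double widthRight,
                    double workLeft, double workRight,
                    double gamma = 0.5) {
  const double workSum = workLeft + workRight;
  if (workSum <= 0.0)
    return border; // nothing to balance
  // A positive shift moves the border right, enlarging the less
  // loaded left layer; a negative shift shrinks an overloaded left
  // layer.
  const double shift = gamma * (workRight - workLeft) / workSum *
                       std::min(widthLeft, widthRight);
  return border + shift;
}
```

Applying such a correction once per cartesian direction, for every plane of domains in that direction, yields the behavior described above.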
+
+Required number of vertices
+  - two, one describing the lower left front point and one describing
+    the upper right back point of the domain
+
+Advantages
+  - the topology of the system is maintained (orthogonal domains,
+    neighbor relations)
+  - no update of the neighbor relations is required in the calling
+    code
+  - if a code can already deal with orthogonal domains, only small
+    changes are expected to include this strategy
+
+Disadvantages
+  - due to the comparison of collective work loads in cartesian layers
+    and the restrictions resulting from this construction, the final
+    result might be a sub-optimal domain distribution
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
diff --git a/docs/methods/index.rst b/docs/methods/index.rst
new file mode 100644
index 0000000000000000000000000000000000000000..098bafab1893bb49794a7f0bcc88b15416f940fd
--- /dev/null
+++ b/docs/methods/index.rst
@@ -0,0 +1,74 @@
+The Loadbalancing Methods
+=========================
+
+.. toctree::
+   :hidden:
+
+   Tensor.rst
+   Staggered.rst
+   Histogram.rst
+
+A short overview of the methods is given below and more details can be
+found on their respective pages.
+
+:ref:`Tensor product<tensorDetails>`
+  The work on all processes is reduced over the cartesian planes in the
+  system. This work is then equalized by adjusting the borders of the
+  cartesian planes.
+
+:ref:`Staggered grid<staggeredDetails>`
+  A 3-step hierarchical approach is applied, where:
+
+  1. work over the cartesian planes is reduced, before the borders of
+     these planes are adjusted
+  2. in each of the cartesian planes the work is reduced for each
+     cartesian column. These columns are then adjusted to each other
+     to homogenize the work in each column
+  3. the work between neighboring domains in each column is adjusted.
+
+  Each adjustment is done locally with the neighboring planes, columns
+  or domains by adjusting the adjacent boundaries.
+
+:ref:`Topological mesh<topologicalDetails>` (WIP)
+  In contrast to the previous methods, this method adjusts domains not
+  by moving boundaries but by moving vertices, i.e. corner points, of
+  the domains. For each vertex a force, based on the differences in
+  work of the neighboring domains, is computed, and the vertex is
+  shifted so as to equalize the work between these neighboring domains.
+
+:ref:`Voronoi mesh<voronoiDetails>` (WIP)
+  Similar to the topological mesh method, this method computes a force
+  based on work differences. In contrast to the topological mesh
+  method, the force acts on a Voronoi point rather than a vertex, i.e.
+  a point defining a Voronoi cell, which describes the domain.
+  Consequently, the number of neighbors is not a conserved quantity,
+  i.e. the topology may change over time. ALL uses the Voro++ library
+  published by the Lawrence Berkeley National Laboratory for the
+  generation of the Voronoi mesh.
+
+:ref:`Histogram based staggered grid<histogramDetails>`
+  Resulting in the same grid as the staggered grid scheme, this scheme
+  uses the cumulative work function in each of the three cartesian
+  directions in order to generate this grid. Using histograms and the
+  previously defined distribution of process domains in a cartesian
+  grid, this scheme generates in three steps a staggered-grid result,
+  in which the work is distributed as evenly as the resolution of the
+  underlying histogram allows. In contrast to the previously mentioned
+  schemes, this scheme depends on a global exchange of work between
+  processes.
+
+:ref:`Orthogonal recursive bisection<orthogonalDetails>` (Planned)
+  Comparable to the histogram based staggered grid scheme, this scheme
+  uses cumulative work functions evaluated by histograms in order to
+  find a new distribution of workload in a hierarchical manner. In
+  contrast to the histogram based staggered grid scheme, the
+  subdivision of domains is not based on a cartesian division of
+  processes, but on the prime factorization of the total number of
+  processes. During each step, the current slice of the system is
+  divided into smaller sub-slices along the longest edge that contain
+  roughly the same workload. The number of sub-slices is then
+  determined by the corresponding prime number of that step, i.e. when
+  trying to establish a distribution for 24 (3 * 2 * 2 * 2) processes,
+  the bisection would use four steps. In the first step the system
+  would be subdivided into 3 subdomains along the largest edge, each
+  containing roughly the same workload. Each of these subdomains is
+  then divided into two sub-subdomains independently. This procedure is
+  then repeated for all the prime numbers in the prime factorization.
+
+.. vim: et sw=2 ts=2 tw=74 spell spelllang=en_us:
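The prime-factorization step of the planned orthogonal recursive bisection can be illustrated in code. This is a hedged sketch of how the number of sub-slices per bisection step could be derived from the process count; the factors are sorted in descending order only to match the 24 = 3 * 2 * 2 * 2 example above, and the eventual ALL implementation may order the steps differently:

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch: the prime factorization of the process count
// gives the number of sub-slices created in each bisection step.
// Sorted largest-first to match the 24 = 3 * 2 * 2 * 2 example; the
// planned ALL implementation may choose a different ordering.
std::vector<int> bisectionSteps(int nProcs) {
  std::vector<int> factors;
  for (int p = 2; p * p <= nProcs; ++p)
    while (nProcs % p == 0) { // divide out each prime completely
      factors.push_back(p);
      nProcs /= p;
    }
  if (nProcs > 1) // remaining factor is itself prime
    factors.push_back(nProcs);
  std::sort(factors.rbegin(), factors.rend());
  return factors;
}
```

For 24 processes this yields the factors 3, 2, 2, 2, i.e. four bisection steps, matching the example above.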