A Load Balancing Library (ALL)

The library aims to provide an easy way to include dynamic domain-based load
balancing into particle-based simulation codes. The library is developed in the
Simulation Laboratory Molecular Systems of the Juelich Supercomputing Centre at
Forschungszentrum Juelich.

It includes or is going to include several load-balancing schemes. The
following list gives an overview of these schemes with short explanations
([o]: scheme fully included, [w]: implementation ongoing, [p]: implementation
planned):

- [o] Tensor-Product:
The work on all processes is reduced over the Cartesian planes in the system.
This work is then equalized by adjusting the borders of the Cartesian planes.

- [o] Staggered-grid:
A 3-step hierarchical approach is applied, where: (i) the work over the
Cartesian planes is reduced, before the borders of these planes are adjusted;
(ii) in each of the Cartesian planes the work is reduced for each Cartesian
column, and these columns are then adjusted to each other to homogenize the
work in each column; (iii) the work between neighboring domains in each column
is adjusted. Each adjustment is done locally with the neighboring planes,
columns or domains by adjusting the adjacent boundaries.

- [w] Topological Mesh:
In contrast to the previous methods, this method adjusts domains not by moving
boundaries but by moving vertices, i.e. corner points, of domains. For each
vertex a force, based on the differences in work of the neighboring domains,
is computed, and the vertex is shifted in a way that equalizes the work
between these neighboring domains.

- [w] Voronoi Mesh:
Similar to the topological mesh method, this method computes a force based on
work differences. In contrast to the topological mesh method, the force acts
on a Voronoi point rather than on a vertex, i.e. on a point defining a Voronoi
cell, which describes the domain. Consequently, the number of neighbors is not
a conserved quantity, i.e. the topology may change over time. ALL uses the
Voro++ library published by the Lawrence Berkeley National Laboratory for the
generation of the Voronoi mesh.

- [o] Histogram-based Staggered-grid:
Resulting in the same kind of grid as the staggered-grid scheme, this scheme
uses the cumulative work function in each of the three Cartesian directions in
order to generate the grid. Using histograms and the previously defined
distribution of process domains in a Cartesian grid, this scheme generates a
staggered-grid result in three steps, in which the work is distributed as
evenly as the resolution of the underlying histogram allows. In contrast to
the previously mentioned schemes, this scheme depends on a global exchange of
work between processes.

- [p] Orthogonal Recursive Bisection:
Comparable to the histogram-based staggered-grid scheme, this scheme uses
cumulative work functions evaluated from histograms in order to find a new
distribution of the workload in a hierarchical manner. In contrast to the
histogram-based staggered-grid scheme, the subdivision of domains is not based
on a Cartesian division of processes, but on the prime factorization of the
total number of processes. During each step, the current slice of the system
is divided along its longest edge into smaller sub-slices that contain roughly
the same workload. The number of sub-slices is determined by the prime factor
corresponding to that step, i.e. when trying to establish a distribution for
24 (3 * 2 * 2 * 2) processes, the bisection would take four steps (see the
sketch below). In the first step the system would be subdivided into 3
subdomains along its longest edge, each containing roughly the same workload.
Each of these subdomains is then divided into two sub-subdomains
independently. This procedure is repeated for all prime factors of the prime
factorization.
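To make the relation between process count and bisection steps concrete, the
following standalone sketch (an illustration only, not part of the library)
computes the prime factors of the process count, largest factor first, to
match the ordering used in the example above; each factor corresponds to one
subdivision step:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Decompose the number of processes into prime factors; each factor
    // corresponds to one subdivision step of the recursive bisection.
    std::vector<int> prime_factors(int n) {
        std::vector<int> factors;
        for (int p = 2; p * p <= n; ++p)
            while (n % p == 0) { factors.push_back(p); n /= p; }
        if (n > 1) factors.push_back(n);
        std::sort(factors.rbegin(), factors.rend()); // largest factor first
        return factors;
    }

    int main() {
        for (int f : prime_factors(24)) // prints "3 2 2 2", i.e. four steps
            std::printf("%d ", f);
        std::printf("\n");
        return 0;
    }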
Installation & Requirements:

    Base requirements:
        C++11
        MPI
        CMake 3.14+

    Optional requirements:
        Fortran 2003
        Fortran 2008 (if the more modern Fortran MPI interface is to be used)
        VTK 7.1+ (for VTK-based output of the domains)
        Boost testing utilities
        Doxygen
        Sphinx (with breathe extension)

    Installation:

    1.) Either clone the library from
            https://gitlab.version.fz-juelich.de/SLMS/loadbalancing
        or download it from the same location into a directory on your system
        ($ALL_ROOT_DIR).

    2.) Create a build directory $ALL_BUILD_DIR in $ALL_ROOT_DIR, change into
        $ALL_BUILD_DIR and call

            cmake $ALL_ROOT_DIR <options>

        which sets up the installation. There are some optional features you
        can set up with the following options:

            -DCMAKE_INSTALL_PREFIX=<$ALL_INSTALL_DIR> [default: depends on system]
                sets the directory $ALL_INSTALL_DIR into which "make install"
                copies the compiled library and examples
            -DCM_ALL_VTK_OUTPUT=ON/OFF [default: OFF]
                enables/disables the VTK-based output of the domains
                (requires VTK 7.1+)
            -DCM_ALL_FORTRAN=ON/OFF [default: OFF]
                compiles the Fortran interface and example
            -DCM_ALL_VORONOI=ON/OFF [default: OFF]
                activates the compilation of the Voronoi-mesh scheme
                (and the compilation of Voro++)
            -DCM_ALL_USE_F08=ON/OFF [default: OFF]
                activates the Fortran 2008 MPI interface
            -DCM_ALL_TESTING=ON/OFF [default: OFF]
                activates unit tests, which can be run with "make test"

    3.) Execute make to compile and install the library to the previously set
        directory:

            make
            make install

    After "make install" the compiled library and the compiled examples are
    located in $ALL_INSTALL_DIR.

Usage:

    ALL uses C++ template programming to deal with the different data types
    that describe domain boundaries and domain-based work. To encapsulate the
    data, ALL uses a class in which the required data and the computed results
    of a load-balancing routine are stored and from which they can be
    accessed. To include the library into an existing code, the following
    steps are needed:

    1.) Create an object of the load-balancing class:

            ALL<T,W> ()
            ALL<T,W> ( int dimension, T gamma )
            ALL<T,W> ( int dimension, std::vector<ALL_Point<T>> vertices, T gamma )

        As mentioned before, the library uses template programming, where T is
        the data type used to describe boundaries between domains and vertices
        (usually float or double) and W is the data type used to describe the
        work load of a process (usually float or double). The first version of
        the constructor creates a base object of the load-balancing class that
        contains no data and uses a three-dimensional system. The second
        constructor sets up the system dimension (currently only
        three-dimensional systems are supported) and the relaxation parameter
        gamma, which controls the convergence of the load-balancing methods.
        In the third version of the constructor a set of vertices describing
        the local domain is passed in addition, using the ALL_Point class
        described below.

            ALL_Point<T>( const int dimension )
            ALL_Point<T>( const int dimension, const T* values )
            ALL_Point<T>( const std::vector<T>& values )

            T& ALL_Point<T>::operator[]( const int index )

        ALL_Point is a class describing a point in space, where the dimension
        of the space is given by the input parameter. It can be initialized
        with either an array of data type T or a std::vector<T>. In the latter
        case passing a dimension is not required, since the dimension of the
        point is derived from the length of the std::vector. For the
        initialization with an array the user has to ensure that the passed
        array is of sufficient length, i.e. of length dimension (or longer).
        To update or access an initialized ALL_Point object, the [] operator
        is overloaded, so the n-th component of an ALL_Point object p can be
        updated or accessed as p[n].
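        A minimal sketch of this first step is shown below. The header name
        "ALL.hpp", the function name and the concrete value of gamma are
        assumptions made for illustration, not prescribed by this README:

            #include "ALL.hpp" // assumed header name
            #include <vector>

            void create_balancer() {
                // lower left front and upper right back corner of the
                // local domain
                ALL_Point<double> lo(std::vector<double>{0.0, 0.0, 0.0});
                ALL_Point<double> hi(std::vector<double>{1.0, 1.0, 1.0});
                std::vector<ALL_Point<double>> vertices{lo, hi};

                // relaxation parameter (value chosen arbitrarily here)
                double gamma = 4.0;
                ALL<double,double> lb(3, vertices, gamma);

                // components are read and written via operator[]
                hi[2] = 2.0;
            }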
    2.) Set up the basic parameters of the system:

        There are three input parameters that are required by the library:

        a) the vertices describing the local domain
        b) the work load of the local domain
        c) the Cartesian MPI communicator on which the program is executed
           (the requirement of a Cartesian communicator is under review)

        These parameters can be set with the following methods:

            void ALL<T,W>::set_vertices(std::vector<ALL_Point<T>>& vertices)
            void ALL<T,W>::set_vertices(const int n, const int dimension, const T*)

            void ALL<T,W>::set_communicator(MPI_Comm comm)

            void ALL<T,W>::set_work(const W work)

        The vertices can be set either with a std::vector of ALL_Point data,
        from which the number of vertices and the dimension of each point can
        be derived, or with an array of data type T, in which case the number
        of vertices and the dimension of the vertices have to be passed as
        well. For the MPI communicator, the communicator used by the calling
        program needs to be passed, and for the work a single value of data
        type W needs to be passed.
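        A short sketch of this step, continuing the example above (all names
        besides the documented methods are illustrative assumptions):

            #include "ALL.hpp" // assumed header name
            #include <mpi.h>
            #include <vector>

            // Pass the three required inputs to an existing balancer object.
            void configure(ALL<double,double>& lb, MPI_Comm cart_comm,
                           std::vector<ALL_Point<double>>& vertices,
                           double local_particles) {
                lb.set_vertices(vertices);      // corners of the local domain
                lb.set_communicator(cart_comm); // Cartesian MPI communicator
                lb.set_work(local_particles);   // work estimate, e.g. the
                                                // local particle count
            }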
    3.) Set up the chosen load-balancing routine:

        To set up the required internal data structures, a call of the
        following method is required:

            void ALL<T,W>::setup( short method )

        with method being one of:

            ALL_LB_t::TENSOR
            ALL_LB_t::STAGGERED
            ALL_LB_t::HISTOGRAM

        The keyword method chooses the load-balancing strategy from the list
        above. The starting point for all methods described below is a given
        domain structure (e.g. the one that is initially set up by the calling
        program). In the case of TENSOR and STAGGERED, the domains need to be
        orthogonal. For these two methods, the procedure to adjust the work
        load between domains is a multi-step approach where in each step sets
        of domains are combined into super-domains, the work of which is
        mutually adjusted.

        Short overview of the methods:

        ALL_LB_t::TENSOR:

            In order to equalize the load of the individual domains, the
            assumption is made that this can be achieved by equalizing the
            work in each Cartesian direction, i.e. the work of all domains
            having the same coordinate in a Cartesian direction is collected,
            and the width of all these domains in this direction is adjusted
            by comparing this collective work with the collective work of the
            neighboring domains. This is done independently for each Cartesian
            direction in the system.

            Required number of vertices:
                two, one describing the lower left front point and one
                describing the upper right back point of the domain

            Advantages:
                - the topology of the system is maintained (orthogonal
                  domains, neighbor relations)
                - no update of the neighbor relations needs to be made in the
                  calling code
                - if a code is able to deal with orthogonal domains, only
                  small changes are expected to be necessary to include this
                  strategy

            Disadvantages:
                - due to the comparison of collective work loads in Cartesian
                  layers and the restrictions resulting from this
                  construction, the final result might be a sub-optimal
                  domain distribution

        ALL_LB_t::STAGGERED:

            The staggered-grid approach is a hierarchical one. In a first step
            the work of all domains sharing the same Cartesian coordinate with
            respect to the highest dimension (z in three dimensions) is
            collected. Then, as in the TENSOR strategy, the layer width in
            this dimension is adjusted based on a comparison of the collected
            work with the collected work of the neighboring domains in the
            same dimension. In a second step each of these planes is divided
            into a set of columns, in which all domains share the same
            Cartesian coordinate in the next lower dimension (y in three
            dimensions). For each of these columns the procedure described
            before is repeated, i.e. the work is collected and the width of
            the columns adjusted accordingly. Finally, in the last step, the
            work of the individual domains is compared to that of their direct
            neighbors and the width in the last dimension (x in three
            dimensions) is adjusted. This leads to a staggered grid that is
            much better suited to describe inhomogeneous work distributions
            than the TENSOR strategy.

            Required number of vertices:
                two, one describing the lower left front point and one
                describing the upper right back point of the domain

            Advantages:
                - very good equalization results for the work load
                - maintains orthogonal domains

            Disadvantages:
                - changes the topology of the domains and requires an
                  adjustment of the neighbor relations
                - the communication pattern in the calling code might require
                  adjustment to deal with changing neighbors

        ALL_LB_t::HISTOGRAM:

            The histogram-based method differs from the first two methods in
            that it is a global method requiring three distinct steps for a
            single adjustment. In each of these steps the following takes
            place: a partial histogram of the workload (e.g. the number of
            particles) along one direction is created on each domain; these
            partial histograms are supplied to the method, which computes a
            global histogram. From this global histogram a distribution
            function is created, which is used to compute the optimal
            (possible) width of the domains in that direction. For the second
            and third step the computation of the global histograms and
            distribution functions takes place in the subsystems resulting
            from the previous step. The result is an optimal distribution of
            domains in the sense of the staggered-grid method, at the cost of
            a global exchange of work due to the global adjustment. The need
            for frequent updates therefore makes the method less suited for
            highly dynamic systems; on the other hand it is well suited for
            static problems, e.g. grid-based simulations.

            Note: currently the order of dimensions is z-y-x.

            Required number of vertices:
                two, one describing the lower left front point and one
                describing the upper right back point of the domain

            Additional requirements:
                a partial histogram of the workload on the local domain in
                the direction of the current correction step

            Advantages:
                - supplies an optimal distribution of domains (restricted by
                  the width of the bins used for the histogram)
                - only three steps are needed to acquire the result

            Disadvantages:
                - expensive in terms of communication, due to global shifts
                - requires the preparation of a histogram and a shift of work
                  between each of the three correction steps

    4.) Compute the new boundaries / vertices:

            void ALL<T,W>::balance(short method)

        The balance method starts the computation of the new vertices
        according to the chosen method. The chosen method is required to be
        the same as the one used in the call to the setup method. If required,
        new neighbor relations are computed as well.

    5.) Access the results:

            std::vector<ALL_Point<T>>& ALL<T,W>::get_result_vertices()

            int ALL<T,W>::get_n_neighbors(short method)

            void ALL<T,W>::get_neighbors(short method, int** neighbors)
            void ALL<T,W>::get_neighbors(short method, std::vector<int>& neighbors)

        The resulting vertices and neighbors can be accessed with the methods
        above. If the array version of get_neighbors is used, the address of
        the internal neighbor array is returned in neighbors, which is why an
        int** is passed.
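        Putting steps 3 to 5 together, a single load-balancing step could look
        like the following sketch (header and function names are assumptions;
        only the methods documented above are used):

            #include "ALL.hpp" // assumed header name
            #include <vector>

            // One load-balancing step with the TENSOR method; error handling
            // and the actual particle migration are omitted.
            void balance_step(ALL<double,double>& lb) {
                lb.setup(ALL_LB_t::TENSOR);   // prepare internal structures
                lb.balance(ALL_LB_t::TENSOR); // compute the new vertices

                // retrieve the updated domain and its neighborhood
                std::vector<ALL_Point<double>>& new_vertices =
                    lb.get_result_vertices();
                std::vector<int> neighbors;
                neighbors.reserve(lb.get_n_neighbors(ALL_LB_t::TENSOR));
                lb.get_neighbors(ALL_LB_t::TENSOR, neighbors);

                // ... update boundaries and communication lists in the
                //     calling code, then migrate particles accordingly ...
            }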
Examples:

    Two example codes are included in the distribution, one written in C++ and
    one written in Fortran. The main goal of these examples is to show how the
    library can be included into a particle code.

    ALL_test:

        An MPI C++ code that generates a particle distribution over the
        domains. At program start the domains form an orthogonal decomposition
        of the cubic 3d system, with all domains having the same volume.
        Depending on the Cartesian coordinates of a domain in this
        decomposition, a number of points is created in the domain; these
        points are distributed uniformly over the domain. A polynomial formula
        is used for the number of points in order to create a non-uniform
        point distribution over the set of domains. As an estimate of the work
        load, the number of points within a domain is used. There is no
        particle movement in the current version of the example code;
        therefore particle communication only takes place due to the shift of
        domain borders.

        In its basic version the program creates three output files:

        minmax.dat:
            contains four columns:
                <iteration count> <W_min/W_avg> <W_max/W_avg> <(W_max-W_min)/(W_max+W_min)>
            explanation:
                In order to provide a comparable metric for the success of the
                load-balancing procedure, the minimum and maximum work loads
                in the system are given relative to the average work load of
                the system. The last column is an indicator of the inequality
                of work in the system; in a perfectly balanced system this
                value should be equal to zero (see the sketch after this
                list).

        stddev.txt:
            contains two columns:
                <iteration count> <std_dev>
            explanation:
                The standard deviation of the work load over all domains.

        domain_data.dat:
            contains two blocks of rows of the form:
                <rank> <x_min> <x_max> <y_min> <y_max> <z_min> <z_max> <W>
            explanation:
                The first block is the starting configuration of the domains,
                which should be a uniform grid of orthogonal domains. The
                second block is the configuration after 200 load-balancing
                steps. Each line gives the MPI rank of the domain, its
                extension in each Cartesian direction and its work load (the
                number of points).
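        The quantities written to minmax.dat can be obtained with a few MPI
        reductions. The following sketch (an illustration, not the actual
        ALL_test code) shows one way to compute them from the local work of
        each rank:

            #include <mpi.h>
            #include <cstdio>

            // Compute and print the minmax.dat columns for one iteration.
            void report_imbalance(double local_work, MPI_Comm comm, int step) {
                double w_min, w_max, w_sum;
                int n_ranks;
                MPI_Comm_size(comm, &n_ranks);
                MPI_Allreduce(&local_work, &w_min, 1, MPI_DOUBLE, MPI_MIN, comm);
                MPI_Allreduce(&local_work, &w_max, 1, MPI_DOUBLE, MPI_MAX, comm);
                MPI_Allreduce(&local_work, &w_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
                double w_avg = w_sum / n_ranks;
                // the last value is zero for a perfectly balanced system
                std::printf("%d %f %f %f\n", step, w_min / w_avg,
                            w_max / w_avg, (w_max - w_min) / (w_max + w_min));
            }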
        If the library is compiled with VTK support, ALL_test also creates a
        VTK-based output of the domains in each iteration step. For this to
        work, a directory named "vtk_outline" needs to be created in the
        directory in which the executable is located. The resulting output can
        be visualized with tools like ParaView or VisIt.

    ALL_test_f:

        The Fortran example is a more basic version of the test program
        ALL_test, as its main goal is to show the functionality of the Fortran
        interface. The code creates a basic uniform orthogonal domain
        decomposition and an inhomogeneous particle distribution over these
        domains. Only one load-balancing step is executed, and the program
        prints out the domain distribution of the start configuration and of
        the final configuration.