LinkTest can test different process-process combinations in parallel. What connections are tested in parallel to other connections and when is defined by the communication pattern. Given an even number of processes, say 16, LinkTest, in its default configuration, can test 16 of the potential 240 connections by testing between 8 exclusive pairs forwards and backwards. To test the remaining 224 possible connections requires repeating this 14 additional times. Each time LinkTest can test connections in parallel is called a step, so to test the connections between 16 processes takes about 15 steps.
A communication pattern defines which connections are tested together in parallel. Continue with our example of 16 processes, the communication patterns defines which 8 connection pairs are tested in parallel. A common way to generate this pairing is the 1-Factor algorithm, however, given an even number n
processes there are (n-1)!!
possible pairings. For the derivation of these expressions see here.
Furthermore these expression over exaggerate the possible space as two steps in which the same pair order is tested, but the ordering of the pairs is different, are considered to be different communication patterns, when in reality this is not the case. As a result P
should be divided by n!
. We have also likely forgotten to include other effects, as such consider these numbers to be an upper bound.
This potential space is for large n
neigh impossible to iterate through. As a result the 1-factor algorithm is the default generation method for creating communication patterns. Using randomization, see Usage, we can cover a part of this space. More precisely, when using step randomization only the order in which steps are executed is permuted, this results in (n-1)!
different combinations. When randomizing processes IDs the process ID of each process, think MPI rank, is permuted. This can cover n!
different combinations. Furthermore, common pseudo-random number generators have periods of this magnitude and are thus unsuitable for statistical analysis of the possible permutation space we make available to the user. This means that the pseudo-random number generators cannot cover the entire n!
possible permutation space for large n
, e.g. 100.
Bisection Testing
Aside from the 1-factor algorithm LinkTest also offers other options to define the communication pattern. One of these is to do a bisection test.
In a bisection test the even number of n
processes available to LinkTest is split into two bisecting halves. Connection testing is only done between the two halves in n/2
two steps. In each step n/2
connections are tested in parallel. Note that the possible permutation space exhibits a similar behavior as to the full test, i.e. for large n
it becomes neigh impossible to iterate through all possible combinations.
Grouping According to Hostname
LinkTest can also group processes according to their hostname. In this case testing exclusively occurs between all processes of a group and another group during a step, i.e. no two processes from the same group will communicate to different groups. The group-to-group communication pattern is determined in this case using the 1-Factor algorithm. When testing between two groups all possible connection pairs between the two groups are iterated through. Note that the possible permutation space exhibits a similar behavior as to the full test, i.e. for large n
it becomes neigh impossible to iterate through all possible combinations.
Grouping According to Hostname with Bisection
Differs from the previous in that the even number of groups is split into bisecting halves and only the connections between the two halves are tested. Note that the possible permutation space exhibits a similar behavior as to the full test, i.e. for large n
it becomes neigh impossible to iterate through all possible combinations.
Comparison
The above figure compares the communication patterns for 16 processes split into four groups with a common hostname (red, green, blue and magenta) a possible 1-Factor communication scheme. The colored lines between the rounded boxes with numbers, which denote the group-local process number, a tested connection between two process. The color of the line indicates the step in which the connection was tested. Lines with a common color correspond to connections that are tested together in the same step. In the 1-Factor diagram we find connections between all possible nodes. In the hostname-grouping diagram we only find connections between the four differently colored groups. In a given step all processes from one group only communicate with the processes of one other group. In the bisection diagram we only find connections between the upper-left and lower-right set of hosts. The dashed line denotes the bisection split of the processes. The hostname bisection diagram is a combination of the hostname grouping and bisection. Like the bisection diagram communication only occurs across the dashed diagonal, but now in any step all processes from a given group only communicate with the processes of one other group.
Derivation: Number of Communication Patterns
A Linktest communication pattern is defined by a partition of n = 2*x processes into x unordered pairs. There are (n−1)!!=(n−1)(n−3) … 3*1 such partitions.
Proof via induction:
Induction Start x = 1, n = 2, obviously there is only 1 partition. (n-1)!! = 1!! = 1
Induction Thesis a_x = (n-1)!! is the number of partitions of n processes where n = 2*x into x unordered pairs.
Induction Step n+2 processes can be devided as follows.
There are n+1 possibilities to pair the first element with any other. After that there are n elements left. Per induction thesis there are a_x possibilities to finish the partition.
(n+1) * a_x = (n+1)(n-1)!! = (n+1)!! = a_{(x+1)}