All-to-all communication is a commonly occurring communication pattern in which everybody has to communicate something with everybody else. Linktest supports the testing of such communication patterns only for MPI. For an all-to-all communication only the time it took for everyone to finish communicating with everyone else is returned. To turn on all-to-all testing in conjunction with MPI testing please specify the
--alltoall command-line option.
Please note that the exact implementation of how all-to-all communication occurs depends on the used MPI implementation. There are a variety of performant algorithms for all-to-all communication, each with advantages and drawbacks.
An application programming interface (API) facilitates the interaction between different pieces of software, which potentially run on disparate machines. APIs allow for communication between software, and by extension between different compute devices.
The act of collecting data to compare things. Linktest benchmarks communication APIs and the associated hardware by measuring how long it takes for a message to be sent back-and-forth between two tasks, which allows for the comparison to the time it takes the same message to be sent back-and-forth between a different pair of tasks or using a different communication API.
--mode command-line option defines which communication API Linktest benchmarks. As a shorthand
-m can be used. See Communication API for a list of supported communication APIs.
Not to be confused with bisection testing. In bidirectional testing messages are sent between tasks asynchronously. Normally Linktest benchmarks communication times by sending a message from one task in a pair to the other, after which the other sends the same message back. In bidirectional testing both tasks send messages to each other at the same time. This means that neither task waits on the other before sending its message. Such communication is more taxing on the connection between the two tasks but is also commonly faster, because neither task has to wait on the other before sending its message. Bidirectional testing can be turned on by specifying the
--bidirectional command-line option.
Not to be confused with bidirectional testing. In bisection communication testing a population is split into two halves and the communication between the two halves is benchmarked. In Linktest the set of tasks is split into two halves. Linktest then iterates over all possible pairs whose members come from different halves and times their back-and-forth communication for a given message size. This is for example useful for testing cell-to-cell communication performance in hierarchically routed network topologies. Linktest tests bisecting halves of tasks when the
--bisection command-line option is specified.
Please note that the sets are determined at the beginning of testing and are never changed. As such, a given configuration always results in the same split of tasks into halves. If you wish to have different tasks associated with the two halves, the task order needs to be changed. This is ideally done when submitting the parallel job for Linktest.
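The pairing described above can be sketched as follows. This is a minimal illustration and not Linktest's implementation: the function name `bisection_pairs` is hypothetical, and the split by task order into a lower and an upper half (with any odd task falling into the upper half) is an assumption.

```python
from itertools import product


def bisection_pairs(num_tasks):
    """Return every (a, b) pair with a in the first half and b in the second.

    Assumption: the halves are formed by task order, tasks 0..half-1 and
    half..num_tasks-1. How Linktest forms its halves may differ.
    """
    half = num_tasks // 2
    first_half = range(0, half)
    second_half = range(half, num_tasks)
    return list(product(first_half, second_half))


# Four tasks give two halves of two tasks and therefore four cross pairs.
print(bisection_pairs(4))  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

Because the split depends only on the task order, the same configuration always yields the same pairs, which is why reordering tasks at job submission is the way to change them.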
Communication APIs facilitate the communication between different computers by abstracting the underlying necessary hardware commands into easy-to-use portable instructions that can work on a host of different machines. A classical example is MPI.
Linktest can test and benchmark different communication APIs. The communication API that Linktest uses can be controlled via the
--mode command-line option. Alternatively it can be specified by appending it as a suffix to the Linktest-executable name, for example
linktest.mpi, or it can be specified via the
LINKTEST_VCLUSTER_IMPL environment variable.
Linktest supports the following communication APIs:
Note that during Linktest installation only the desired communication APIs are installed. Each one is selected by setting the corresponding environment variable to
1 to install or
0 to not install. As such, a given Linktest executable may not support all the listed communication APIs. By default all communication APIs are selected for installation; however, such a build rarely succeeds, as most platforms do not support all communication APIs due to a lack of relevant hardware.
The communication time in Linktest is the time from when a message is ready to be sent until it has arrived at the recipient and a receipt has been returned. This is the one-way communication time: the time from the message being ready to be sent until a receipt is received confirming that the message has been successfully delivered. Linktest measures the two-way communication time instead, which is the time from the message being ready to be sent until that message has been sent back and a receipt for it has been issued. If bidirectional testing is used, both communication partners send their identical messages at the same time and timing ends when a partner receives a receipt.
In a gross oversimplification the communication time consists of two parts, the latency and the transit time. The latency is the time from the message being ready to be sent until sending actually begins. During this time, for example, the connection used to transmit the message is initialized. The transit time is the time it then takes the message to get from its origin to its destination and for a receipt to go back to the origin confirming that the message has been successfully received.
For small message sizes the communication time is dominated by the latency. For large message sizes the communication time is dominated by the transit time, which depends on the communication bandwidth. As such, to benchmark communication latency, message sizes as small as possible should be used. Ideally this would be 0, but messages must have a non-zero size, so 1 should be used. To benchmark transit times, and indirectly bandwidth, message sizes as large as feasible should be used. This ensures that the latency plays a vanishing role in the communication time. Why as large as feasible and not as large as possible? As the message size increases, the duration of the benchmark also grows, and message sizes that are too large might make Linktest take too long. This is often the case when testing connections serially.
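The simplified picture above can be put into numbers with a small model. This is only a sketch under stated assumptions: the linear model (latency plus size divided by bandwidth), the function names, and the example latency and bandwidth values are illustrative and are not taken from Linktest or from any measured system.

```python
def communication_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Simplified model: fixed latency plus a size-dependent transit time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_share(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the modelled communication time that is latency."""
    total = communication_time(size_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical link: 1 microsecond latency, 10 GB/s bandwidth.
LATENCY = 1e-6
BANDWIDTH = 10e9

for size in (1, 1024, 2**20, 2**30):
    share = latency_share(size, LATENCY, BANDWIDTH)
    print(f"{size:>12} bytes: latency is {share:.4%} of the total")
```

In this model a 1-byte message is almost entirely latency, while for a 1 GiB message the latency contribution is negligible, which matches the guidance on choosing message sizes.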
The Random Access Memory (RAM) associated with the Central Processing Units (CPUs) of a system. This is usually the main RAM and the default RAM Linktest uses to store its messages. However, the dedicated on-card RAM of NVIDIA Graphics Processing Units (GPUs) can also be used via CUDA. Turning on the option
--use-gpus enables this. Note that Linktest does not keep track of which GPU a task's memory was pinned to; it does not even keep track of which CPU a given Linktest task is executed on. This is the responsibility of whoever executes the Linktest benchmark.
The Random Access Memory (RAM) associated with a Graphics Processing Unit (GPU) on a system. This is usually not the main RAM of the system, which is associated with the Central Processing Units (CPUs). Linktest uses the main RAM by default to store its messages. For NVIDIA GPUs, however, the GPU RAM can also be used to store the Linktest messages via CUDA. Turning on the option
--use-gpus enables this. Pinning Linktest tasks to specific GPUs is required for this. Linktest does not keep track of which GPU a task's memory was pinned to; it does not even keep track of which CPU a given Linktest task is executed on. This is the responsibility of whoever executes the Linktest benchmark.
The time it takes before an action can be executed. For Linktest this is the time it takes between a message being ready to be sent till sending begins.
For the relationship between latency, transit time, and message size see Communication Time.
The message size refers to the size, in bytes, of the messages used by Linktest to benchmark communication. For the relationship between latency, transit time, and message size see Communication Time. Note that many communication APIs only support message sizes up to 2 GiB. For 32-bit MPI implementations the cumulative size of all messages is restricted to less than 2 GiB.
Randomizing Testing Order
Although by default Linktest tests the connection between a given task and all other tasks, results may depend on the order in which the testing is performed. The
--randomize command-line option causes the testing order to be randomly shuffled, which means that consecutive Linktest runs with this option enabled will likely test physical connections in a different order.
Number Of Messages
Linktest benchmarks communications by repeating a communication many times. The number of times the back-and-forth sending of messages is repeated for timing purposes is controlled via the
--num-messages command-line argument. The final returned times are the average time it took the message to be sent back-and-forth.
Number of Warm-Up Messages
Linktest warms up connections by testing them multiple times before timing begins. Essentially the same actions as during timing are performed multiple times beforehand. This is done because connections often need to be initialized first, which means that sending a message the first time often takes longer than sending it a second time shortly afterwards. During the first time, or sometimes the first couple of times, that a message is sent over a network, the network optimizes itself for the transmission of the message, i.e. it becomes primed for this message. As such it often makes sense to include at least one warm-up message before benchmarking a connection. For small message sizes more should be used; 3-5 work well. In Linktest the number of warm-up messages is specified via the
--num-warmup-messages command-line argument.
Ping-pong tests are a standard tool for network operators. They can be thought of as an extension to the
ping command used to test for the accessibility of machines for a given address, which is a ping test. In a ping test a message is sent from an origin to a destination and the time is taken at the origin until a receipt from the destination is received. Ping-pong tests extend this by timing at the origin until the original message is received back again, i.e. the pong in ping-pong testing. In bidirectional testing the sending of messages is done by the origin and destination concurrently, see Bidirectional Testing.
Ping-pong testing is useful to measure network latency and bandwidth. It is also less susceptible to differences in speed in a given direction since only the time it took for the message to go in both directions is recorded.
The central ping-pong kernel of Linktest between ranks A and B is executed in the following way.
(num_warmup_messages = W, num_messages = N and size_messages = S)
A sends W messages of size S to B
B sends W messages of size S to A
A takes the time t1
A sends N messages of size S to B
B sends N messages of size S to A
A takes the time t2 after the last receive finished
A writes the average time (t2 - t1)/(2N) to the SION file
In the matrix shown in Linktest reports this time corresponds to the entry in column A, row B.
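The kernel steps above can be mimicked on a single machine to see where the factor 2N comes from. The following is a minimal sketch and not Linktest's code: it replaces the communication API with a local socket pair, runs B in a thread, and all function names (`recv_exact`, `partner_b`, `pingpong`) are hypothetical.

```python
import socket
import threading
import time


def recv_exact(sock, count):
    """Receive exactly `count` bytes from `sock` and discard them."""
    remaining = count
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        remaining -= len(chunk)


def partner_b(sock, w, n, s):
    """B: receive a batch from A, then send the same number of messages back."""
    payload = bytes(s)
    for batch in (w, n):  # warm-up batch first, then the timed batch
        recv_exact(sock, batch * s)
        for _ in range(batch):
            sock.sendall(payload)


def pingpong(w, n, s):
    """A: run W warm-up and N timed exchanges; return (t2 - t1) / (2N)."""
    a, b = socket.socketpair()
    worker = threading.Thread(target=partner_b, args=(b, w, n, s))
    worker.start()
    payload = bytes(s)

    for _ in range(w):           # A sends W messages of size S to B
        a.sendall(payload)
    recv_exact(a, w * s)         # B sends W messages of size S to A

    t1 = time.perf_counter()     # A takes the time t1
    for _ in range(n):           # A sends N messages of size S to B
        a.sendall(payload)
    recv_exact(a, n * s)         # B sends N messages of size S to A
    t2 = time.perf_counter()     # A takes t2 after the last receive

    worker.join()
    a.close()
    b.close()
    return (t2 - t1) / (2 * n)   # average time per one-way message


print(f"average time per message: {pingpong(w=3, n=100, s=1024):.3e} s")
```

The timed interval covers N messages in each direction, so dividing by 2N gives the average time per one-way message.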
By default Linktest tests as many connections as possible in parallel. This, however, can cause tests to interfere with each other. Such interference is sometimes desired, for example if real-world network performance under a sustained network load is to be tested. In other cases peak performance without the interference of other parallel tests is desired. In this case serial testing is done, in which each connection between a pair of tasks is tested individually. This effectively serializes the test and will cause it to take significantly longer. Serial testing can be turned on in Linktest by using the
--serial-testing command-line option.
Some of the connections tested by Linktest will perform worse than others. By default Linktest retests some of the worst connections serially. This is to determine whether the poor performance is due to conflicts with other parallel connections or with other processes that run in parallel on the same node or CPU. If the times for a serially retested connection improve to expected values, this indicates that there was some type of conflict during the main measurement. If the performance of a connection does not improve as expected after serial retesting, this is a good indication that something might be wrong with that connection.
The number of connections to be serially retested can be controlled via the
--num-slowest command-line argument followed by a positive integer indicating the number of worst connections to serially retest.
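Selecting the worst connections from a table of measured times could look like the following. This is an illustrative sketch only: the function name `slowest_connections`, the nested-list layout of the timing table, and the example values are assumptions and do not reflect how Linktest stores or selects its results.

```python
import heapq


def slowest_connections(times, num_slowest):
    """Return the (first, second) task pairs with the largest measured times.

    `times[i][j]` is assumed to hold the measured time between tasks i and j;
    diagonal entries (a task paired with itself) are ignored.
    """
    entries = [
        (t, i, j)
        for i, row in enumerate(times)
        for j, t in enumerate(row)
        if i != j
    ]
    worst = heapq.nlargest(num_slowest, entries)
    return [(i, j) for _, i, j in worst]


# Made-up timings in seconds for three tasks.
example_times = [
    [0.0, 1.0, 5.0],
    [1.1, 0.0, 2.0],
    [4.0, 2.1, 0.0],
]
print(slowest_connections(example_times, 2))  # [(0, 2), (2, 0)]
```

The pairs returned this way would then be retested one at a time, so that any improvement can be attributed to the absence of parallel traffic.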
Linktest can be used to apply a nearly continuous connection load to stress a network. This is useful to see how stable a network remains under a continuous load.
Linktest can be configured to stress test using two command-line arguments.
--min-iterations followed by a positive integer indicates how many times the main test of Linktest is at least repeated.
--min-runtime followed by an integer indicates the minimum length of time for which Linktest should keep repeating the main test. Linktest only stops repeating the main test once both conditions are satisfied. If only one is specified then Linktest only tests against that one.
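The stopping rule described above can be sketched as a small loop. This is a hedged illustration rather than Linktest's logic: the function names are hypothetical, an unspecified option is represented by `None`, and the runtime is assumed to be given in seconds, which the text above does not state.

```python
import time


def keep_repeating(iterations_done, elapsed, min_iterations=None, min_runtime=None):
    """Return True while any specified minimum has not yet been reached."""
    need_iterations = min_iterations is not None and iterations_done < min_iterations
    need_runtime = min_runtime is not None and elapsed < min_runtime
    return need_iterations or need_runtime


def run_stress(main_test, min_iterations=None, min_runtime=None, clock=time.monotonic):
    """Repeat `main_test` until both specified minimums are satisfied.

    Returns the number of times the main test was executed (at least once).
    """
    start = clock()
    iterations = 0
    while True:
        main_test()
        iterations += 1
        if not keep_repeating(iterations, clock() - start, min_iterations, min_runtime):
            return iterations


# With only a minimum iteration count, the loop stops right after reaching it.
print(run_stress(lambda: None, min_iterations=3))  # 3
```

Requiring both conditions means that a short main test is repeated until the minimum runtime has passed, while a long main test is still repeated the minimum number of times.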
Transit time is the time it takes for an object to go from its origin to its destination. During this travel period the object is said to be in transit. For communication, the transit time is the time the message is in transit.
For the relationship between latency, transit time, and message size see Communication Time.
In the OSI model a transport layer is a conceptual division of the methods and protocols related to the transport of information, generally in terms of bytes. In Linktest it defines the API (the aforementioned methods and protocols) used to communicate data between, or within, systems. It is generally mentioned in conjunction with the communication API that Linktest should test, which is controlled via the
--mode option. It, however, should not be confused with the communication API used for testing. The transport layer is an abstract concept. Linktest uses the communication API for the actual establishment and testing of connections.