Use a proper PDF Viewer!!!
Before looking at the reports on this page, make sure you are using a proper PDF viewer, such as the newest version of Adobe Acrobat Reader. As of the time of writing (10.10.2021) many PDF viewers, especially those bundled with web browsers (Firefox, Chrome, Edge, etc.), interpolate pixels linearly to screen resolution when displaying raster images (think PNG or JPG). This behaviour is suboptimal for the colour-indexed images in our reports, as it makes them difficult to interpret. Viewers that use nearest-neighbour interpolation reproduce the image faithfully, so the reports here are displayed correctly.
Why Warm-Up Messages Should Be Used
Here is an example of why warm-up messages are important. This example shows a UCP test on JUSUF. In it we see atrociously poor timings along the right half of the anti-diagonal. This is because these connections are tested first, and the UCP workers are only primed when the first message is sent; as a result, the first time a connection is tested, the corresponding connection and worker still need to be initialized. This effect averages out if more messages are averaged over or if the message size is larger. Effectively, the right half of the anti-diagonal shows the latency of finishing worker initialization and setting up connections.
In the attached "Serial" report we see that this effect can also appear if connections are tested serially. We see, though, that the first tested connection, between 0 and 63, is the slowest. This is because the serially tested later anti-diagonal connections have likely finished establishing their workers by that point; their only remaining delay is establishing the connections to the relevant controllers.
In the other attached report warm-up messages were included, in which case the effect vanishes. In it we can see the underlying structure of the full fat-tree topology of the JUSUF interconnect.
This demonstrates why using warm-up messages is important. Using just 3 already improves results dramatically.
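The averaging effect can be sketched with a small simulation. Here the first message on a connection is assumed to carry a one-off setup cost (the numbers are purely illustrative, not measured values); discarding a few warm-up messages before averaging removes this bias:

```python
# Illustrative simulation of how warm-up messages remove one-off
# connection-setup costs from averaged timings (numbers are made up).

def measure(n_messages, n_warmup=0, transit=1.0, setup=100.0):
    """Average per-message time; only the first message pays the setup cost."""
    times = [transit + (setup if i == 0 else 0.0)
             for i in range(n_warmup + n_messages)]
    timed = times[n_warmup:]  # warm-up messages are sent but not timed
    return sum(timed) / len(timed)

# Without warm-up the setup cost inflates the average of 4 messages
# from 1.0 to 26.0; with 3 warm-up messages the average is the true 1.0.
print(measure(n_messages=4))
print(measure(n_messages=4, n_warmup=3))
```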
linktest_jusuf_ucp_64nx1c_512B_1Msg.pdf
linktest_jusuf_ucp_64nx1c_512B_1Msg_Serial.pdf
linktest_jusuf_ucp_64nx1c_512B_3Warm.pdf
Testing Inter- & Intra-CPU Communication Performance For A Single Node: An AMD EPYC 7742 Case Study
LinkTest can not only be used to test connections between computers, or nodes in an HPC setting; it can also be used to benchmark the intra-node connectivity between cores. In this short case study we will have a look at two JURECA-DC cluster nodes, each equipped with two 64-core AMD EPYC 7742 CPUs and sixteen 32 GiB DDR4 RAM sticks clocked at 3.2 GHz. We now wish to benchmark the communication performance between the cores in these nodes.
LinkTest is just the tool for this, as it can benchmark communication between tasks, which run on CPU cores. For that we need to pin LinkTest tasks to physical CPU cores when executing LinkTest. Using SLURM's srun this can be done with the --cpu-bind=map_cpu:`seq -s, 0 127` command-line argument. Pinning tasks to physical cores ensures that we only test connectivity between physical cores and not with logical cores via simultaneous multi-threading.
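If seq is unavailable, the same pin list can be generated programmatically; a minimal Python sketch, where 128 is assumed to be the node's physical core count:

```python
# Build the same pin list as `seq -s, 0 127` for the srun argument
# --cpu-bind=map_cpu:<list>; 128 physical cores are assumed here.
cpu_map = ",".join(str(core) for core in range(128))
cpu_bind_arg = "--cpu-bind=map_cpu:" + cpu_map
print(cpu_bind_arg)
```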
With the correct pinning we can now test the core-to-core connectivity on a node. For reliable numbers it is, however, imperative to use a sufficient number of messages, including warm-up messages. We chose to use 2000 messages for testing and 200 for warm-up, with a message size of 1 MiB. Testing was performed via ParaStation MPI 5.4.10-1, which uses local memory for transfers when possible.
The results of this test for two different, identically configured nodes can be seen in this report: JURECA-DC_AMD-EPYC-7742.pdf. The upcoming paragraphs are based on this report. A newer test using unidirectional MPI and taking the median across 64 CPUs yields less noisy results: JURECA-DC_AMD-EPYC-7742_MedianCorrected.pdf. Note that for this report the central purple blocks along the diagonal come from a single test sending 8192 messages. This report is shown below:
To understand these results we need to understand how AMD EPYC 7742 CPUs are internally built up. These CPUs are 64-bit 64-core x86 server microprocessors based on the Zen 2 micro-architecture, with the logic dies fabricated using the TSMC 7 nm process, while the IO die is fabricated using GlobalFoundries' 14 nm process. They were first introduced in 2019. They have a base clock speed of 2.25 GHz, which can boost up to 3.4 GHz on a single core. The processors support up to two-way simultaneous multi-threading, hence the need for pinning above.
Each CPU is built up of 8 chiplets, known as Core Complex Dies (CCDs). Each CCD houses 8 cores split into two groups of 4, known as Core CompleXes (CCXs), whose cores share a 16 MiB (4 times 4 MiB) L3 cache. 2 CCDs are further abstracted as a quadrant. This structure is very important, as we will see in the results.
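Assuming cores are numbered consecutively within each CPU (an assumption that matches the pinning used here, but core numbering is system-dependent), this hierarchy reduces to a simple index calculation:

```python
# Map a linear core index (0-127, two 64-core EPYC 7742 CPUs) onto the
# hierarchy described above; assumes consecutive core numbering.
def core_location(core):
    return {
        "socket":   core // 64,  # which of the two CPUs
        "quadrant": core // 16,  # 4 quadrants per CPU, 16 cores each
        "ccd":      core // 8,   # 2 CCDs per quadrant, 8 cores each
        "ccx":      core // 4,   # 2 CCXs per CCD, 4 cores each
    }

# Cores 0 and 3 share a CCX (and thus an L3 cache), cores 7 and 8 sit
# on neighbouring CCDs, and cores 0 and 64 are on different sockets.
print(core_location(0), core_location(64))
```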
Let us begin by looking at the large-scale structures in the indexed image. Four blocks are easily identifiable: two blue blocks with some purple on the main diagonal and two red off-diagonal blocks. Recall that the nodes we are testing consist of two 64-core CPUs. The blue blocks show intra-CPU core timings, while the red blocks show the inter-CPU core timings. The inter-CPU timings are slower as communication must occur between the CPUs via the motherboard.
Now let us look at the finer details. The purple blocks along the diagonal group the CPU cores into groups of 4 and correspond exactly to a CCX. From this we learn that communication is fastest inside a CCX, where the L3 cache is shared. We see no CCD-related structure (blocks 8 cores wide), but we do see quadrant-related structures. Communication inside a quadrant is generally quite fast; however, there is always one other quadrant to which communication is faster, see the dark blue blocks. It is theorized that this has to do with the layout of the IO die at the center of the EPYC CPU, which, according to die shots, houses two connected memory controllers that each connect two quadrants. When communicating from one CCD to the other CCD in the same quadrant, the message is sent to the IO die, buffered, and sent over the same connection back to the other CCD in the quadrant. When communicating from a CCD in one quadrant to a CCD in another quadrant attached to the same memory controller, the memory controller does not need to buffer the message but can stream it to the other CCD, which is faster. When communicating between CCDs attached to different memory controllers, the latency introduced by going from one memory controller to the other seems comparable to that of buffering the message, at least for 1 MiB messages.
We see the same behaviour for communication between the quadrants of the two different CPUs, where certain quadrants communicate faster. It seems as if two quadrants have consistently worse inter-CPU performance than the other two. This may again be related to the IO die.
The second page of the report shows the same LinkTest run for a different, identically configured node, and we see the same results, demonstrating that the results are reproducible using different CPUs from the same family.
This information is useful from an optimization standpoint, as it suggests that communication should best be kept within a CCX. If that is not possible, certain quadrant-to-quadrant communications are faster than others.
In conclusion, this case study demonstrates that LinkTest can be used to benchmark inter- and intra-CPU communication between cores. Benchmarking CPUs in such a fashion can help optimize software for certain architectures by providing the necessary information on how best to communicate within a CPU.
Difference Between Uni-, Bi- & Semi-Directional Reports
The LinkTest testing method defines the results that you will get. It is up to you, the user, to interpret these results and to understand how the testing methodology affects results.
When the network topology between two nodes is isotropic, a fancy way of saying that the two nodes can communicate at the same speed in both directions, the test results when using uni-directional and semi-directional testing do not differ much. Bi-directional testing results may differ, but we will get back to that later.
The reason why the uni- and semi-directional results do not differ by much lies in how the connections are tested. In the uni-directional case the time-measuring node sends a message to its partner and waits for a receipt message of zero length before stopping timing. To average out random variations in network performance, and to reduce the effect of the inherent measurement delays caused by starting and stopping the stopwatch, the sending of the message and the waiting for the zero-length receipt are looped over, and the average time per iteration is computed at the end.
Semi-directional results are generated in a similar fashion, the difference being that, instead of a zero-length receipt, the original message is returned. Additionally, the measured time is divided by two to get a better estimate of the actual message transit time, under the assumption that the two nodes can communicate with each other at the same speed.
This assumption is the problem. If the two nodes cannot communicate with each other at the same speed, i.e. the network topology between the two nodes is anisotropic, then the uni-directional measurement will still approximately measure the correct time for one of the nodes to send a message to its partner, assuming that sending the zero-length receipt takes a negligible amount of time. Note that the measured times for the two nodes will now differ. The semi-directional result, however, will again be similar for both nodes, lying somewhere in between the two uni-directional results. This is because the semi-directional result corresponds to the average result between the two nodes.
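This relationship can be written down directly. With illustrative transit times (made-up numbers, zero-length receipts assumed free) of 1 unit from node A to node B and 3 units from B to A:

```python
# Idealized timings on an anisotropic link (illustrative numbers only).
t_ab, t_ba = 1.0, 3.0  # A->B takes 1 unit, B->A takes 3 units

# Uni-directional: each node times its own send; the zero-length
# receipt is assumed free, so the two nodes report different times.
uni_a = t_ab
uni_b = t_ba

# Semi-directional: the full message is returned, so the round trip
# is timed and halved, giving one value in between the uni results.
semi = (t_ab + t_ba) / 2

assert uni_a < semi < uni_b
print(uni_a, semi, uni_b)
```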
The question is then: when should uni-directional and when should semi-directional testing be used? Uni-directional testing should be used if it is expected that the connection speed between two nodes depends on the communication direction. If this is not the case, semi-directional testing will theoretically give a tighter upper-bound estimate on the actual uni-directional communication time, as waiting for the zero-length receipt is not factored into the time. Semi-directional testing should also be used if it is of interest to estimate the communication performance between two nodes that will be communicating with each other in equal parts.
Bidirectional testing is a different type of test entirely. Uni- and semi-directional testing have the partner node wait until it has received the message before sending the receipt back. In bidirectional testing both nodes send a message to their partner at the same time, and timing stops when the message is received. Recall that the sending of messages occurs in a loop; if this loop has only one iteration, special effects can be observed. The node whose sending to its partner is slower will, counter-intuitively, measure lower times. This is because sending is a non-blocking communication operation: when sending a message you do not wait for it to arrive at its destination before continuing with work. When receiving, however, you do wait for the message. This causes the node that is slower in sending to receive its partner's message sooner than its partner does, which in turn causes it to measure a lower time. If the testing loop has multiple iterations this effect averages out, as in the next iteration the node with the slower sending needs to wait disproportionately longer for the message from its partner.
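The single-iteration artifact can be sketched as a tiny event calculation (illustrative numbers, not measurements): because sends are non-blocking, each node's measured time is simply when its partner's message arrives.

```python
# Illustrative single-iteration bidirectional exchange; both nodes send
# at t=0 and stop timing when their partner's message arrives.
t_ab = 3.0  # transit time of A's message to B (A is the slower sender)
t_ba = 1.0  # transit time of B's message to A

time_a = t_ba  # A stops its stopwatch when B's (fast) message arrives
time_b = t_ab  # B stops its stopwatch when A's (slow) message arrives

# Counter-intuitively, the slower sender A measures the lower time.
assert time_a < time_b
print(time_a, time_b)
```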
What this means for bidirectional testing is that it should only be used when bidirectional communication is expected to cause congestion that slows down both messages, or to test whether this is the case. Bidirectional testing cannot pick up on network-topology anisotropy between nodes; it can, however, measure congestion effects when communicating in both directions at the same time.
Let us now illustrate this with an example. We have a network with two nodes, Mew and Mewtwo, and an auxiliary node, the Pokémon Center. We now conduct uni-, semi- and bi-directional tests between the nodes and find that the results between the nodes for all three tests are very similar. Here are the reports.
Now Mewtwo catches Pokérus, which causes it to have to send messages with a size greater than 1 kiB via Bill's PC, which is only reachable via the Pokémon Center. All of a sudden, messages from Mewtwo to Mew take twice as long to arrive. This effect can be measured. Redoing the three tests, we find very different results. Here are the reports.
The changes make sense when we think of the discussion above. The network topology is now anisotropic between Mew and Mewtwo, which causes the unidirectional connection from Mewtwo to Mew to be slower. This has knock-on effects on the semi- and bi-directional tests, causing the measured times to be larger.
Pokémon and its derivatives are a trademark of Nintendo; however, we believe this constitutes fair use. Note that we are aware that Pokérus is generally a beneficial virus for a Pokémon to catch; in this case, though, we needed a negative status ailment, and Pokérus is the only official (computer) virus with a name.
Here are the SION files: Pokemon_Unidirectional.sion Pokemon_Semidirectional.sion Pokemon_Bidirectional.sion Pokemon_Unidirectional_Pokerus.sion Pokemon_Semidirectional_Pokerus.sion Pokemon_Bidirectional_Pokerus.sion