# Testing Inter-Node/Core Communication Performance: An AMD EPYC 7742 Case Study
|
|
|
Linktest can not only be used to test connections between computers, or nodes in an HPC setting; it can also be used to benchmark the intra-node connectivity between cores. In this short case study we will have a look at two JURECA-DC cluster nodes, each equipped with two 64-core AMD EPYC 7742 CPUs and sixteen 32 GiB DDR4 RAM sticks clocked at 3.2 GHz. We now wish to benchmark the communication performance between the cores in these nodes.
|
|
|
|
|
|
Linktest is just the tool for this, as it benchmarks communication between tasks, which run on CPU cores. To do so we need to pin the Linktest tasks to physical CPU cores when executing Linktest. Using SLURM's `srun` this can be done with the `` --cpu-bind=map_cpu:`seq -s, 0 127` `` command-line argument. Pinning tasks to physical cores ensures that we only test connectivity between physical cores, and not with logical cores via simultaneous multi-threading. A complete launch command is sketched after the next paragraph.
|
|
|
|
|
|
With the correct pinning we can now test the core-to-core connectivity on a node. For reliable numbers it is, however, imperative to use a sufficient number of messages, including warm-up messages. We chose to use 2000 messages for testing and 200 for warm-up, with a message size of 1 MiB. Testing was performed via ParaStation MPI 5.4.10-1, which uses local memory for transfers when possible.
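For illustration, here is a minimal sketch of a complete launch on one 128-core node, combining the pinning with these test parameters. The `srun` options are standard SLURM; the Linktest executable name and its option spellings below are assumptions for illustration, so check `linktest --help` for the exact flags of your installed version.

```sh
# Sketch: 128 tasks on one node, task i pinned to physical core i so
# that SMT siblings (logical CPUs 128-255) stay unused.
# 2000 messages of 1 MiB each, preceded by 200 warm-up messages.
# NOTE: the linktest option names below are assumed, not verified.
srun --nodes=1 --ntasks=128 \
     --cpu-bind=map_cpu:`seq -s, 0 127` \
     linktest --num-msgs 2000 --num-warmup-msgs 200 --size-msgs $((1024*1024))
```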
|
|
|
|
|
|
The results of this test for two different, identically configured nodes can be seen in this report: [JURECA-DC_AMD-EPYC-7742.pdf](uploads/15b18b5e70bef06406ca25d33e6e8766/JURECA-DC_AMD-EPYC-7742.pdf).
|
|
|
|
|
|
To understand these results we need to understand how AMD EPYC 7742 CPUs are internally structured. These CPUs, first introduced in 2019, are 64-bit 64-core x86 server microprocessors based on the Zen 2 micro-architecture, with logic fabricated using TSMC's 7 nm process and IO fabricated using GlobalFoundries' 14 nm process. They have a base clock speed of 2.25 GHz, which can boost up to 3.4 GHz on a single core. The processors support two-way simultaneous multi-threading, hence the need for the pinning above.
|
|
|
|
|
|
Each CPU is built up of 8 chiplets, also known as Core Complex Dies (CCDs), each of which houses 8 cores split into two groups of 4, known as Core CompleXes (CCXs); the cores of a CCX share their L3 cache. 2 CCDs are further grouped into a quadrant. This structure is very important, as we will see in the results.
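To make this layout concrete, the following small sketch maps a physical core index to its CCX, CCD and quadrant, assuming cores are numbered consecutively within each 64-core CPU, which matches the pinning used above (the script name and the consecutive-numbering assumption are ours, not part of Linktest):

```sh
#!/bin/sh
# core_topology.sh (hypothetical helper): given a physical core index
# on a 2x64-core EPYC 7742 node (cores 0-63 on CPU 0, 64-127 on CPU 1),
# print which socket, quadrant, CCD and CCX it belongs to.
core=$1
cpu=$(( core / 64 ))   # socket 0 or 1
idx=$(( core % 64 ))   # core index within its CPU
echo "cpu=$cpu quadrant=$(( idx / 16 )) ccd=$(( idx / 8 )) ccx=$(( idx / 4 ))"
```

For example, `sh core_topology.sh 5` and `sh core_topology.sh 6` report the same CCX (cores 4-7 share an L3 cache), while cores 3 and 4 fall into neighboring CCXs despite being adjacent indices.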
|
|
|
|
|
|
Let us begin by looking at the large-scale structures in the indexed image. Four blocks are easily identifiable: 2 blue blocks with some purple on the main diagonal, and 2 red off-diagonal blocks. Recall that the nodes we are testing consist of two 64-core CPUs. The blue blocks show intra-CPU core timings, while the red blocks show the inter-CPU core timings. The inter-CPU timings are slower, as communication must occur between the CPUs via the motherboard.
|
|
|
|
|
|
Now let us look at the finer details. The purple blocks along the diagonal group the CPU cores into groups of 4 and correspond exactly to a CCX. From this we learn that communication is fastest inside a CCX, where the L3 cache is shared. We see no CCD-related structure, which would appear as blocks 8 cores wide, but we do see quadrant-related structures. Communication inside a quadrant is generally quite fast, although communication to other quadrants can sometimes be faster; this suggests that there may be some interference when communicating within one's own quadrant. Communication between quadrants, however, can also be substantially slower, as with the dark blue blocks.
|
|
|
|
|
|
We see the same behavior for communication between the quadrants of the two different CPUs, where certain quadrants communicate more slowly. It seems as if 2 quadrants have consistently worse inter-quadrant and inter-CPU performance than the other 2.
|
|
|
|
|
|
The second page of the report shows the same Linktest run for a second, identically configured node, and we see the same results, demonstrating that the results are reproducible across different CPUs from the same family.
|
|
|
|
|
|
This information is useful from an optimization standpoint, as it suggests that communication is best kept within a CCX. If that is not possible, certain quadrant-to-quadrant communications are faster than others.
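If you want to verify the CCX groupings on a live node rather than assume consecutive numbering, the cache topology can be read out directly; cores that report the same L3 index share an L3 cache and therefore form one CCX. A minimal sketch using `lscpu` (column availability may vary with the installed util-linux version):

```sh
# The CACHE column has the form L1d:L1i:L2:L3; logical CPUs with the
# same L3 index belong to the same CCX.
lscpu --extended=CPU,CORE,SOCKET,CACHE
```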
|
|
|
|
|
|
In conclusion, this case study demonstrates that Linktest can be used to benchmark inter- and intra-CPU communication between cores. Benchmarking CPUs in this fashion can help optimize software for specific architectures by providing the necessary information on how best to communicate within a CPU.
|
|