This demonstrates why using warm-up messages is important. Using just 3 already helps.

[linktest_jusuf_ucp_64nx1c_512B_3Warm.pdf](https://gitlab.jsc.fz-juelich.de/cstao-public/linktest/uploads/5fb8f26d40d4de586e8407313663d462/linktest_jusuf_ucp_64nx1c_512B_3Warm.pdf)

# Testing Inter- & Intra-CPU Communication Performance For A Single Node: An AMD EPYC 7742 Case Study

Linktest can not only be used to test connections between computers, or nodes in an HPC setting, it can also be used to benchmark the intra-node connectivity between cores. In this short case study we take a look at two JURECA-DC cluster nodes, each equipped with two 64-core AMD EPYC 7742 CPUs and sixteen 32 GiB DDR4 RAM sticks clocked at 3.2 GHz. We wish to benchmark the communication performance between the cores in these nodes.

Linktest is just the tool for this, as it can benchmark communication between tasks, which run on CPU cores. For that we need to pin the Linktest tasks to physical CPU cores when executing Linktest. Using SLURM's `srun` this can be done with the `` --cpu-bind=map_cpu:`seq -s, 0 127` `` command-line argument, as sketched below. Pinning tasks to physical cores ensures that we only test connectivity between physical cores and not with logical cores via simultaneous multi-threading.
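
For illustration, a minimal sketch of such an invocation follows; the `linktest` binary name and the single-node task count are our placeholders, only the pinning argument comes from above:

```sh
# Illustrative sketch: run 128 Linktest tasks on a single node, pinning
# task i to physical core i so that SMT siblings are never used.
# "linktest" is a placeholder binary name; consult your installation
# for the exact executable and its options.
srun --nodes=1 --ntasks=128 --cpu-bind=map_cpu:$(seq -s, 0 127) linktest
```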

The results of this test for two different identically configured nodes can be seen below.

To understand these results we need to understand how AMD EPYC 7742 CPUs are internally built up. These CPUs are 64-bit 64-core x86 server microprocessors based on the Zen 2 micro-architecture, with logic fabricated using the TSMC 7 nm process and IO fabricated using GlobalFoundries' 14 nm process. They were first introduced in 2019. They have a base clock speed of 2.25 GHz, which can boost up to 3.4 GHz on a single core. The processors support up to two-way simultaneous multi-threading, hence the need for pinning above.

Each CPU is built up of 8 CPU chiplets, known as Core Complex Dies (CCDs), each of which houses 8 cores split into two groups of 4, known as Core CompleXes (CCX), whose cores share their L3 cache. 2 CCDs are further abstracted as a quadrant. This structure is very important, as we will see in the results; with the consecutive core numbering used for the pinning above it reduces to simple index arithmetic, as sketched below.
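
Assuming cores are numbered consecutively through this hierarchy (an assumption of ours, matching the pinning above), the position of a core follows from integer division:

```sh
# Illustrative sketch, assuming consecutive core numbering through the
# hierarchy: 4 cores per CCX, 8 per CCD (2 CCX), 16 per quadrant (2 CCD)
# and 64 per CPU (8 CCD).
core=42
echo "core $core: CCX $((core / 4)), CCD $((core / 8)), quadrant $((core / 16)), CPU $((core / 64))"
```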

Let us begin by looking at the large-scale structures in the indexed image. Four blocks are easily identifiable: two blue blocks with some purple on the main diagonal, and two red off-diagonal blocks. Recall that the nodes we are testing consist of two 64-core CPUs. The blue blocks show intra-CPU core timings, while the red blocks show inter-CPU core timings. The inter-CPU timings are slower because communication must occur between the CPUs via the motherboard. With the assumed numbering, the block a given task pair falls into can be checked as sketched below.
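
Continuing with the same assumed numbering (cores 0-63 on the first CPU, 64-127 on the second), a pair of pinned tasks lands in a blue or red block depending on whether both core indices lie on the same 64-core CPU:

```sh
# Illustrative sketch: classify a pair of pinned tasks as intra- or
# inter-CPU, assuming cores 0-63 sit on the first CPU and 64-127 on
# the second.
i=10 j=100
if [ $((i / 64)) -eq $((j / 64)) ]; then
  echo "tasks $i and $j: intra-CPU (blue block)"
else
  echo "tasks $i and $j: inter-CPU (red block)"
fi
```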