... | ... | @@ -29,9 +29,9 @@ Each CPU is built up of 8 CPU chiplets, also known as a Core Complex Die (CCD), |
|
|
|
|
|
Let us begin by looking at the large-scale structures in the indexed image. Four blocks are easily identifiable: two blue blocks with some purple on the main diagonal and two red off-diagonal blocks. Recall that the nodes we are testing consist of two 64-core CPUs. The blue blocks show the intra-CPU core timings, while the red blocks show the inter-CPU core timings. The inter-CPU core timings are slower, as communication must occur between the CPUs via the motherboard.
|
|
|
|
|
|
Now let us look at the finer details. The purple blocks along the diagonal group the CPU cores into groups of 4 and correspond exactly to a CCX. From this we learn that communication is fastest inside a CCX, where the L3 cache is shared. We see no CCD-related structure, which would appear as blocks 8 cores wide, but we do see quadrant-related structures. Communication inside a quadrant is generally quite fast, although communication to other quadrants might be faster; this suggests that there might be some interference when communicating within one's own quadrant. Communication between quadrants, however, can also be substantially slower, as with the dark blue blocks.
|
|
|
Now let us look at the finer details. The purple blocks along the diagonal group the CPU cores into groups of 4 and correspond exactly to a CCX. From this we learn that communication is fastest inside a CCX, where the L3 cache is shared. We see no CCD-related structure, which would appear as blocks 8 cores wide, but we do see quadrant-related structures. Communication inside a quadrant is generally quite fast; however, for every quadrant there is another quadrant that it can communicate with faster, see the dark blue blocks. It is theorized that this has to do with the layout of the IO die at the center of the EPYC CPU, which, according to die shots, houses two connected memory controllers, each of which connects two quadrants. When communicating from one CCD to the other CCD in the same quadrant, the message is sent to the IO die, buffered, and sent back over the same connection to the other CCD. When communicating from a CCD in one quadrant to a CCD in another quadrant attached to the same memory controller, the memory controller does not need to buffer the message but can stream it to the other CCD, which is faster. When communicating between CCDs attached to different memory controllers, the latency introduced by going from one memory controller to the other seems comparable to that of buffering the message, for 1 MiB messages.
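
To make the block sizes discussed above concrete, the following minimal sketch maps a pair of core indices to the smallest hardware domain they share. It assumes that cores are numbered contiguously, so that cores 0-3 form the first CCX, 0-7 the first CCD, 0-15 the first quadrant, and 0-63 the first CPU; the actual numbering on a given node may differ. The function name `shared_domain` is hypothetical and not part of Linktest.

```python
# Domain sizes in cores, taken from the topology described in the text:
# 4 cores per CCX, 8 per CCD, 16 per quadrant (2 CCDs), 64 per CPU.
DOMAINS = (
    ("CCX", 4),
    ("CCD", 8),
    ("quadrant", 16),
    ("CPU", 64),
)


def shared_domain(core_a: int, core_b: int) -> str:
    """Return the smallest domain containing both cores.

    Assumes contiguous core numbering (an assumption, not verified
    against the real node layout).
    """
    if core_a == core_b:
        return "core"
    for name, size in DOMAINS:
        # Two cores share a domain if integer division by the domain
        # size gives the same domain index for both.
        if core_a // size == core_b // size:
            return name
    # Cores on different CPUs only share the node itself.
    return "node"


if __name__ == "__main__":
    for pair in [(0, 3), (0, 4), (0, 8), (0, 16), (0, 64)]:
        print(pair, "->", shared_domain(*pair))
```

Under this assumed numbering, the purple diagonal blocks correspond to pairs classified as `CCX`, the blue blocks to pairs within one `CPU`, and the red off-diagonal blocks to pairs classified as `node`.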
|
|
|
|
|
|
We see the same behavior for communication between the quadrants of the two different CPUs, where certain quadrants communicate slower. It seems as if two quadrants have consistently worse inter-quadrant and inter-CPU performance than the other two.
|
|
|
We see the same behavior for communication between the quadrants of the two different CPUs, where certain quadrants communicate faster. It seems as if two quadrants have consistently worse inter-CPU performance than the other two. This may again be related to the IO die.
|
|
|
|
|
|
The second page of the report shows the same Linktest run for a different, identically configured node. We see the same results, demonstrating that they are reproducible across different CPUs from the same family.
|
|
|
|
... | ... | |