[[_TOC_]]
# My SION files are HUGE! What can I do?
See [SION File Defragmentation](Usage#sion-file-defragmentation). A defragmented SION file can still be loaded into Python, see [Linktest Python Reader](Linktest-Python-Reader), and can still be used to generate reports, see [Linktest Report](Linktest-Report). After defragmentation, the file can be compressed further with any lossless compression tool. The compressed file, however, can no longer be loaded into Python, so no reports can be generated from it until it has been decompressed again.
|
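Any lossless tool works for this step, for example `gzip`, `xz` or `zstd` on the command line. If you prefer to stay in Python, here is a minimal sketch that uses the standard-library `gzip` module. It compresses a defragmented SION file and restores it before the file is loaded again. `compress_file` and `decompress_file` are hypothetical helper names and are not part of Linktest.

```python
import gzip
import shutil
from pathlib import Path


def compress_file(src: Path, dst: Path) -> None:
    """Losslessly compress the file at src into a gzip file at dst."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # stream in chunks, large files are fine


def decompress_file(src: Path, dst: Path) -> None:
    """Restore the original bytes from the gzip file at src into dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

For example, call `compress_file(Path("linktest.sion"), Path("linktest.sion.gz"))` before archiving. Call `decompress_file(Path("linktest.sion.gz"), Path("linktest.sion"))` before loading the file with the Linktest Python Reader. Both file names are placeholders.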
# I am running a latency test and the first row in my timings matrix is much slower than the others. What can I do?
TLDR: You likely forgot to use warm-up messages.
|
Whether this is a problem depends on what you want to measure. To conserve resources, most components of a computer operate on an on-demand basis. Connections are only established, and the relevant devices only initialized, the first time they are used. In a default Linktest run without randomization of the test order, the first row of the timing matrix corresponds to the first connections that were tested. For these, the connections and the associated hardware, such as the required interconnects, first had to be initialized, which takes time. In the default scenario, the other rows of the timing matrix are measured later, when that hardware is already initialized. Their timings therefore only include the initialization of the connection itself.
|
Consequently, you can approximate the latency caused by the start-up of the relevant devices as follows. Take the first row and, from each of its elements, subtract the average of the corresponding column, computed without that column's first element.
|
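The estimate described above can be sketched in Python as follows. The sketch assumes the timing matrix is already available as a list of rows, for example after reading it with the Linktest Python Reader (that step is not shown). `startup_latency_estimate` is a hypothetical helper. Treating diagonal entries as unmeasured is an assumption about the matrix layout, which the `skip_diagonal` flag lets you switch off. The numbers in the example matrix are made up.

```python
from statistics import mean
from typing import List, Optional


def startup_latency_estimate(
    timings: List[List[float]], skip_diagonal: bool = True
) -> List[Optional[float]]:
    """Approximate the start-up latency contained in the first row.

    For every column j, the mean of that column without its first element
    is subtracted from timings[0][j]. With skip_diagonal=True, entries
    timings[i][i] are assumed to hold no measurement and are ignored, so
    the estimate for column 0 is None (needs at least three rows).
    """
    estimates: List[Optional[float]] = []
    for j in range(len(timings[0])):
        if skip_diagonal and j == 0:
            estimates.append(None)  # timings[0][0] lies on the diagonal
            continue
        rest = [
            timings[i][j]
            for i in range(1, len(timings))  # skip the first row
            if not (skip_diagonal and i == j)
        ]
        estimates.append(timings[0][j] - mean(rest))
    return estimates


timings = [
    [0.0, 9.0, 8.0],
    [2.0, 0.0, 3.0],
    [4.0, 1.0, 0.0],
]
print(startup_latency_estimate(timings))  # [None, 8.0, 5.0]
```

Large positive entries in the result point to start-up overhead in the corresponding first-row measurements.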
|
If you do not want your timings to be skewed by these start-up costs of the connections and associated hardware, you have two options:
|
1. If you use the MPI communication API, turn on all-to-all testing. All-to-all tests are run before and after the main test. Because all-to-all testing uses a subset of the connections, it is likely that all relevant hardware, as well as some of the connections to be tested, is already initialized when the main test starts. This, however, is not a perfect solution, because the remaining connections are still initialized during the main test.
|
2. Use warm-up messages. With a non-zero number of warm-up messages, each connection is exercised before it is timed. This also applies to all-to-all testing. As a result, timings are only taken over initialized connections. This fully removes the slower first row described above and generally improves timings across the board.
|
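The toy model below illustrates why warm-up messages help. It shows only the principle and is not Linktest's implementation. `SimulatedConnection` and `average_latency` are hypothetical, and the costs are arbitrary units rather than measured times. The simulated connection pays a one-off set-up cost the first time it is used. Warm-up messages are sent but excluded from the timing.

```python
class SimulatedConnection:
    """Toy connection that pays a one-off set-up cost on its first use."""

    def __init__(self, setup_cost: float, message_cost: float) -> None:
        self.setup_cost = setup_cost
        self.message_cost = message_cost
        self.initialized = False

    def send(self) -> float:
        """Send one message and return its simulated latency."""
        cost = self.message_cost
        if not self.initialized:
            cost += self.setup_cost  # lazy, on-demand initialization
            self.initialized = True
        return cost


def average_latency(conn: SimulatedConnection, num_msgs: int, num_warmup: int) -> float:
    """Send num_warmup untimed messages, then average num_msgs timed ones."""
    for _ in range(num_warmup):
        conn.send()  # result discarded: warm-up messages are not timed
    timings = [conn.send() for _ in range(num_msgs)]
    return sum(timings) / len(timings)


print(average_latency(SimulatedConnection(100.0, 1.0), num_msgs=10, num_warmup=0))  # 11.0
print(average_latency(SimulatedConnection(100.0, 1.0), num_msgs=10, num_warmup=1))  # 1.0
```

With a single warm-up message, the one-off set-up cost is paid before timing starts. The average therefore drops from 11.0 to the per-message cost of 1.0.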
|
Note that Linktest does not have the ability to explicitly pre-initialise hardware or connections.
# My Linktest timing matrix has a checkerboard pattern when running multiple tasks per node. What can I do?
TLDR: Change your process pinning.
|
This is an artifact of your process pinning. Because of the way modern CPUs are constructed, certain CPU cores have faster access to certain hardware devices, such as network adapters, than other cores do. Those cores therefore also have faster connections to the CPUs of other nodes. This commonly shows up as a checkerboard pattern in the timing matrix. The pattern can usually be avoided by reorganizing the rows and columns, which can be achieved by changing the process pinning when Linktest is executed. For more information on how to do this, see the documentation of the tool you use to execute Linktest in parallel, for example `mpiexec` or `srun`.
|
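A toy matrix shows why reorganizing rows and columns removes the checkerboard. The sketch below is an illustration, not part of Linktest, and its numbers are hypothetical. It assumes that odd ranks are pinned to cores far from the network adapter and pay a penalty per endpoint. Listing the even ranks first and the odd ranks second turns the alternating pattern into contiguous blocks. Changing the pinning achieves a comparable grouping directly at run time.

```python
from typing import List

Matrix = List[List[float]]


def toy_timings(n: int, base: float = 1.0, penalty: float = 0.5) -> Matrix:
    """Toy n x n timing matrix with a checkerboard-like pattern.

    Odd ranks are assumed to sit on cores far from the network adapter and
    add `penalty` per endpoint; the diagonal is left at 0.
    """
    return [
        [0.0 if i == j else base + penalty * (i % 2 + j % 2) for j in range(n)]
        for i in range(n)
    ]


def reorder(matrix: Matrix, order: List[int]) -> Matrix:
    """Permute rows and columns of `matrix` with the same rank order."""
    return [[matrix[i][j] for j in order] for i in order]


n = 6
timings = toy_timings(n)
by_parity = [r for r in range(n) if r % 2 == 0] + [r for r in range(n) if r % 2 == 1]
for row in reorder(timings, by_parity):  # blocks instead of a checkerboard
    print(" ".join(f"{t:4.1f}" for t in row))
```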