- My SION files are HUGE! What can I do?
- In LinkTest Report (the python tool) I cannot read the indexed-image tick labels. Can I increase the font size?
- LinkTest Report (the python tool) takes too long. Can it go faster?
- The colourbar in the report extends outside its bounding box! How can I fix this?
- What do the weird unit prefixes, like ki, Mi and Gi, mean?
- I am running a latency test and the first row in my timings matrix is much slower than the others. What can I do?
- My LinkTest timing matrix has a checkerboard pattern when running multiple tasks per node. What can I do?
- How can I generate an animated GIF of multiple indexed-images from the Python reports, e.g. for presentations?
My SION files are HUGE! What can I do?
If you write your SION files in parallel, the resulting files are fragmented; in that case see SION File Defragmentation. Otherwise, only lossless compression can reduce the file size further.
Note that you can still load defragmented SION files into Python, see LinkTest Python Reader, and use them to generate reports, see LinkTest Report. After defragmentation, SION files can be compressed further using any lossless compression tool. The resulting compressed file can no longer be loaded into Python, so reports cannot be generated from it unless it is decompressed first.
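As a minimal sketch of the lossless compression step, the following uses gzip from the Python standard library. The helper names `compress_file` and `decompress_file` are made up for illustration and are not part of LinkTest; any other lossless tool would serve equally well.

```python
import gzip
import shutil
from pathlib import Path


def compress_file(src: Path, dst: Path) -> None:
    """Losslessly compress the file src into the gzip file dst."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def decompress_file(src: Path, dst: Path) -> None:
    """Restore the original bytes from the gzip file src into dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

A call such as `compress_file(Path("linktest.sion"), Path("linktest.sion.gz"))` (hypothetical file names) archives a file, and `decompress_file` has to be run on the archive before the data can be loaded into Python again.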
In LinkTest Report (the python tool) I cannot read the indexed-image tick labels. Can I increase the font size?
You may be able to increase the font size indirectly. The font size is limited by two factors: the number of ticks, which limits the vertical height available to each tick label, and the maximum length of the tick labels. See the --downsampling_factor_matrix_ticks option in the LinkTest Report options to learn how to reduce the number of ticks plotted, which increases the vertical space allocated to each tick label. See the --domain option in the LinkTest Report options to learn how domains, or any other string, can be removed from the tick labels. Shorter tick labels allow larger font sizes for a given maximum tick-label width.
LinkTest Report (the python tool) takes too long. Can it go faster?
Probably. Here are a few things that will speed up report generation:
- Defragment the SION file before using it to generate a report, see SION File Defragmentation. This speeds up loading the data into Python. However, if you only plan to generate one report this is likely not worth it, as the time gained in generating the report is lost in defragmenting the SION file.
- Use the --downsampling_factor_matrix_ticks option, see the LinkTest Report options. Plotting tick labels in MatPlotLib is very slow, so reducing the number of tick labels also speeds up report generation. As an added bonus the tick labels may become larger, making them easier to read.
- If you tend to cancel seemingly hanging processes early because there is no command-line output, use the verbose option to see timing information for the individual segments of the report generation.
- Use a newer version of Python or MatPlotLib. Although the report tool was originally developed for Python 3.8.5 and MatPlotLib 3.3.1, upgrading to MatPlotLib 3.3.4 sped up a 2-minute run using a defragmented SION file by approximately 15%. Upgrading to Python 3.9.0 cut the time to just above 1 minute. The bottleneck is mostly the slow MatPlotLib back end for generating plots; the back ends are optimized for quality, not performance. Profiling indicates that for larger SION files, 500 MiB and above after defragmentation, the MatPlotLib back end accounts for about 80% of the compute time of the report.
- Use the supplied pingponganalysis tools. These create PostScript files that can be converted to PDF. Generating a PDF report comparable to the 2-minute report mentioned above takes only about 5 to 10 seconds. Please note that the pingponganalysis tools are only kept up-to-date with the current version of LinkTest.
- Read the SION files directly into Python and inspect the data there using the LinkTest Python Reader. This is no substitute for a nice, easy-to-read report, but gives you the flexibility to look at the data in more depth or to produce figures that better fit your needs.
The colourbar in the report extends outside its bounding box! How can I fix this?
TLDR: Update MatPlotLib and Python.
This is due to a MatPlotLib bug, first mentioned in issue #6827. As we rely on MatPlotLib for plotting, we cannot fix this on our side. Once it has been patched, update Python and MatPlotLib.
Alternatively, you can increase the default rasterization resolution in PPI (Pixels Per Inch), which MatPlotLib for historic reasons calls DPI (Dots Per Inch, a quantity important for printing). To see how to modify these variables globally, check out the MatPlotLib Customization Guide. Often a value of 300 is enough. WARNING! This will increase your file size.
What do the weird unit prefixes, like ki, Mi and Gi, mean?
TLDR: They are binary prefixes.
These are the prefixes for binary sizes, which use a base of 1024 (2^10). They are stipulated by the International Electrotechnical Commission (IEC) and accepted as an international standard by the International Organization for Standardization (ISO) in ISO 80000. Each uses the power of 2 nearest to the corresponding Système International d'unités (SI) prefix, now commonly known as metric prefixes, which have a base of 1000.
Here is a comparison table between the two standards:
Metric | Value | Binary | Value | Ratio B/M |
---|---|---|---|---|
NONE | 10^0 | NONE | 2^0 | 1.000 |
k | 10^3 | ki | 2^{10} | 1.024 |
M | 10^6 | Mi | 2^{20} | 1.049 |
G | 10^9 | Gi | 2^{30} | 1.073 |
T | 10^{12} | Ti | 2^{40} | 1.100 |
A common problem with these unit prefixes is that they are equated with the metric prefixes. For larger units, however, the difference grows substantially, as indicated in the fifth column of the table, which shows the ratio of the binary prefix value to the corresponding metric prefix value. The lesson: do not equate these prefixes! A 7.3% difference may not seem like a lot, but when benchmarking connections it can be the difference between the value you expect and the one LinkTest returns.
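The ratios in the table can be reproduced with a few lines of Python. This is only an illustrative sketch; the names `prefix_ratio` and `to_metric` are invented here and are not part of LinkTest.

```python
# Metric and binary prefix pairs, ordered by increasing size.
PREFIXES = [("k", "ki"), ("M", "Mi"), ("G", "Gi"), ("T", "Ti")]


def prefix_ratio(order: int) -> float:
    """Return 2**(10*order) / 10**(3*order); order 1 is kilo, 2 is mega, ..."""
    return 2 ** (10 * order) / 10 ** (3 * order)


def to_metric(value: float, order: int) -> float:
    """Convert a value with a binary prefix to the metric prefix of the same order."""
    return value * prefix_ratio(order)


# Print the ratio of each binary prefix to its metric counterpart.
for order, (metric, binary) in enumerate(PREFIXES, start=1):
    print(f"1 {binary} = {prefix_ratio(order):.4f} {metric}")
```

For example, `to_metric(10, 3)` converts a value of 10 with the Gi prefix to roughly 10.74 with the G prefix, which is the kind of gap that can make a measured bandwidth look different from a data-sheet figure.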
I am running a latency test and the first row in my timings matrix is much slower than the others. What can I do?
TLDR: You likely forgot to use warm-up messages.
This depends on what you want to measure. Most systems in a computer operate on an on-demand basis to conserve resources. That means that connections are only established, and the relevant devices only initialized, the first time they are used. For a default LinkTest run without randomization of the test order, the first row in the timing matrix corresponds to the first connections that were tested. For these, the relevant connections and associated hardware, such as the required interconnects, had to be initialized, which takes time. In the default scenario the other rows of the timing matrix are measured later, so their latency only depends on the initialization of the connection.
This means that if you want to approximate the latency due to the start-up of the relevant devices, you can take the first row and subtract from each of its elements the average time of the remaining elements in the same column, that is, the column average excluding the first row.
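The subtraction described above can be sketched in plain Python as follows. The function name `startup_overhead` and the nested-list layout of the timing matrix are assumptions made for illustration. The sketch also assumes that every entry is a valid measurement, so entries that hold no measurement (for example a task paired with itself, if present in your data) would need to be filtered out first.

```python
from statistics import mean
from typing import List


def startup_overhead(timings: List[List[float]]) -> List[float]:
    """Estimate the start-up overhead per column of a timing matrix.

    For every column, the mean of all rows except the first is
    subtracted from the first-row value of that column.
    """
    if len(timings) < 2:
        raise ValueError("need at least two rows")
    first_row, other_rows = timings[0], timings[1:]
    return [
        first - mean(row[col] for row in other_rows)
        for col, first in enumerate(first_row)
    ]
```

With the toy matrix `[[5.0, 6.0], [1.0, 2.0], [3.0, 2.0]]` the function returns `[3.0, 4.0]`: the first row is 3 and 4 time units slower than the average of the later rows in the respective column.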
If we want our timing information not to be corrupted by these start-up times for the connections and associated hardware then we have two options:
- We can turn on all-to-all testing if using the MPI communication API. All-to-all testing occurs before and after the main test. As all-to-all testing requires a subset of the connections, it is likely that all relevant hardware and some of the to-be-tested connections are initialized by it. This, however, is not a perfect solution.
- Use warm-up messages. Using a non-zero number of warm-up messages causes connections to be pretested; this also applies to all-to-all testing. Timing then only occurs on initialized connections. This fully removes the slower timings in the first row and will in general improve timings across the board.
Note that LinkTest does not have the ability to explicitly pre-initialise hardware or connections.
My LinkTest timing matrix has a checkerboard pattern when running multiple tasks per node. What can I do?
TLDR: Change your process pinning.
This is an artifact of your process pinning. Due to the way modern CPUs are constructed, certain CPU cores have faster access to certain hardware devices than other cores, and hence faster connections to the CPUs of other nodes. This commonly manifests itself as a checkerboard pattern in the timing matrix. The checkerboard pattern can usually be avoided by reorganizing the rows and columns, which can be achieved by changing the process pinning when LinkTest is executed. For more information on how this is done, please see the documentation of the tools you use to execute LinkTest in parallel, for example mpiexec or srun.
How can I generate an animated GIF of multiple indexed-images from the Python reports, e.g. for presentations?
This can be done using ImageMagick, a command-line image-manipulation tool. The basic idea is to combine the PDF reports that you want to include in the GIF into one PDF, for example using pdfunite, a tool commonly available on many Linux distributions. Then process the PDF using ImageMagick as follows:
convert -density 150 -crop 1070x910+85+70 -delay 50 -dispose previous input.pdf -coalesce -layers OptimizePlus +repage output.gif
where input.pdf is the name and path of the combined PDF file containing the reports from which you wish to extract the indexed images, and output.gif is the name and path of the output GIF file. The -density option controls the image resolution; increasing the density increases the file size. The -crop option crops the loaded and rasterized PDF pages to the correct size. If you change the density you also need to change the crop accordingly, i.e. if you double the density you need to double the crop numbers as well. The -delay option controls the delay between frames in the GIF; increase this number to lengthen the delay. The -dispose option causes GIF frames to overwrite each other instead of stacking on top of each other. The -coalesce option causes the individual PDF pages to be coalesced into one GIF. The -layers OptimizePlus option optimizes the GIF to reduce file size. The +repage option discards the geometry information from the original PDF so that the GIF displays correctly.
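Because the crop numbers have to scale with the density, a small helper can do the arithmetic. The following is a sketch only: the function name `scale_crop` is made up, and it handles just the WxH+X+Y form with non-negative offsets used in the command above.

```python
import re


def scale_crop(crop: str, old_density: float, new_density: float) -> str:
    """Scale a WxH+X+Y crop geometry string from one density to another."""
    match = re.fullmatch(r"(\d+)x(\d+)\+(\d+)\+(\d+)", crop)
    if match is None:
        raise ValueError(f"unsupported geometry: {crop!r}")
    factor = new_density / old_density
    # Scale width, height and both offsets by the same factor.
    width, height, x_off, y_off = (
        round(int(value) * factor) for value in match.groups()
    )
    return f"{width}x{height}+{x_off}+{y_off}"
```

For example, `scale_crop("1070x910+85+70", 150, 300)` gives `2140x1820+170+140`, the crop to pass alongside a density of 300.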