- Use a proper PDF Viewer!!!
- Introduction
- Options
- Description Of LinkTest Reports
- The Report Header
- The Matrix Plot
- The Histogram Plot
- The Textual Report
- The Communication API
- The Number of Tasks
- The Number of Hosts
- The Number of Messages
- The Message Size
- The Number of Warm-Up Messages
- The Number of Serial Retests
- The Domain
- The Execution Order
- All-to-All Testing
- Mixing Ranks
- Test Configuration
- Bisection Testing
- Using GPU Memory
- Buffer PRNG Seed
- Number of Multiple Buffers
- The Minimum Timing Value and the Maximum Bandwidth
- The Average Timing Value and the Average Bandwidth
- The Maximum Timing Value and the Minimum Bandwidth
- The Minimum All-to-All Timing Value and the Maximum All-to-All Bandwidth
- The Average All-to-All Timing Value and the Average All-to-All Bandwidth
- The Maximum All-to-All Timing Value and the Minimum All-to-All Bandwidth
- The All-To-All Speedup Compared To Equivalent Point-to-Point Communications
- The Bisection Bandwidth
- Retesting of Worst Connections in Serial
- The Version Footer
- Troubleshoot
Use a proper PDF Viewer!!!
Before looking at LinkTest reports, make sure you are using a proper PDF viewer, such as the newest version of Adobe Acrobat Reader. As of the time of writing (10.10.2021) many PDF viewers, especially those bundled with web browsers (Firefox, Chrome, Edge, etc.), interpolate pixels linearly to screen resolution when displaying raster images (think PNG or JPG). This is unsuitable for the colour-indexed images in our reports, because the blended colours between neighbouring pixels make them difficult to interpret. Viewers that use nearest-neighbour interpolation preserve the colour of every pixel and therefore display the reports correctly.
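The difference between the two interpolation schemes can be illustrated with a minimal sketch. It is not taken from any PDF viewer; the function names and the two-pixel example row are purely illustrative assumptions.

```python
def upscale_nearest(row, factor):
    """Repeat every source pixel `factor` times (nearest-neighbour)."""
    return [value for value in row for _ in range(factor)]


def upscale_linear(row, factor):
    """Linearly interpolate between neighbouring source pixels."""
    out = []
    for i in range(len(row) - 1):
        for step in range(factor):
            t = step / factor
            out.append((1 - t) * row[i] + t * row[i + 1])
    out.append(row[-1])
    return out


# Two adjacent connections with very different timings (arbitrary units).
row = [1.0, 9.0]
nearest = upscale_nearest(row, 4)
linear = upscale_linear(row, 4)
print(nearest)  # only the two measured values appear
print(linear)   # [1.0, 3.0, 5.0, 7.0, 9.0]: values that were never measured
```

With linear interpolation the viewer shows colours for timings such as 3.0 or 5.0 that do not exist in the data, which is why the matrix plot becomes hard to read.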
Introduction
LinkTest reports can be generated using the `linktest-report` Python tool, which should have been installed during the build process. LinkTest reports graphically display the data generated by linktest and give a textual overview of the LinkTest benchmark configuration used to generate the data. The data is displayed in two ways: 1) as a matrix plot, also known as an indexed image or heat map, where the colour of each pixel indicates the communication time between the two associated nodes; 2) as a coloured histogram, which indicates how many connections fall within a certain time range. Below the two figures is a textual summary of the LinkTest data and of the LinkTest benchmark configuration.
If `linktest-report` is available on the system path it can be used as follows:
linktest-report -i output.sion -o report.pdf
This creates a PDF report called `report.pdf`, provided PDF output is supported, from the LinkTest output data in `output.sion`. The file type of the generated report can be changed by replacing the extension `pdf` with that of another supported file type. As such the following is also valid:
linktest-report -i output.sion -o report.png
which creates a PNG report. For an overview of the supported file types consult the MatPlotLib documentation for your system and installation, as well as its configuration.
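When several report formats are needed, the invocation above can be scripted. The following is a minimal sketch that only uses the `-i` and `-o` options shown above; the helper names `build_report_command` and `generate_reports` are hypothetical and not part of LinkTest.

```python
import shutil
import subprocess


def build_report_command(input_file, output_file):
    """Return the argument list for the documented -i/-o invocation."""
    return ["linktest-report", "-i", input_file, "-o", output_file]


def generate_reports(input_file, stem, extensions=("pdf", "png")):
    """Run linktest-report once per extension, if the tool is on the path."""
    if shutil.which("linktest-report") is None:
        raise FileNotFoundError("linktest-report not found on the system path")
    for ext in extensions:
        command = build_report_command(input_file, f"{stem}.{ext}")
        subprocess.run(command, check=True)


print(build_report_command("output.sion", "report.pdf"))
```

Calling `generate_reports("output.sion", "report")` would then attempt to write `report.pdf` and `report.png`, assuming both file types are supported by the local MatPlotLib installation.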
It is also possible to generate reports directly from Python after the data has been read in; see Generating Reports From Within Python.
Note that when generating raster reports your default Pixels Per Inch (PPI) settings are used. This may result in poor-quality output. In this case please adapt the MatPlotLib defaults as necessary.
By default LinkTest reports are generated on A4 pages. Other page sizes are not supported. Units adhere to the SI standard, aside from binary sizes which adhere to the IEC convention of using a base of 1 024 as opposed to the SI base of 1 000. Label sizes adjust automatically to fit the allocated space. Label sizes below 1 pt are sadly not supported by the python package MatPlotLib, which is used to generate the report. Note that the size of 1 pt may depend on your system configuration. It is further assumed that your system adheres to the international definition of the inch as 2.54 cm.
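To make the IEC convention concrete, here is a small sketch of how a byte count can be expressed with binary prefixes. The function `format_iec` is a hypothetical illustration; the exact formatting used in the report may differ.

```python
def format_iec(num_bytes):
    """Format a byte count with IEC binary prefixes (base 1 024)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit in which the value is below 1 024.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


print(format_iec(16 * 1024**2))  # 16 MiB
print(format_iec(1536))          # 1.5 KiB
print(format_iec(1000))          # 1000 B, not 1 kB: the base is 1 024
```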
Options
A small number of report settings can be configured from the command line. The following options are supported:
`-h` or `--help`: Brings up textual help for `linktest-report`. Version information can be found in the textual help.
`-i` or `--input`: Specifies the filename of the input SION file to be used when generating the report. The input file must be a SION file.
`-o` or `--output`: Specifies the filename and, via the filename extension, the file type of the report to be generated. Various file types are likely to be supported on your system via the Python package MatPlotLib. Generating PDF files is suggested due to their compact nature and universal accessibility.
`-t` or `--title`: Specifies the title to be used in the report. The default is the name of the input SION file. A title containing spaces can be specified by enclosing it in quotes, for example `--title "A Title"`, which sets the title of the report to: A Title.
`-d` or `--domain`: Specifies a domain name to be removed from the host names before the report is generated. This is useful for shortening host names that are very long or redundant. The default is an empty domain name, which means that nothing is removed from the host names.
`--min_time`: Specifies a floating-point number used to clip times before the plots are generated: all times from the main LinkTest test that are less than this value are set to this value before plotting. This can be used to control the minimum value of the colourbar and the minimum bin of the histogram. If this option causes the data to be clipped then "WARNING: Figures Clipped!!!" will be written in red under the date and time the SION file was written, at the top of the report. The clip value(s) can then be determined by looking at the colourbar and the corresponding minimum and maximum values in the textual component of the report.
`--max_time`: Specifies a floating-point number used to clip times before the plots are generated: all times from the main LinkTest test that are greater than this value are set to this value before plotting. This can be used to control the maximum value of the colourbar and the maximum bin of the histogram. If this option causes the data to be clipped then "WARNING: Figures Clipped!!!" will be written in red under the date and time the SION file was written, at the top of the report. The clip value(s) can then be determined by looking at the colourbar and the corresponding minimum and maximum values in the textual component of the report.
`--replace_slowest`: If specified, the times from the retested slowest connections replace the timings from the main test if they are smaller. This affects both the matrix plot and the histogram. If values are replaced a warning is issued under the date and time the SION file was written in the top right corner.
`--downsampling_factor_matrix_ticks`: Specifies an integer used to down-sample the ticks of the indexed-image plot. By default all host names are plotted unless this would cause the font size to fall below 1 pt, in which case a down-sampling factor is estimated. The specified down-sampling factor overrides the estimated one if it is larger. The advantage of a large down-sampling factor is that tick labels can be bigger and report generation is often faster, since plotting tick labels is one of the slowest operations in MatPlotLib.
`--verbose`: If specified, additional timing information for segments of the report generation is provided. This is useful for those prone to assuming that a tool is hanging when it generates no terminal output.
`--indexed_image_colormap`: Specifies the colourmap to be used for the matrix plot. See the MatPlotLib documentation for your system and the MatPlotLib configuration for supported colourmap names. The default is a rainbow colourmap.
`--histogram_colormap`: Specifies the colourmap to be used for the histogram. See the MatPlotLib documentation for your system and the MatPlotLib configuration for supported colourmap names. The default is a rainbow colourmap.
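The effect of the clipping and replacement options on the plotted times can be sketched as follows. This is an illustration of the behaviour described above under simplifying assumptions (a flat list of times, retests given as an index-to-time mapping); the function names are hypothetical and this is not the code used by `linktest-report`.

```python
def clip_times(times, min_time=None, max_time=None):
    """Clamp each time to [min_time, max_time]; report whether anything changed."""
    clipped = []
    for t in times:
        c = t
        if min_time is not None and c < min_time:
            c = min_time
        if max_time is not None and c > max_time:
            c = max_time
        clipped.append(c)
    return clipped, clipped != list(times)


def replace_slowest(times, retests):
    """Replace a main-test time by its retested time only if the retest is smaller.

    `retests` maps an index into `times` to the serially retested time.
    """
    updated = list(times)
    for index, new_time in retests.items():
        if new_time < updated[index]:
            updated[index] = new_time
    return updated


times = [1.2, 1.3, 9.0, 1.25]  # hypothetical times in ms
clipped, was_clipped = clip_times(times, min_time=1.25, max_time=2.0)
print(clipped, was_clipped)              # [1.25, 1.3, 2.0, 1.25] True
print(replace_slowest(times, {2: 1.4}))  # [1.2, 1.3, 1.4, 1.25]
```

The boolean returned by `clip_times` corresponds to the situation in which the report would carry the clipping warning.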
Unless `-h` or `--help` is specified, `-i` or `--input` and `-o` or `--output` are required. All other options are optional. `-h` or `--help` always takes precedence, in which case all other options are ignored.
Description Of LinkTest Reports
For the description of LinkTest reports we will be using this report as a basis. This is a standard report including all-to-all testing. We will also show results from this bisection report. For other reports see Noteworthy Reports.
The Report Header
The title of the report can be found in the top right corner of the LinkTest report. It has the largest font size of any text in the report and can be configured from the command line when the report is generated.
The approximate date and time when the SION file corresponding to the report was written can be found in the upper left corner. This information is queried from the SION file, which was written by task 0 in the LinkTest benchmark. As such it depends on the configuration of the system on which task 0 was run.
If the data were clipped before plotting then "WARNING: Figures Clipped!!!" will be written under the date in red. The clip value(s) can then be determined by looking at the colourbar and corresponding minimum and maximum values in the textual component of the report.
The Matrix Plot
The matrix plot, also known as an indexed image or heat map, is the large figure with a colourbar on the right-hand side towards the top of the page. It takes up half of the page.
In the matrix plot the colour of each pixel corresponds, via the colourbar, to the average time it took for the host, or task, indicated at the left side of the figure to send a message to the host, or task, indicated by the task number at the top of the figure and to receive a message back, i.e. the ping-pong time between the two. With the default rainbow colourmap, connections that were not tested have white pixels.
The labels of the y-axis on the left contain the host name followed by a period and the corresponding LinkTest task number padded with leading zeros. Note that a given host may have more than one task associated with it. The x-axis at the top of the figure only shows the LinkTest task number; the corresponding host name can be looked up on the y-axis.
In this figure we can, for example, see that the connection between task 13 on `jrc0681.jureca` and task 8 on `jrc0670.jureca` seems poor, since the corresponding pixel is red, indicating a low bandwidth. A look at the colourbar shows, however, that the difference between the best connections, at about 11.45 GiB/s, and this connection, at about 11.30 GiB/s, is likely negligible.
Since a default LinkTest benchmark was performed between all hosts, only the diagonal from the top-left corner to the bottom-right corner is missing: a connection from a task to itself cannot be benchmarked. If a bisection test had been performed then only the upper-right and lower-left quadrants would be filled. The following image shows a bisection matrix plot.
The Histogram Plot
A histogram is another way to statistically analyse the LinkTest data. It shows how many connections fall within a given range, also known as a bin. The number of bins is equal to the number of ranks. The y-axis indicates the number of connections in a given bin, while the x-axis indicates the LinkTest timing information and the associated bandwidths. The width of each bin as drawn corresponds to the actual range of times associated with the bin.
This display is very useful when interpreting LinkTest results if multiple groups, or modes, of times can be identified. This often indicates that different types of transport or different transport routes are used. If one is testing the connections between multiple computers in a network, these groups might be related to the number of other computers a message has to travel through between source and destination. In this scenario the fastest group may correspond to direct connections between the source and destination machines, the second group may correspond to connections with one additional computer in between, which slows down the connection, and so forth. In this example no such grouping is present, since all tests were performed on a single CPU. Below is an example of a histogram showing different groups, or modes, and some outliers. For more examples see Noteworthy Reports.
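How connection times fall into groups can be explored with a simple histogram computation. The sketch below assumes equal-width bins between the smallest and largest time, which may differ from the binning used in the report; the function name and the example times are hypothetical.

```python
def histogram(times, num_bins):
    """Count how many times fall into each of `num_bins` equal-width bins."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width if all times are equal
    counts = [0] * num_bins
    for t in times:
        # The largest time is placed in the last bin.
        index = min(int((t - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return counts, edges


# Hypothetical times (ms): a fast group, a slower group and one outlier.
times = [1.0, 1.1, 1.05, 1.0, 2.0, 2.1, 2.05, 4.0]
counts, edges = histogram(times, num_bins=4)
print(counts)  # [4, 3, 0, 1]
```

The two populated neighbouring bins correspond to two modes, while the isolated last bin contains the outlier.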
The Textual Report
The textual summary of the report can be found in a box at the end of the report. It summarizes the configuration of the LinkTest benchmark and the key results. If applicable it also displays all-to-all data, bisection bandwidths and the worst serially retested connections.
The Communication API
`Communication API:` indicates the communication API used for benchmarking. This information is read from the SION file and should correspond to a supported, or previously supported, communication API. For more information see the supported LinkTest Communication APIs. TODO: include link.
The Number of Tasks
`Number of Tasks:` indicates the number of tasks used to run LinkTest. This information is inferred from the stored list of host names in the SION file. Note that this need not correspond to the number of hosts used to run LinkTest, since a given host may have run multiple tasks. It is assumed that the same number of tasks was executed on every host. If this is not the case, the indicated value is likely wrong. To check whether this is the case, inspect the list of host names in the SION file; the SION file can, for example, be imported into Python using the LinkTest Reader. TODO: Include Link
The Number of Hosts
`Number of Hosts:` indicates the number of hosts used to run LinkTest. This information is inferred from the stored list of host names in the SION file. Note that this need not correspond to the number of tasks used to run LinkTest, since a given host may have run multiple tasks.
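Whether every host ran the same number of tasks can be checked from the list of host names. The sketch below assumes that the list contains one host-name entry per task and that it has already been loaded into a plain Python list; how it is read from the SION file is not shown, and the function name is hypothetical.

```python
from collections import Counter


def summarize_hosts(host_names):
    """Return (number of tasks, number of hosts, same task count on every host?)."""
    tasks_per_host = Counter(host_names)
    uniform = len(set(tasks_per_host.values())) <= 1
    return len(host_names), len(tasks_per_host), uniform


# Hypothetical host list with one entry per task.
hosts = ["jrc0670.jureca", "jrc0670.jureca", "jrc0681.jureca", "jrc0681.jureca"]
print(summarize_hosts(hosts))      # (4, 2, True)
print(summarize_hosts(hosts[:3]))  # (3, 2, False)
```

If the last element of the result is `False`, the task count shown in the report should be treated with caution.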
The Number of Messages
`Number of Messages:` indicates the number of messages LinkTest transmitted, over which the timing information is averaged.
The Message Size
`Message Size:` indicates the size of the messages LinkTest transmitted. Both the exact size in bytes and the corresponding size in IEC units are displayed.
The Number of Warm-Up Messages
`# Warm-Up Messages:` indicates the number of warm-up messages that were transmitted to warm up a connection before timing began. Warm-up messages should be used to improve timing information, see Noteworthy Reports.
The Number of Serial Retests
`# Serial Retests:` indicates the number of serial retests of the worst connections performed by LinkTest. If non-zero, up to five of the worst connections are shown at the end of the textual component of the report.
The Domain
`Domain:` indicates the user-specified domain that was removed from the host names when the report was generated.
The Execution Order
`Execution Order:` indicates the way in which the benchmark was performed: connections were tested either in parallel, if `parallel` is indicated, or sequentially, if `serial` is indicated.
All-to-All Testing
`All-to-All:` indicates whether additional all-to-all tests were performed before and after the main test. If yes, additional information can be found towards the bottom of the textual component of the report.
Mixing Ranks
`Mixed Ranks:` indicates whether connections were tested in the fixed default order or in a randomized order. Randomizing the order is useful to avoid systematic errors when repeatedly testing systems on which results depend on the order in which connections are tested. The actual order in which connections were tested is not shown in the report; the SION file needs to be inspected for this. See the Python example for how to load this information into Python. TODO: Include Link
Test Configuration
`Test Configuration:` indicates how the LinkTest benchmark was performed. The possible options are: 1) bidirectionally, in which case messages were exchanged between source and receiver at the same time, or 2) semidirectionally, in which case the message was first sent from the source to the receiver and then returned by the receiver to the source, or 3) unidirectionally, in which case the source sends all messages back to back before waiting for a receipt.
Bisection Testing
`Bisection:` indicates whether a bisection test was executed. If so, the set of LinkTest tasks was split into two halves and connections were only timed between the two. In this case only the lower-left and upper-right quadrants of the Matrix Plot should be filled. If a bisection test occurred the bisection bandwidth will be displayed below.
Using GPU Memory
`Use GPUs:` indicates whether GPU memory was used to host the messages. If not, the main memory of the CPU is used by default. This option is useful to test the sending of messages between GPUs.
Buffer PRNG Seed
`Buffer PRNG Seed:` indicates the 32-bit seed of the Mersenne Twister, a pseudo-random number generator (PRNG) based on the Mersenne prime 2^19937-1, used to randomize the buffers. If `N/A` is shown then the buffers were not randomized.
Number of Multiple Buffers
`# Multiple Buffers:` indicates the number of multiple buffers used. If `None` is shown then no multiple buffers were used. This is currently only relevant for the unidirectional MPI case.
The Minimum Timing Value and the Maximum Bandwidth
`Min. Value:` indicates the fastest connection time and, below it, the associated highest bandwidth.
The Average Timing Value and the Average Bandwidth
`Average Value:` indicates the average connection time and, below it, the associated average bandwidth.
The Maximum Timing Value and the Minimum Bandwidth
`Max. Value:` indicates the slowest connection time and, below it, the associated lowest bandwidth.
The Minimum All-to-All Timing Value and the Maximum All-to-All Bandwidth
`Min. Value:` indicates the fastest all-to-all connection time and, below it, the associated highest all-to-all bandwidth. This information is only present if all-to-all testing was enabled.
The Average All-to-All Timing Value and the Average All-to-All Bandwidth
`Average Value:` indicates the average all-to-all connection time and, below it, the associated average all-to-all bandwidth. This information is only present if all-to-all testing was enabled.
The Maximum All-to-All Timing Value and the Minimum All-to-All Bandwidth
`Max. Value:` indicates the slowest all-to-all connection time and, below it, the associated lowest all-to-all bandwidth. This information is only present if all-to-all testing was enabled.
The All-To-All Speedup Compared To Equivalent Point-to-Point Communications
`A2A Speedup:` indicates the speedup of the all-to-all communication compared to distributing the same information via point-to-point communications. This value is computed by multiplying the average LinkTest connection time by n(n-1), where n is the number of LinkTest tasks, and dividing by the average all-to-all time. This information is only present if all-to-all testing was enabled and an otherwise default test was performed, i.e. no bisection test was performed.
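The speedup formula described above can be written out as a short sketch. The function name and the example numbers are hypothetical; both times are assumed to be given in the same unit.

```python
def a2a_speedup(avg_p2p_time, avg_a2a_time, num_tasks):
    """Speedup = average point-to-point time * n * (n - 1) / average all-to-all time."""
    return avg_p2p_time * num_tasks * (num_tasks - 1) / avg_a2a_time


# Hypothetical values: 16 tasks, 1.5 ms per connection, 90 ms per all-to-all.
print(a2a_speedup(avg_p2p_time=1.5, avg_a2a_time=90.0, num_tasks=16))  # 4.0
```

A value above 1 means the all-to-all exchange was faster than performing the n(n-1) equivalent point-to-point exchanges one after another.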
The Bisection Bandwidth
`Bisection Bandwidth:` indicates the average aggregated bisection bandwidth between the two bisecting halves, found by splitting the available LinkTest tasks into two halves and then testing between the two. This value is determined by averaging the bisection times and multiplying by half the number of tasks before determining the bandwidth. This information is only present if bisection testing was enabled. For an example see the figure above.
Retesting of Worst Connections in Serial
If the number of serial retests is non-zero then the remaining space in the textual summary contains information on the worst serially retested connections in the form:
{Host_A} to {Host_B}: {Old_Time} ({Old_Bandwidth}) -> {New_Time} ({New_Bandwidth})
where:
`{Host_A}` is the source host for the bad connection.
`{Host_B}` is the receiving host for the bad connection.
`{Old_Time}` is the originally measured bad connection time.
`{Old_Bandwidth}` is the originally measured bad bandwidth.
`{New_Time}` is the new serially remeasured connection time.
`{New_Bandwidth}` is the new serially remeasured bandwidth.
In the reference report we see the following line:
jrc0147.jureca.17 to jrc0114.jureca.05: 1.3763 ms (11.353 GiB/s) -> 1.3669 ms (11.431 GiB/s)
This tells us that the connection between `jrc0147.jureca` and `jrc0114.jureca`, where the same host-name convention is used as for the matrix plot, was among the slowest measured. Serially retesting the connection improved the time from 1.3763 ms to 1.3669 ms; correspondingly, the bandwidth improved from 11.353 GiB/s to 11.431 GiB/s. This suggests that even though this was the slowest connection, it was not a bad connection, as the time likely falls within the expected test variance. Always keep scales in mind when analysing LinkTest results.
Note that if the test was a bisection test, the worst connections are still retested serially in a point-to-point fashion. As a result the numbers are not comparable.
Note that the worst serially retested connections are always displayed first.
Serially retesting the worst connections is very useful for determining whether bad connections are random, or perhaps an artefact of the measurement itself, or whether the connections are actually bad, which would be the case if the times did not improve to the expected values.
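When many reports need to be compared, the retest lines can be parsed and the relative improvement computed. The sketch below is based solely on the line format given above; the regular expression, the function name and the returned dictionary keys are assumptions, and both times are assumed to use the same unit.

```python
import re

# Pattern for: {Host_A} to {Host_B}: {Old_Time} ({Old_Bandwidth}) -> {New_Time} ({New_Bandwidth})
# The arrow is treated as optional in case it was lost when copying the text.
RETEST_LINE = re.compile(
    r"(?P<host_a>\S+) to (?P<host_b>\S+): "
    r"(?P<old_time>[\d.]+) \S+ \((?P<old_bw>[\d.]+) [^)]+\)"
    r"\s*(?:->)?\s*"
    r"(?P<new_time>[\d.]+) \S+ \((?P<new_bw>[\d.]+) [^)]+\)"
)


def parse_retest_line(line):
    """Return hosts, times, bandwidths and the relative time improvement."""
    match = RETEST_LINE.search(line)
    if match is None:
        raise ValueError(f"not a retest line: {line!r}")
    result = {"host_a": match["host_a"], "host_b": match["host_b"]}
    for key in ("old_time", "old_bw", "new_time", "new_bw"):
        result[key] = float(match[key])
    # Fraction by which the retest was faster than the original measurement.
    result["improvement"] = (result["old_time"] - result["new_time"]) / result["old_time"]
    return result


line = ("jrc0147.jureca.17 to jrc0114.jureca.05: "
        "1.3763 ms (11.353 GiB/s) -> 1.3669 ms (11.431 GiB/s)")
parsed = parse_retest_line(line)
print(parsed["host_a"], parsed["host_b"], f"{parsed['improvement']:.2%}")
```

For the reference line the improvement is well below one percent, which supports reading the difference as ordinary test variance rather than a recovered bad connection.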
The Version Footer
At the bottom of the page you will find the version footer. On the left the LinkTest version used to generate the data is indicated. On the right the tool and its version as well as its distributor are indicated.
Troubleshoot
If you experience issues related to `tkinter` from Python when generating the PDF report, your plotting backend may have difficulty generating the PDF file, especially if no (virtual) displays are available. In this case explicitly set the plotting backend to `Agg` via `export MPLBACKEND=agg`. If this backend is not available try other backends.
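If the report tool is launched from a Python script rather than a shell, the same environment variable can be passed to the child process. This is a minimal sketch; the helper name `headless_env` is hypothetical, and the commented call assumes `linktest-report` is on the system path.

```python
import os


def headless_env(backend="Agg"):
    """Return a copy of the current environment with MPLBACKEND set."""
    env = dict(os.environ)
    env["MPLBACKEND"] = backend
    return env


env = headless_env()
print(env["MPLBACKEND"])

# Example use (not executed here):
# import subprocess
# subprocess.run(["linktest-report", "-i", "output.sion", "-o", "report.pdf"],
#                env=env, check=True)
```

Copying the environment instead of modifying `os.environ` keeps the setting local to the launched process.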