- Introduction
- The Metadata
- The Timing Data
- Correcting Old File-Format Bugs
- Generating Reports from within Python
Introduction
LinkTest is bundled with a Python package called linktest (no capitalisation) for reading LinkTest SION files. It is generally installed during a default build.
The linktest package facilitates working with LinkTest data by providing a C reader for the generated SION files. Note that only uncompressed SION files are supported. It is also explicitly assumed that the system that generated the LinkTest data and the system reading it use the same representation for floating-point numbers; their endianness (big or little endian) may differ.
Only core Python 3.8.5 or above is required to use this package. That said, NumPy and Matplotlib are helpful for working with and visualising LinkTest data, respectively. For more on how this can be done, see Example.ipynb in the examples directory.
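The floating-point assumption can be made concrete with Python's standard struct module. The sketch below is not part of the linktest package and does not touch SION files; it assumes IEEE 754 doubles purely for illustration and shows that the same value has the same bit pattern under both byte orders, only reversed, which is the kind of difference the reader is stated to tolerate.

```python
import struct

value = 1.5  # arbitrary example value

little = struct.pack("<d", value)  # 8 bytes, least-significant byte first
big = struct.pack(">d", value)     # 8 bytes, most-significant byte first

# Same representation, opposite byte order.
print(little == big[::-1])

# Decoding with the matching byte order recovers the value exactly.
print(struct.unpack("<d", little)[0] == value)
print(struct.unpack(">d", big)[0] == value)

# Decoding with the wrong byte order yields a different number.
misread = struct.unpack("<d", big)[0]
print(misread == value)
```

A difference in the floating-point representation itself, as opposed to the byte order, could not be repaired by reversing bytes, which is presumably why it is ruled out.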
The reader can be used as follows to load the SION file called linktest.sion:
import linktest  # Import the linktest package
metaData, timingData = linktest.readSION("linktest.sion")  # Read the linktest.sion SION file
metaData and timingData are both dictionaries. metaData contains metadata describing the LinkTest run that generated the data. timingData contains the measured times as well as additional connection data generated during the run.
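A quick way to get an overview of either dictionary is to list its keys together with the type of each value. The helper below is a hypothetical convenience function, not part of the linktest package, and the sample dictionary is made up so that the sketch runs without a SION file.

```python
def describe(dictionary):
    """Return one 'key: type' line per entry, sorted by key."""
    return [
        f"{key}: {type(dictionary[key]).__name__}"
        for key in sorted(dictionary)
    ]


# Made-up stand-in for the metaData returned by linktest.readSION().
sample_meta = {"Version": "2.1.11", "Message Size": 1024}

for line in describe(sample_meta):
    print(line)
```

With real data, the same call would be made on the dictionaries returned by readSION().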
The Metadata
The metaData dictionary contains a plethora of information about how the run was executed and which options were used. The dictionary contains the following entries:
- Version: The three-part version number of the LinkTest executable that generated the data, as a string. The first number is the major version, the second the minor version and the third the patch level. This is useful for ascertaining whether LinkTest bugs may be responsible for seemingly erroneous timing results. Currently only SION files produced by LinkTest versions 2.0.0 and 2.1.(1,2,4--11) are supported.
- Write timestamp: Year: The year on the system when saving of the LinkTest output SION file began, as an integer. This, in conjunction with the other time and date data, is useful for determining approximately when the LinkTest run that generated the data was executed.
- Write timestamp: Month: The month on the system when saving of the LinkTest output SION file began, as an integer. This, in conjunction with the other time and date data, is useful for determining approximately when the LinkTest run that generated the data was executed.
- Write timestamp: Day: The day on the system when saving of the LinkTest output SION file began, as an integer. This, in conjunction with the other time and date data, is useful for determining approximately when the LinkTest run that generated the data was executed.
- Write timestamp: Hour: The hour on the system when saving of the LinkTest output SION file began, as an integer. This, in conjunction with the other time and date data, is useful for determining approximately when the LinkTest run that generated the data was executed.
- Write timestamp: Minute: The minute on the system when saving of the LinkTest output SION file began, as an integer. This, in conjunction with the other time and date data, is useful for determining approximately when the LinkTest run that generated the data was executed.
- Write timestamp: Second: The second on the system when saving of the LinkTest output SION file began, as an integer. This, in conjunction with the other time and date data, is useful for determining approximately when the LinkTest run that generated the data was executed.
- Write timestamp: Timezone: The timezone and time-coordinate type configured on the system when saving of the LinkTest output SION file began, as a string. The acronyms used here are system specific and their meaning is not stored by LinkTest; please consult the manuals for the system on which LinkTest was run to produce the data under investigation. This, in conjunction with the other time and date data, is useful for determining approximately when the LinkTest run that generated the data was executed.
- Mode: The communication API used when LinkTest was run. This was user configurable at the time of execution. See Communication API in the Glossary.
- Test MPI all-to-all?: An integer that, if non-zero, indicates that additional all-to-all testing was performed before and after the main run that generated the data under investigation. See All-to-All Testing in the Glossary.
- Do bidirectional tests?: An integer that, if non-zero, indicates that bidirectional testing was performed during the main part of the LinkTest run that generated the data under investigation. If zero, another form of testing was performed, either unidirectional or semi-directional. See Bidirectional Testing in the Glossary. This flag is mutually exclusive with the next one.
- Do unidirectional tests?: An integer that, if non-zero, indicates that unidirectional testing was performed during the main part of the LinkTest run that generated the data under investigation. If zero, another form of testing was performed, either bidirectional or semi-directional. See Unidirectional Testing in the Glossary. This flag is mutually exclusive with the previous one.
- Do a bisection test?: An integer that, if non-zero, indicates that a bisection test was performed to generate the data under investigation. See Bisection Testing in the Glossary.
- Randomize test order?: An integer that, if non-zero, indicates that the testing order was randomized before the data under investigation was generated. See Randomizing Testing Order in the Glossary.
- Test serially?: An integer that, if non-zero, indicates that the testing was done serially to generate the data under investigation. See Serial Testing in the Glossary.
- Store results in a SION file?: An integer that, if non-zero, indicates that the data under investigation was written to a SION file. This value should always be non-zero.
- Write SION output in parallel?: An integer that, if non-zero, indicates that the data under investigation was written to a SION file in parallel. This was an option when the LinkTest run that generated the data under investigation was executed.
- Store messages in GPU memory?: An integer that, if non-zero, indicates that the messages used during the LinkTest run that generated the data under investigation were stored in GPU memory. If multiple GPUs on a given host were used, the SION file does not record which GPU held the messages for a given LinkTest task. If zero, the messages were stored elsewhere, likely in CPU RAM. See CPU RAM and GPU RAM in the Glossary.
- Use multiple memory buffers?: An integer that, if non-zero, indicates that multiple buffers were used for sending and receiving messages. If non-zero, the number of buffers can be found under Number of multiple buffers. See Multiple Buffers in the Glossary.
- Randomize memory-buffer content?: An integer that, if non-zero, indicates that buffer content was randomized before messages were sent and received. If non-zero, the Mersenne Twister seed is stored under Mersenne Twister seed. See Buffer Randomization in the Glossary.
- Check memory-buffer content?: An integer that, if non-zero, indicates that the buffer content was checked for consistency after each test of a pair of connections. See Checking Memory-Buffer Content in the Glossary.
- Number of messages: An integer indicating the number of messages over which timing was averaged to generate the data under investigation. See Number Of Messages in the Glossary.
- Message Size: An integer indicating the message size in bytes used for timing to generate the data under investigation. See Message Size in the Glossary.
- Number of warm-up messages: An integer indicating the number of warm-up messages exchanged before timing began to generate the data under investigation. See Number of Warm-Up Messages in the Glossary.
- collectPNum: A legacy integer kept for backwards compatibility.
- Number of serial retests: An integer indicating the number of serial retests of the worst connections performed after the main part of the LinkTest run to generate the data under investigation. See Serial Retesting in the Glossary.
- Allocator type: A string indicating the type of memory-buffer allocator used. See Memory-Buffer Allocator in the Glossary.
- Number of multiple buffers: An integer indicating the number of multiple buffers used. Only present if Use multiple memory buffers? is non-zero. See Multiple Buffers in the Glossary.
- Mersenne Twister seed: An integer indicating the initial seed for the Mersenne Twister PRNG. Only present if Randomize memory-buffer content? is non-zero. See Buffer Randomization in the Glossary.
- Minimum time: A double-precision floating-point number indicating the shortest average time it took two LinkTest tasks to exchange messages during the LinkTest run that generated the data under investigation.
- Maximum time: A double-precision floating-point number indicating the longest average time it took two LinkTest tasks to exchange messages during the LinkTest run that generated the data under investigation.
- Averagetime: A double-precision floating-point number indicating the global average time it took two LinkTest tasks to exchange messages during the LinkTest run that generated the data under investigation.
- Minimum all-to-all time: A double-precision floating-point number indicating the shortest average time it took to perform an all-to-all communication on a LinkTest task during the LinkTest run that generated the data under investigation. This value is only available if the metadata dictionary element doAlltoall is non-zero.
- Maximum all-to-all time: A double-precision floating-point number indicating the longest average time it took to perform an all-to-all communication on a LinkTest task during the LinkTest run that generated the data under investigation. This value is only available if the metadata dictionary element doAlltoall is non-zero.
- Average all-to-all time: A double-precision floating-point number indicating the global average time it took to perform an all-to-all communication on a LinkTest task during the LinkTest run that generated the data under investigation. This value is only available if the metadata dictionary element doAlltoall is non-zero.
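Two of the entries above lend themselves to a quick provenance check: the version string and the write timestamp. The following is a minimal sketch, assuming the dictionary keys are spelled exactly as listed above; the helper names is_supported and write_time are hypothetical and the sample values are made up.

```python
import datetime

# Patch levels of the 2.1 series listed as supported: 1, 2 and 4 to 11.
SUPPORTED_2_1_PATCHES = {1, 2} | set(range(4, 12))


def is_supported(version):
    """Check a 'major.minor.patch' string against the supported versions."""
    major, minor, patch = (int(part) for part in version.split("."))
    if (major, minor, patch) == (2, 0, 0):
        return True
    return (major, minor) == (2, 1) and patch in SUPPORTED_2_1_PATCHES


def write_time(meta):
    """Combine the write-timestamp entries into a naive datetime."""
    fields = ("Year", "Month", "Day", "Hour", "Minute", "Second")
    return datetime.datetime(
        *(meta[f"Write timestamp: {name}"] for name in fields)
    )


# Made-up values standing in for a real metaData dictionary.
sample_meta = {
    "Version": "2.1.11",
    "Write timestamp: Year": 2021,
    "Write timestamp: Month": 3,
    "Write timestamp: Day": 14,
    "Write timestamp: Hour": 15,
    "Write timestamp: Minute": 9,
    "Write timestamp: Second": 26,
    "Write timestamp: Timezone": "CET",
}

print(is_supported(sample_meta["Version"]))
print(write_time(sample_meta).isoformat(),
      sample_meta["Write timestamp: Timezone"])
```

The timezone is kept as a separate string because its acronyms are system specific, so the sketch does not try to convert it into a UTC offset.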
Some of the names are currently not optimal and will likely change in the future; for the moment, however, they are kept for backwards compatibility.
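Because the names may change, it can help to keep the key strings in one place and to access optional entries defensively. The sketch below assumes the keys are spelled exactly as in the list above; the constants and the helpers classify_direction and buffer_count are hypothetical and not part of the linktest package.

```python
# Key names as listed in this document; adjust here if they change.
KEY_BIDIRECTIONAL = "Do bidirectional tests?"
KEY_UNIDIRECTIONAL = "Do unidirectional tests?"
KEY_MULTI_BUFFERS = "Use multiple memory buffers?"
KEY_BUFFER_COUNT = "Number of multiple buffers"


def classify_direction(meta):
    """Classify the main test from the two mutually exclusive flags."""
    bidirectional = meta[KEY_BIDIRECTIONAL] != 0
    unidirectional = meta[KEY_UNIDIRECTIONAL] != 0
    if bidirectional and unidirectional:
        raise ValueError("the two flags are documented as mutually exclusive")
    if bidirectional:
        return "bidirectional"
    if unidirectional:
        return "unidirectional"
    return "semi-directional"  # neither flag set


def buffer_count(meta):
    """Return the number of multiple buffers, or None if they were not used."""
    if meta[KEY_MULTI_BUFFERS] == 0:
        return None  # the count entry is only present when the flag is set
    return meta[KEY_BUFFER_COUNT]


# Made-up values standing in for a real metaData dictionary.
sample_meta = {
    KEY_BIDIRECTIONAL: 0,
    KEY_UNIDIRECTIONAL: 0,
    KEY_MULTI_BUFFERS: 1,
    KEY_BUFFER_COUNT: 4,
}
print(classify_direction(sample_meta), buffer_count(sample_meta))
```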
The Timing Data
The timingData dictionary contains all the timing-related data. Some data is only available depending on the metadata; for example, all-to-all times are only present if the metadata-dictionary element doAlltoall is non-zero. The timingData dictionary contains the following entries:
- Hosts: A list of byte literals (which can be viewed as a type of Python string), each at most 255 characters long, listing the host names in order of task number. Repeated host names indicate that multiple LinkTest tasks were run on a given host. How these tasks were pinned on the host is not recorded by LinkTest.
- Timings: A two-dimensional double-precision floating-point array (you can think of it as a matrix) of the timing data recorded by LinkTest. Each row corresponds to a sending task, each column to a receiving task. Times correspond to the average time taken for the configured test for the corresponding sender-receiver pair. NaN values indicate that the respective connection was not tested.
- Access Pattern: A two-dimensional integer array (you can think of it as a matrix). Each row corresponds to a sending task, each column to a receiving task. The integer value corresponds to the step in which the connection between the sending and receiving task was tested. A zero value indicates that the connection was never tested. For example, the top-left to bottom-right matrix diagonal should always be zero, as connections between a task and itself are not tested.
- New serially retested timings: A one-dimensional double-precision floating-point array of the times found after serially retesting the slowest connections. The corresponding old times before serial retesting can be found in serialTimingsOld. The corresponding connection partners can be found in serialFrom (source) and serialTo (receiver). NaN or 0.0 values originating from older versions of LinkTest indicate that a certain connection was not tested. In this case the same element in serialTimingsOld has the same value and both serialFrom and serialTo are 0 for this element.
- Old serially retested timings: A one-dimensional double-precision floating-point array of the times found before serially retesting the slowest connections. The corresponding new times after serial retesting can be found in serialTimingsNew. The corresponding connection partners can be found in serialFrom (source) and serialTo (receiver). NaN or 0.0 values originating from older versions of LinkTest indicate that a certain connection was not tested. In this case the same element in serialTimingsNew has the same value and both serialFrom and serialTo are 0 for this element.
- Serial-retest origin hosts: A one-dimensional integer array of the source tasks for the serial retesting of the slowest connections. The corresponding connection partners can be found in serialTo. The associated timing data before and after serial retesting of the connection can be found in serialTimingsOld and serialTimingsNew respectively. A value of 0, when serialTo is also 0 for the same element, originates from older LinkTest versions and indicates that the connection was not tested. serialTimingsOld and serialTimingsNew should then also be 0.0 or NaN for that element, depending on the LinkTest version.
- Serial-retest destination hosts: A one-dimensional integer array of the receiving tasks for the serial retesting of the slowest connections. The corresponding connection partners can be found in serialFrom. The associated timing data before and after serial retesting of the connection can be found in serialTimingsOld and serialTimingsNew respectively. A value of 0, when serialFrom is also 0 for the same element, originates from older LinkTest versions and indicates that the connection was not tested. serialTimingsOld and serialTimingsNew should then also be 0.0 or NaN for that element, depending on the LinkTest version.
- All-to-all timings: A one-dimensional double-precision floating-point array of the all-to-all times for each host. Times correspond to the average time it took the host to complete n iterations of all-to-all communication, where n is the number of messages used for testing.
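As an illustration of how the Hosts and Timings entries fit together, the sketch below counts tasks per host and lists the slowest tested connections while skipping NaN entries. It uses only the standard library and a small made-up data set; the helper names are hypothetical, the key names are assumed to be spelled as listed above, and host names are assumed to decode cleanly as text.

```python
import math
from collections import Counter


def tasks_per_host(hosts):
    """Count how many LinkTest tasks ran on each host."""
    return Counter(host.decode() for host in hosts)


def slowest_connections(timings, count=3):
    """Return up to `count` (time, sender, receiver) tuples, slowest first."""
    tested = [
        (time, sender, receiver)
        for sender, row in enumerate(timings)     # rows are sending tasks
        for receiver, time in enumerate(row)      # columns are receiving tasks
        if not math.isnan(time)                   # NaN marks untested pairs
    ]
    return sorted(tested, reverse=True)[:count]


nan = float("nan")
# Made-up stand-in for a real timingData dictionary with three tasks.
sample_timing = {
    "Hosts": [b"node01", b"node01", b"node02"],
    "Timings": [
        [nan, 1.0e-6, 4.0e-6],
        [1.1e-6, nan, 3.9e-6],
        [4.2e-6, 4.1e-6, nan],
    ],
}

print(dict(tasks_per_host(sample_timing["Hosts"])))
for time, sender, receiver in slowest_connections(sample_timing["Timings"], 2):
    print(sender, "->", receiver, time)
```

Plain indexing and iteration are used so that the same functions should also work if the reader returns NumPy arrays rather than nested lists.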
Some of the names are currently not optimal and will likely change in the future; for the moment, however, they are kept for backwards compatibility.
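The four serial-retest arrays are meant to be read element by element. A minimal sketch of pairing them up and dropping the placeholder entries of untested connections follows. It assumes the keys are spelled as in the list headings above; since the descriptions also refer to names such as serialTimingsOld, the actual keys may differ. The helper name and the sample data are made up.

```python
import math


def retested_connections(timing):
    """List (sender, receiver, old_time, new_time) for retested connections."""
    rows = zip(
        timing["Serial-retest origin hosts"],
        timing["Serial-retest destination hosts"],
        timing["Old serially retested timings"],
        timing["New serially retested timings"],
    )
    result = []
    for sender, receiver, old, new in rows:
        placeholder_time = math.isnan(new) or new == 0.0
        if sender == 0 and receiver == 0 and placeholder_time:
            continue  # entry marks a connection that was not retested
        result.append((sender, receiver, old, new))
    return result


nan = float("nan")
# Made-up stand-in for a real timingData dictionary.
sample_timing = {
    "Serial-retest origin hosts": [3, 0],
    "Serial-retest destination hosts": [7, 0],
    "Old serially retested timings": [9.0e-6, nan],
    "New serially retested timings": [4.0e-6, nan],
}

for sender, receiver, old, new in retested_connections(sample_timing):
    print(f"{sender} -> {receiver}: {old} before, {new} after")
```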
Correcting Old File-Format Bugs
The reader silently fixes bugs present in the SION files of older versions when loading the data. For example, older versions did not store the average times correctly: they should have been multiplied by a factor. When reading files with this issue, the reader corrects these values based on the information present in the file. This correction assumes that the SION file was not generated by a LinkTest stress test, something that is no longer possible in recent versions of LinkTest.
Generating Reports from within Python
After loading a SION file into Python, the data can also be used to generate reports; see LinkTest Report. The LinkTest module also includes the function linktest_report, with the following definition:
report(metaData,                            # LinkTest metadata
       timingData,                          # LinkTest timing data
       output_report_filename,              # Report name and path, including file extension
       title_string="",                     # Report title
       cutoffs=[None,None],                 # Timing cutoffs for clipping timings
       domain="",                           # Domain to remove from host names
       downsampling_factor_matrix_ticks=1,  # Y-axis downsampling factor
       verbose=False,                       # If true, prints report progress to the terminal
       matrix_colormap="gist_rainbow_r",    # Matplotlib name for the matrix colourmap
       histogram_colormap="gist_rainbow_r", # Matplotlib name for the histogram colourmap
       replace_slow=False                   # If true, retested timings replace original times
       ):
where the inputs are as follows:
- metaData: The metadata dictionary for the LinkTest data. It should be in the same format as the first output from readSION(). This input is required.
- timingData: The timing-data dictionary for the LinkTest data. It should be in the same format as the second output from readSION(). This input is required.
- output_report_filename: A Python string specifying the path and filename of the report to be generated. The file extension after the last dot indicates the file type, so a file extension is always necessary. This input is required.
- title_string: A Python string indicating the title of the report.
- cutoffs: A two-element array containing a minimum (1st entry) and maximum (2nd entry) time. Values outside this range are clipped. An entry that is None is set to the minimum time (1st entry) or the maximum time (2nd entry).
- domain: A Python string to remove from host names to simplify their display next to the indexed-colour image.
- downsampling_factor_matrix_ticks: The minimum downsampling factor to use when displaying node names along the y-axis of the indexed-colour plot.
- verbose: If the input evaluates to true, timing information is printed to the terminal.
- matrix_colormap: The name of the Matplotlib colourmap to use for the indexed-colour image.
- histogram_colormap: The name of the Matplotlib colourmap to use for the histogram.
- replace_slow: If the input evaluates to true, then before anything is calculated the times from the retests of slow connections replace the timings from the main test wherever they are smaller.
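The described behaviour of cutoffs and domain can be pictured with a few lines of plain Python. This is only a sketch of the semantics as documented above, not the code used inside the report function; the helper names are hypothetical and the input is assumed to be a flat, NaN-free list of times.

```python
def resolve_cutoffs(times, cutoffs=(None, None)):
    """Replace None entries by the minimum or maximum of the given times."""
    low, high = cutoffs
    if low is None:
        low = min(times)
    if high is None:
        high = max(times)
    return low, high


def clip(times, cutoffs=(None, None)):
    """Clip every time into the resolved [low, high] range."""
    low, high = resolve_cutoffs(times, cutoffs)
    return [min(max(time, low), high) for time in times]


def strip_domain(host, domain=""):
    """Remove the domain string from a host name for shorter labels."""
    return host.replace(domain, "")


times = [1.0, 2.0, 3.0, 4.0]  # made-up times in arbitrary units
print(clip(times, (1.5, None)))
print(clip(times, (None, 2.5)))
print(strip_domain("node01.example.org", ".example.org"))
```

With the default of two None entries, the resolved range spans all values, so nothing is clipped.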
You can generate a report in Python using this function as follows:
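import linktest  # Import the linktest package
metaData, timingData = linktest.readSION("linktest.sion")  # Read the linktest.sion SION file
# Generate the report; "report.pdf" is an example file name whose extension selects the file type
linktest.report(metaData, timingData, "report.pdf", title_string="Title")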
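The effect of replace_slow, as described in the input list, can be sketched independently of the report function. The helper below is hypothetical and only mirrors the documented rule that a retested time replaces the original one if it is smaller. It assumes that the retest arrays hold zero-based task indices into the timing matrix, and it skips non-positive or NaN retest times, which are taken to be placeholders for untested connections.

```python
def replace_slow_timings(timings, senders, receivers, new_times):
    """Return a copy of `timings` where faster retest results replace originals."""
    result = [list(row) for row in timings]  # leave the input untouched
    for sender, receiver, new in zip(senders, receivers, new_times):
        # NaN fails both comparisons, 0.0 fails the first: placeholders are skipped.
        if 0.0 < new < result[sender][receiver]:
            result[sender][receiver] = new
    return result


nan = float("nan")
# Made-up two-task example: connection 0 -> 1 improved on retest, 1 -> 0 did not.
timings = [[nan, 5.0], [2.0, nan]]
updated = replace_slow_timings(timings, [0, 1], [1, 0], [3.0, 2.5])
print(updated[0][1], updated[1][0])
```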