[[_TOC_]]

# Introduction
LinkTest is bundled with a Python package called `linktest` (no capitalisation) that enables reading LinkTest SION files. It is generally installed as part of a default build.

The `linktest` package facilitates working with LinkTest data by providing a C reader for the generated SION files. Note that only uncompressed SION files are supported. The reader also explicitly assumes that the system that generated the LinkTest data and the system reading it use the same representation for floating-point numbers; their endianness (big or little endian) may differ.

Only core Python 3.8.5 or above is required to use this package. That said, NumPy and Matplotlib are helpful for working with and visualising LinkTest data, respectively. For more on how this can be done, see [`Example.ipynb`](https://gitlab.jsc.fz-juelich.de/cstao-public/linktest/-/blob/master/examples/Example.ipynb) in the `examples` directory.

The reader can be used as follows to load the SION file called `linktest.sion`:
```python
import linktest  # Import the linktest package

# Read the SION file linktest.sion
metaData, timingData = linktest.readSION("linktest.sion")
```
`metaData` and `timingData` are both dictionaries. `metaData` contains metadata describing the LinkTest run that generated the data. `timingData` contains the measured times as well as additional connection data generated during the run.
# The Metadata

The `metaData` dictionary contains detailed information on how the run was executed and which options were used. The dictionary contains the following entries:

- `Version`: The three-part version number of the LinkTest executable that generated the data, as a string. The first number is the major version, the second the minor version and the third the patch level. Check this value to ascertain whether known LinkTest bugs may be responsible for seemingly erroneous timing results. Currently only SION files produced by LinkTest versions 2.0.0 and 2.1.(1,2,4--11), i.e. 2.1.1, 2.1.2 and 2.1.4 to 2.1.11, are supported.
- `Write timestamp: Year`: The year, as an integer, on the system when saving of the LinkTest output SION file began. Together with the other `Write timestamp` entries, this is useful for determining approximately when the LinkTest run that generated the data was executed.
- `Write timestamp: Month`: The month, as an integer, on the system when saving of the LinkTest output SION file began.
- `Write timestamp: Day`: The day, as an integer, on the system when saving of the LinkTest output SION file began.
- `Write timestamp: Hour`: The hour, as an integer, on the system when saving of the LinkTest output SION file began.
- `Write timestamp: Minute`: The minute, as an integer, on the system when saving of the LinkTest output SION file began.
- `Write timestamp: Second`: The second, as an integer, on the system when saving of the LinkTest output SION file began.
- `Write timestamp: Timezone`: The timezone and time-coordinate type, as a string, configured on the system when saving of the LinkTest output SION file began. The acronyms used here are system specific and their meaning is not stored by LinkTest; please consult the manuals of the system on which LinkTest was run to produce the data under investigation.
- `Mode`: The communication API used when LinkTest was run. This was user configurable at the time of execution. See [Communication API in the Glossary](Glossary#communication-api).
- `Test MPI all-to-all?`: An integer that if non-zero indicates that additional all-to-all testing was performed before and after the main run that generated the data under investigation. See [All-to-All Testing in the Glossary](Glossary#all-to-all-testing).
- `Do bidirectional tests?`: An integer that if non-zero indicates that bidirectional testing was performed during the main part of the LinkTest run that generated the data under investigation. If zero, another form of testing was performed, either unidirectional or semi-directional testing. See [Bidirectional Testing in the Glossary](Glossary#bidirectional-testing). This flag is mutually exclusive with `Do unidirectional tests?`.
- `Do unidirectional tests?`: An integer that if non-zero indicates that unidirectional testing was performed during the main part of the LinkTest run that generated the data under investigation. If zero, another form of testing was performed, either bidirectional or semi-directional testing. See [Unidirectional Testing in the Glossary](Glossary#unidirectional-testing). This flag is mutually exclusive with `Do bidirectional tests?`.
- `Do a bisection test?`: An integer that if non-zero indicates that a bisection test was performed to generate the data under investigation. See [Bisection Testing in the Glossary](Glossary#bisection-testing).
- `Randomize test order?`: An integer that if non-zero indicates that the testing order was randomized before the data under investigation was generated. See [Randomizing Testing Order in the Glossary](Glossary#randomizing-testing-order).
- `Test serially?`: An integer that if non-zero indicates that the testing was done in serial to generate the data under investigation. See [Serial Testing in the Glossary](Glossary#serial-testing).
- `Store results in a SION file?`: An integer that if non-zero indicates that the data under investigation was written to a SION file. This value should always be non-zero.
- `Write SION output in parallel?`: An integer that if non-zero indicates that the data under investigation was written to a SION file in parallel. This was an option when the LinkTest run that generated the data under investigation was executed.
- `Store messages in GPU memory?`: An integer that if non-zero indicates that the messages used during the LinkTest run that generated the data under investigation were stored in GPU memory. If multiple GPUs on a given host were used, the GPU on which the messages for a given LinkTest task were stored cannot be determined from the SION file. If zero, the messages were stored elsewhere, likely in CPU RAM. See [CPU RAM](Glossary#cpu-ram) and [GPU RAM](Glossary#gpu-ram) in the [Glossary](Glossary).
- `Use multiple memory buffers?`: An integer that if non-zero indicates that multiple buffers were used for sending and receiving messages. If non-zero, the number of buffers can be found under `Number of multiple buffers`. See [Multiple Buffers](Glossary#multiple-buffers) in the [Glossary](Glossary).
- `Randomize memory-buffer content?`: An integer that if non-zero indicates that buffer content was randomized before messages were sent and received. If non-zero, the Mersenne Twister seed is stored under `Mersenne Twister seed`. See [Buffer Randomization](Glossary#buffer-randomization) in the [Glossary](Glossary).
- `Check memory-buffer content?`: An integer that if non-zero indicates that the buffer content was checked for consistency after each test of a pair of connections. See [Checking Memory-Buffer Content](Glossary#checking-memory-buffer-content) in the [Glossary](Glossary).
- `Number of messages`: An integer indicating the number of messages over which timing was averaged to generate the data under investigation. See [Number of Messages](Glossary#number-of-messages) in the [Glossary](Glossary).
- `Message Size`: An integer indicating the message size in bytes used for timing to generate the data under investigation. See [Message Size](Glossary#message-size) in the [Glossary](Glossary).
- `Number of warm-up messages`: An integer indicating the number of warm-up messages exchanged before timing began to generate the data under investigation. See [Number of Warm-Up Messages](Glossary#number-of-warm-up-messages) in the [Glossary](Glossary).
- `collectPNum`: A legacy integer kept for backwards compatibility.
- `Number of serial retests`: An integer indicating the number of serial retests of the worst connections performed after the main part of the LinkTest run to generate the data under investigation. See [Serial Retesting](Glossary#serial-retesting) in the [Glossary](Glossary).
- `Allocator type`: A string indicating the type of memory-buffer allocator used. See [Memory-Buffer Allocator](Glossary#memory-buffer-allocator) in the [Glossary](Glossary).
- `Number of multiple buffers`: An integer indicating the number of buffers used. Only present if `Use multiple memory buffers?` is non-zero. See [Multiple Buffers](Glossary#multiple-buffers) in the [Glossary](Glossary).
- `Mersenne Twister seed`: An integer indicating the initial seed for the Mersenne Twister PRNG. Only present if `Randomize memory-buffer content?` is non-zero. See [Buffer Randomization](Glossary#buffer-randomization) in the [Glossary](Glossary).
- `Minimum time`: A double-precision floating-point number indicating the shortest average time it took two LinkTest tasks to exchange messages during the LinkTest run that generated the data under investigation.
- `Maximum time`: A double-precision floating-point number indicating the longest average time it took two LinkTest tasks to exchange messages during the LinkTest run that generated the data under investigation.
- `Averagetime`: A double-precision floating-point number indicating the global average time it took two LinkTest tasks to exchange messages during the LinkTest run that generated the data under investigation.
- `Minimum all-to-all time`: A double-precision floating-point number indicating the shortest average time it took to perform an all-to-all communication on a LinkTest task during the LinkTest run that generated the data under investigation. This value is only available if the metadata-dictionary element `Test MPI all-to-all?` is non-zero.
- `Maximum all-to-all time`: A double-precision floating-point number indicating the longest average time it took to perform an all-to-all communication on a LinkTest task during the LinkTest run that generated the data under investigation. This value is only available if the metadata-dictionary element `Test MPI all-to-all?` is non-zero.
- `Average all-to-all time`: A double-precision floating-point number indicating the global average time it took to perform an all-to-all communication on a LinkTest task during the LinkTest run that generated the data under investigation. This value is only available if the metadata-dictionary element `Test MPI all-to-all?` is non-zero.

Some of these names are not optimal and will likely change in the future. For the moment they are kept for backwards compatibility.
# The Timing Data

The `timingData` dictionary contains all the timing-related data. Some entries are only available depending on the metadata. For example, all-to-all times are only present if the metadata-dictionary element `Test MPI all-to-all?` is non-zero. The `timingData` dictionary contains the following entries:

- `Hosts`: A list of byte strings (`bytes`, which can be viewed as a kind of Python string), each at most 255 characters long, listing the host names in order of task number. Repeated host names indicate that multiple LinkTest tasks were run on a given host. How these tasks were pinned on the host is not recorded by LinkTest.
- `Timings`: A two-dimensional array (a matrix) of double-precision floating-point numbers holding the timing data recorded by LinkTest. Each row corresponds to a sending task, each column to a receiving task. Each time is the average time taken by the configured test for the corresponding sender-receiver pair. NaN values indicate that the respective connection was not tested.
- `Access Pattern`: A two-dimensional integer array (a matrix). Each row corresponds to a sending task, each column to a receiving task. The integer value is the step in which the connection between the sending and receiving task was tested. A zero indicates that the connection was never tested. For example, the main diagonal (top-left to bottom-right) should always be zero, as connections between a task and itself are not tested.
- `New serially retested timings`: A one-dimensional array of double-precision floating-point numbers holding the times found after serially retesting the slowest connections. The corresponding old times from before serial retesting can be found in `Old serially retested timings`. The corresponding connection partners can be found in `Serial-retest origin hosts` (source) and `Serial-retest destination hosts` (receiver). NaN or 0.0 values, which originate from older versions of LinkTest, indicate that a connection was not tested. In this case the same element in `Old serially retested timings` has the same value, and both `Serial-retest origin hosts` and `Serial-retest destination hosts` are 0 for this element.
- `Old serially retested timings`: A one-dimensional array of double-precision floating-point numbers holding the times found before serially retesting the slowest connections. The corresponding new times from after serial retesting can be found in `New serially retested timings`. The corresponding connection partners can be found in `Serial-retest origin hosts` (source) and `Serial-retest destination hosts` (receiver). NaN or 0.0 values, which originate from older versions of LinkTest, indicate that a connection was not tested. In this case the same element in `New serially retested timings` has the same value, and both `Serial-retest origin hosts` and `Serial-retest destination hosts` are 0 for this element.
- `Serial-retest origin hosts`: A one-dimensional integer array of the source tasks for the serial retesting of the slowest connections. The corresponding connection partners can be found in `Serial-retest destination hosts`. The associated timing data before and after serial retesting of the connection can be found in `Old serially retested timings` and `New serially retested timings` respectively. In files from older LinkTest versions, a 0 value for which `Serial-retest destination hosts` is also 0 indicates that the connection was not tested. `Old serially retested timings` and `New serially retested timings` should then also be 0.0 or NaN for this element, depending on the LinkTest version.
- `Serial-retest destination hosts`: A one-dimensional integer array of the receiving tasks for the serial retesting of the slowest connections. The corresponding connection partners can be found in `Serial-retest origin hosts`. The associated timing data before and after serial retesting of the connection can be found in `Old serially retested timings` and `New serially retested timings` respectively. In files from older LinkTest versions, a 0 value for which `Serial-retest origin hosts` is also 0 indicates that the connection was not tested. `Old serially retested timings` and `New serially retested timings` should then also be 0.0 or NaN for this element, depending on the LinkTest version.
- `All-to-all timings`: A one-dimensional array of double-precision floating-point numbers holding the all-to-all times for each host. Each time is the average time it took the host to complete n iterations of all-to-all communication, where n is the number of messages used for testing.

Some of these names are not optimal and will likely change in the future. For the moment they are kept for backwards compatibility.
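
To illustrate how `Hosts` and `Timings` relate, the following sketch looks up the slowest tested connection and reports the host names involved. It uses only core Python. The `sample` dictionary is a hand-written stand-in for the second output of `readSION()`, the helper `slowest_connection` is hypothetical, and the sketch assumes that `Timings` can be indexed row by row as `timings[sender][receiver]`.

```python
import math


def slowest_connection(timing):
    """Return (time, sending host, receiving host) of the slowest tested pair."""
    # Host names are stored as bytes; decode them for display
    hosts = [h.decode() if isinstance(h, bytes) else str(h) for h in timing["Hosts"]]
    worst = None
    for src, row in enumerate(timing["Timings"]):  # Rows: sending tasks
        for dst, t in enumerate(row):              # Columns: receiving tasks
            if math.isnan(t):                      # NaN marks an untested connection
                continue
            if worst is None or t > worst[0]:
                worst = (t, hosts[src], hosts[dst])
    return worst


nan = float("nan")

# Hand-written stand-in for the timing data returned by readSION()
sample = {
    "Hosts": [b"node01", b"node01", b"node02"],
    "Timings": [
        [nan, 1.0e-6, 4.0e-6],
        [1.1e-6, nan, 9.0e-6],
        [4.2e-6, 3.9e-6, nan],
    ],
}

print(slowest_connection(sample))
```

If NumPy is installed, `numpy.nanargmax` together with `numpy.unravel_index` gives the same row and column indices more compactly.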
# Correcting Old File-Format Bugs

The reader silently fixes bugs present in the SION files of older versions when loading the data. For example, older versions did not store the average times correctly: they should have been multiplied by a factor. When reading files with this issue, the reader corrects these values based on the information present in the file. This assumes that the SION file was not generated by a LinkTest stress test, something that is no longer possible in recent versions of LinkTest.
# Generating Reports from within Python

After loading a SION file into Python, its data can also be used to generate reports, see [LinkTest Report](LinkTest-Report). For this purpose the `linktest` package includes the function `report`, with the following definition:
```python
def report(metaData,                             # LinkTest metadata
           timingData,                           # LinkTest timing data
           output_report_filename,               # Report name and path, including file extension
           title_string="",                      # Report title
           cutoffs=[None, None],                 # Timing cutoffs for clipping timings
           domain="",                            # Domain to remove from host names
           downsampling_factor_matrix_ticks=1,   # Y-axis downsampling factor
           verbose=False,                        # If true, prints report progress to the terminal
           matrix_colormap="gist_rainbow_r",     # Matplotlib name for the matrix colourmap
           histogram_colormap="gist_rainbow_r",  # Matplotlib name for the histogram colourmap
           replace_slow=False                    # If true, retested timings replace original times
           ):
    ...  # Implementation omitted
```
where the inputs are as follows:

- `metaData`: The metadata dictionary for the LinkTest data. It should be in the same format as the first output from `readSION()`. This input is required.
- `timingData`: The timing-data dictionary for the LinkTest data. It should be in the same format as the second output from `readSION()`. This input is required.
- `output_report_filename`: A Python string specifying the path and filename of the report to be generated. The file extension after the last dot determines the file type, so a file extension is always necessary. This input is required.
- `title_string`: A Python string giving the title of the report.
- `cutoffs`: A two-element array containing a minimum (1st entry) and maximum (2nd entry) time. Values outside this range are clipped. An entry that is `None` is replaced by the minimum time (1st entry) or the maximum time (2nd entry).
- `domain`: A Python string to remove from host names to simplify their display next to the indexed-colour image.
- `downsampling_factor_matrix_ticks`: The minimum downsampling factor to use when displaying node names along the y-axis of the indexed-colour plot.
- `verbose`: If the input evaluates to true, timing information is printed to the terminal.
- `matrix_colormap`: The name of the Matplotlib colourmap to use for the indexed-colour image.
- `histogram_colormap`: The name of the Matplotlib colourmap to use for the histogram.
- `replace_slow`: If the input evaluates to true, then before anything is calculated the times from the retests of slow connections replace the timings from the main test wherever they are smaller.

You can generate a report in Python using the `report` function as follows:
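```python
import linktest  # Import the linktest package

# Read the SION file linktest.sion
metaData, timingData = linktest.readSION("linktest.sion")

# Generate the report. The third argument is the output file name and must
# include a file extension ("report.pdf" is only an example name).
linktest.report(metaData, timingData, "report.pdf", title_string="Title")
```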
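
The clipping behaviour described for `cutoffs` can be pictured with the following minimal sketch. It is not the code used by `report`. The helper `clip_timings` is hypothetical, and it assumes that NaN entries (untested connections) are left untouched and ignored when the minimum and maximum times are determined.

```python
import math


def clip_timings(timings, cutoffs=(None, None)):
    """Clip a 2D list of timings to [low, high]; None is filled from the data."""
    tested = [t for row in timings for t in row if not math.isnan(t)]
    low = min(tested) if cutoffs[0] is None else cutoffs[0]    # 1st entry: minimum
    high = max(tested) if cutoffs[1] is None else cutoffs[1]   # 2nd entry: maximum
    return [
        [t if math.isnan(t) else min(max(t, low), high) for t in row]
        for row in timings
    ]


nan = float("nan")

# Hand-written stand-in for a small timing matrix
timings = [
    [nan, 1.0, 5.0],
    [2.0, nan, 9.0],
    [3.0, 4.0, nan],
]

# No lower cutoff given, upper cutoff of 4.0: larger times are clipped to 4.0
clipped = clip_timings(timings, cutoffs=(None, 4.0))
print(clipped)
```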