Usage
LinkTest has to be started in parallel with an even number of processes, for example using srun --ntasks 2 linktest.
You can control its execution via the following command-line arguments:
-h
or --help
: Prints a help message. You can check the usage via linktest -h even without srun; the output should look similar to this:
Version : <<<VERSION>>>
Usage : linktest [options]
Possible options (default values in parentheses):
-h/--help Print this help message and exit
-v/--version Print LinkTest version and exit
-m/--mode VAL Transport Layer to be used [REQUIRED]*
-w/--num-warmup-messages VAL Number of warm-up messages [REQUIRED]
-n/--num-messages VAL Number of messages [REQUIRED]
-s/--size-messages VAL Message size in bytes [REQUIRED]
-o/--output VAL output file name (pingpong_results_bin.sion)
--no-sion-file Do not write data to sion file (0)
--parallel-sion-file Write data to SION file in parallel (0)
--num-slowest VAL Number of slowest pairs to be retested (10)
--min-iterations VAL LinkTest repeats for at least <min_iterations> (1)
--min-runtime VAL LinkTest runs for at least <min_runtime> seconds communication time (0.0)
--memory-buffer-allocator VAL Allocator type for memory buffers (DEFAULT)
--all-to-all Additionally perform MPI all-to-all tests (0)
--unidirectional Perform unidirectional tests (0)
--bidirectional Perform bidirectional tests (0)
--bisection Test between bisecting halves (0)
--serial-tests Serialize tests (0)
--randomize-steps Randomize step execution order (0)
--seed-randomize-steps VAL Seed for step randomization (9876543210)
--use-gpu-memory Use GPU memory to store message buffers (0)
--multi-buffer Use multiple send and receive buffers (0)
--num-multi-buffer VAL Number of buffers when using multiple buffers (0)
--randomize-buffers Randomize buffers (0)
--seed-randomize-buffers VAL Seed for buffer randomization (1234567890)
--check-buffers Check buffers after timing kernel (0)
--num-randomize-tasks VAL Use VAL different randomly assigned process IDs for communication scheme (0)
--seed-randomize-tasks VAL Seed for task randomization (29309775)
--group-processes-by-hostname Group processes by hostnames and only test group to group (0)
* This build supports [<<<SUPPORTED COMMUNICATION APIs>>>].
Alternatively to --mode, the transport layer can be defined by using linktest.LAYER
or setting environment variable LINKTEST_VCLUSTER_IMPL
where <<<VERSION>>> is the three-part version of the LinkTest executable and <<<SUPPORTED COMMUNICATION APIs>>> is a list of supported communication APIs/layers. This option supersedes all others. When executed with this command-line option, LinkTest does not need to be run in parallel.
-v
or --version
: Prints the following version information:
FZJ Linktest (<<<VERSION>>>)
where <<<VERSION>>> is the three-part version of the LinkTest executable. As with the -h or --help option, LinkTest does not need to be executed in parallel with this option. This option supersedes all others aside from -h or --help.
-m
or --mode
: Specifies that the following ASCII string indicates the communication API to use for testing. Alternatively, the communication API can be extracted from the extension of the LinkTest executable name or from the LINKTEST_VCLUSTER_IMPL environment variable. When multiple ways of specifying the communication API are used, -m or --mode supersedes the LinkTest executable extension, which in turn supersedes the LINKTEST_VCLUSTER_IMPL environment variable.
-w
or --num-warmup-messages
: Specifies that the following integer indicates the number of warm-up messages to use to warm up a connection before testing it. When not printing help or version information this command-line argument is required.
-n
or --num-messages
: Specifies that the following integer indicates the number of messages measurements should be averaged over during testing. When not printing help or version information this command-line argument is required.
-s
or --size-messages
: Specifies that the following integer indicates the message size in bytes for testing. When not printing help or version information this command-line argument is required.
-o
or --output
: Specifies that the following string indicates the filename of the output SION file.
--no-sion-file
: Specifies that the collected results should not be written out into a SION file.
--parallel-sion-file
: Specifies that the collected results should be written out into a SION file in parallel if writing is enabled.
--num-slowest
: Specifies that the following integer indicates the number of slowest connections to serially retest after the end of the main test.
--min-iterations
: Specifies that the following integer indicates the number of times the LinkTest benchmark should be repeated. If not one, the writing of SION files is disabled. This command-line argument is useful for applying a communication load to the system.
--min-runtime
: Specifies that the following floating-point number indicates the number of seconds that LinkTest should repeat itself for. If non-zero, the writing of SION files is disabled. This command-line argument is useful for applying a communication load to the system.
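For example, LinkTest can be used as a pure communication load generator (a sketch; adjust the task count, message size, and runtime to the system under test):
# Apply communication load for at least 600 s of communication time;
# the non-zero --min-runtime disables writing of SION files
srun --ntasks 64 \
    linktest \
    --mode mpi \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((1024*1024)) \
    --min-runtime 600;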
--memory-buffer-allocator
: Specifies that the following string indicates the memory buffer allocator type to be used for allocating the memory buffers for sending and receiving data. The following options are permitted:
memory_buffer_allocator | Description |
---|---|
DEFAULT | The default allocator: the POSIX_aligned-memory_allocator, if possible, otherwise the aligned-memory_allocator, or the CUDA_memory_allocator if testing the CUDA API. |
aligned-memory_allocator | Uses aligned_alloc to allocate buffers on a page boundary. |
pinned-memory-map_allocator | Uses mmap to allocate buffers on a page boundary. |
POSIX_aligned-memory_allocator | Uses posix_memalign to allocate buffers on a page boundary. |
CUDA_memory_allocator | Uses cudaMalloc to allocate memory on the GPU. |
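For example, to allocate the message buffers via mmap instead of the default allocator (a sketch using the allocator names from the table above):
# Request the mmap-based pinned allocator for the message buffers
srun --ntasks 2 \
    linktest \
    --mode mpi \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((16*1024*1024)) \
    --memory-buffer-allocator pinned-memory-map_allocator;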
--all-to-all
: Specifies that all-to-all testing should be done before and after the main LinkTest test if the used communication API is MPI.
--unidirectional
: Specifies that testing should occur unidirectionally instead of semidirectionally, which is the default. Currently only the MPI communication API is supported.
--bidirectional
: Specifies that testing should occur bidirectionally instead of semidirectionally, which is the default. Cannot be used in conjunction with --unidirectional.
--bisection
: Specifies that the tasks for testing should be split in two halves and that testing should only occur between these two. This is useful for determining bisection bandwidths. For more information see Communication Patterns.
--serial-tests
: Specifies that connections should be tested in serial. By default testing occurs in parallel.
--randomize-steps
: Specifies that the step order in which tests are performed is to be randomized.
--seed-randomize-steps
: Specifies that the following integer is to be used as a seed for step randomization. This option is only important if --randomize-steps is specified. The seed value can be between 1 and 2^32-1.
--use-gpu-memory
: Specifies that GPU memory should be used for the message buffers.
--multi-buffer
: Specifies that multiple message buffers should be used for sending and receiving messages. Currently only works in conjunction with MPI and unidirectional testing.
--num-multi-buffer
: Specifies that the following integer indicates the number of multiple buffers to be used. Should be between 2 and the maximum of the number of messages and warm-up messages.
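For example (a sketch; as noted above, multiple buffers currently require MPI and unidirectional testing):
# Cycle through 4 send and receive buffers during unidirectional testing
srun --ntasks 2 \
    linktest \
    --mode mpi \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((16*1024*1024)) \
    --unidirectional \
    --multi-buffer \
    --num-multi-buffer 4;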
--randomize-buffers
: Specifies that the buffers should be randomized before sending and receiving. Randomization is done using the Mersenne Twister 19937 algorithm, which has a period of 2^19937-1. Currently does not work for GPU buffers.
--seed-randomize-buffers
: Specifies that the following integer is to be used as a seed for buffer randomization. This option is only important if --randomize-buffers is specified. The seed value can be between 1 and 2^32-1. Currently does not work for GPU buffers.
--check-buffers
: Specifies that buffers should be checked after each step. This only detects errors if the buffer was corrupted during the last write to it; if one of the messages in the middle is transferred incorrectly, this will not be detected.
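For example, the two buffer options can be combined into a simple data-integrity test (a sketch; not supported for GPU buffers):
# Fill buffers with random data and verify them after each timing kernel
srun --ntasks 2 \
    linktest \
    --mode mpi \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((16*1024*1024)) \
    --randomize-buffers \
    --check-buffers;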
--num-randomize-tasks
: Specifies that the following integer indicates the number of times the test should be repeated for random permutations of the process IDs. If 0, the process IDs are not randomized.
--seed-randomize-tasks
: Specifies that the following integer is to be used as a seed for process-ID randomization. This option is only important if the value of --num-randomize-tasks is nonzero. The seed value can be between 1 and 2^32-1.
--group-processes-by-hostname
: Specifies that process IDs should be grouped according to their hostname. Testing then occurs exclusively between group pairs: when testing a group pair, all possible connection pairs between processes of the two groups are iterated through, while ensuring that communication only happens between the two groups. This is done for all possible group pairs. If --bisection is also specified, the set of groups is split into bisecting halves and tests only occur between the two halves. For more information see Communication Patterns. An example is shown below.
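For example, to test node-to-node connections when running several tasks per node (a sketch assuming 4 tasks on each of 2 nodes):
# Group the 4 tasks per node by hostname and test group against group
srun --nodes 2 --ntasks-per-node 4 \
    linktest \
    --mode mpi \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((16*1024*1024)) \
    --group-processes-by-hostname;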
The arguments --num-warmup-messages, --num-messages and --size-messages are required. The transport layer is usually given through the --mode option. In rare cases where this does not work, you can fall back to the linktest.LAYER executables and/or set the environment variable LINKTEST_VCLUSTER_IMPL.
# Option 1: Using mode to specify the virtual-cluster implementation
srun --ntasks 4 \
linktest \
--mode mpi \
--num-warmup-messages 10 \
--num-messages 100 \
--size-messages $((16*1024*1024));
# Option 2: Using a linktest executable with a suffix
srun --ntasks 4 \
linktest.mpi \
--num-warmup-messages 10 \
--num-messages 100 \
--size-messages $((16*1024*1024));
# Option 3: Using the LINKTEST_VCLUSTER_IMPL environment variable
export LINKTEST_VCLUSTER_IMPL=mpi;
srun --ntasks 4 \
linktest \
--num-warmup-messages 10 \
--num-messages 100 \
--size-messages $((16*1024*1024));
Except for the MPI and the node-internal CUDA transport layers, all layers utilize the TCP-sockets implementation underneath for setup and exchange of data in non-benchmark code segments. The TCP-layer implementation uses a lookup of the node's hostname to determine the IPs for the initial connection setup. There are currently only limited methods to customize this behavior. The code supports setting LINKTEST_SYSTEM_NODENAME_SUFFIX as a suffix to be added to the short hostname. For example, on JSC systems, LINKTEST_SYSTEM_NODENAME_SUFFIX=i may need to be exported to make sure the out-of-band connection setup is done via the IPoIB network.
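For example (a sketch; the suffix value is system specific):
# Append "i" to the short hostname so the out-of-band setup uses IPoIB
export LINKTEST_SYSTEM_NODENAME_SUFFIX=i;
srun --ntasks 2 \
    linktest.tcp \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((16*1024*1024));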
With any transport layer but MPI or intra-node CUDA, it is important to make sure that the PMI (not MPI) environment is correctly set up. The easiest way to achieve this using Slurm is srun --mpi=pmi2 or srun --mpi=pmix. If this option is not available or not supported by Slurm, please consult the relevant PMI documentation for your system.
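For example (a sketch using the TCP layer under Slurm):
# Launch with a PMIx-enabled srun so non-MPI layers can bootstrap via PMI
srun --mpi=pmix --ntasks 2 \
    linktest.tcp \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((16*1024*1024));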
Supported Combinations of Communication APIs & Various Options
Not all option combinations are currently possible. The following table shows supported combinations.
Option | MPI | TCP | UCP/IBVerbs/PSM2/Portals | CUDA |
---|---|---|---|---|
Unidirectional | | | | |
Semidirectional | | | | |
Bidirectional | | | | |
Bisection | | | | |
Hostname Grouping | | | | |
All-to-All | | | | |
Randomize | | | | |
Serial Tests | | | | |
No SION File | | | | |
Parallel SION File | | | | |
Use GPU Memory | | | | |
Min. Iterations | | | | |
Min. Runtime | | | | |
Use Multi. Buffers | | | | |
Check Buffers | | | | |
Randomize Buffers | | | | |
Randomize IDs | | | | |
Performing bisection bandwidth tests
Currently only works with MPI
To perform a bisection bandwidth test, in which the parallel bandwidth between two bisecting halves of tasks is measured (with tasks usually placed in a specific configuration of interest with respect to the network topology), two options need to be set.
--bisection
splits the tasks into two halves and only tests between them. If the tasks assigned to LinkTest are enumerated 0 to n-1 inclusive, where n is even, then tasks 0 to n/2-1 inclusive are assigned to the first half and tasks n/2 to n-1 inclusive are assigned to the second half. Tasks should be pinned to nodes such that the desired test configuration is achieved.
--unidirectional
causes LinkTest to test connections unidirectionally in parallel. Testing semidirectionally or bidirectionally does not ensure that communication occurs unidirectionally between the two halves at any given point in time. --bidirectional can be used with the understanding that at no point do the tests guarantee a certain communication pattern and direction between the two bisecting halves; the individual communications cannot be sufficiently synchronized for this. For semidirectional testing we have seen that the communication organizes itself such that on a given link communication occurs in one direction, but the direction in which any given link communicates at any given time is random.
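A minimal sketch of such a bisection test (assuming the tasks are pinned so that the two halves match the topology of interest):
# Unidirectional bisection bandwidth test between two halves of 8 tasks
srun --ntasks 8 \
    linktest \
    --mode mpi \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages $((16*1024*1024)) \
    --bisection \
    --unidirectional;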
Usage of TCP Communication API Without miniPMI
LinkTest can be configured to test MPI or TCP without the miniPMI library. In the case of MPI no additional work is necessary, aside from executing with mpiexec or the like, and LinkTest can be used as above. When testing TCP communication without the miniPMI library, the cluster configuration needs to be specified explicitly via the following four environment variables: LINKTEST_TCP_SIZE, LINKTEST_TCP_RANK, LINKTEST_TCP_IPADDR_<<<RANK>>> and LINKTEST_TCP_PORT_<<<RANK>>>.
LINKTEST_TCP_SIZE
: An integer indicating the number of tasks to be used for the test.
LINKTEST_TCP_RANK
: The rank of the current task.
LINKTEST_TCP_IPADDR_<<<RANK>>>
: The IP address of rank <<<RANK>>>, where <<<RANK>>> is the eight-digit zero-filled integer rank to which the environment variable corresponds.
LINKTEST_TCP_PORT_<<<RANK>>>
: The communication port to use for rank <<<RANK>>>, where <<<RANK>>> is the eight-digit zero-filled integer rank to which the environment variable corresponds. Note that it is imperative that these ports are free on the respective machines. LinkTest will not test this, nor will it port-scan to find free ports and communicate them to the partners. Setting free ports is the user's responsibility.
For a given task, LINKTEST_TCP_SIZE and LINKTEST_TCP_RANK must be specified. LINKTEST_TCP_IPADDR_<<<RANK>>> and LINKTEST_TCP_PORT_<<<RANK>>> must also be specified for all other tasks.
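As an illustration, a static two-task configuration could look as follows (hypothetical IP addresses and ports; note the eight-digit zero-filled rank suffixes):
# On the first host (the second host sets LINKTEST_TCP_RANK=1 instead)
export LINKTEST_TCP_SIZE=2;
export LINKTEST_TCP_RANK=0;
export LINKTEST_TCP_IPADDR_00000000=192.168.0.1;
export LINKTEST_TCP_PORT_00000000=60000;
export LINKTEST_TCP_IPADDR_00000001=192.168.0.2;
export LINKTEST_TCP_PORT_00000001=60001;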
With the cluster environment thus configured, LinkTest can be executed as normal. Below is an example of how to set up this cluster environment given a host-name list, which in this case is queried via a SLURM environment variable, under the assumption that the script is submitted via SLURM and that there is one task per node:
# 1. List of Host Names
hosts=($(scontrol show hostnames ${SLURM_JOB_NODELIST} | paste -s -d " "))
# 2. Export TCP Size & Rank
export LINKTEST_TCP_SIZE=${SLURM_NTASKS};
for i in $(seq 0 $((${#hosts[@]}-1))); do
if [ "${HOSTNAME}" == "${hosts[${i}]}" ]; then
export LINKTEST_TCP_RANK=${i};
fi
done
# 3. Export TCP IP-Address & Port
base_port=60000;
for i in $(seq 0 $((${#hosts[@]}-1))); do
task=$(printf "%08d\n" ${i});
export LINKTEST_TCP_IPADDR_${task}=$(getent hosts "${hosts[${i}]}" | awk '{ print $1 }');
export LINKTEST_TCP_PORT_${task}=$((${base_port}+${i}));
done
# 4. Execute LinkTest
linktest --mode tcp --num-warmup-messages 10 --num-messages 1000 --size-messages 1024 --output tcp.sion;
JSC Run Examples
LinkTest on 2048 nodes, 1 task per node, message size 16 MiB, 2 warmup messages and 4 messages for measurement:
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 2048 srun -n 2048 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024))
LinkTest on 936 nodes, 4 tasks per node (one per GPU) using device memory:
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 936 srun -n 3744 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024)) --use-gpu-memory
Bidirectional bandwidth test:
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 936 srun -n 3744 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024)) --use-gpu-memory --bidirectional
Perform exchange only between bisecting halves:
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 936 srun -n 3744 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024)) --use-gpu-memory --bisection
LinkTest on JUSUF (MPI through UCP)
$ xenv -L GCC -L CUDA -L ParaStationMPI \
env UCX_TLS=rc_x,self,sm UCX_NET_DEVICES=mlx5_1:1 \
/usr/bin/salloc -A root -N 168 \
srun -n 168 ./linktest --mode mpi \
--num-warmup-messages 4 \
--num-messages 10 \
--size-messages 16777216
Output
LinkTest writes measurement results to stdout and monitoring information to stderr. Additionally, by default, a binary file in SION format is produced that contains detailed measurement data. These files are often quite sparse and can therefore be compressed very efficiently if needed.
stdout
The stdout output starts with the settings that were given for this run (the exact output depends on the configuration):
-------------------- LinkTest Args -------------------------
Virtual-Cluster Implementation: mpi
Message length: 1024 B
Number of Messages: 1000
Number of Messages. (Warmup): 10
Communication Pattern: Semidirectional End to End
use gpus: No
mixing pe order: No
serial test only: No
max serial retest: 2
write protocol (SION): Yes, funneled
output file: "linktest_mpi_2nx4c.sion"
------------------------------------------------------------
followed by the main benchmark cycle
Starting Test of all connections:
---------------------------------
Parallel PingPong for step 1: avg: 3.41977 us ( 285.5639 MiB/s) min: 3.24080 us ( 301.3333 MiB/s) max: 4.20862 us ( 232.0387 MiB/s)
Analyse Summary: min. 3.2408 us ( 301.333 MiB/s) max. 4.2086 us ( 232.039 MiB/s) avg. 3.4198 us ( 285.564 MiB/s)
Timing Summary: 1 step(s) required 33.05570 ms ( 33.05570 ms/step). 6 step(s) remain. Estimated time remaining: 198.33422 ms
Parallel PingPong for step 2: avg: 2.07276 us ( 471.1417 MiB/s) min: 438.45200 ns ( 2.1751 GiB/s) max: 3.87595 us ( 251.9545 MiB/s)
Analyse Summary: min. 438.4520 ns ( 2.175 GiB/s) max. 4.2086 us ( 232.039 MiB/s) avg. 2.7463 us ( 355.597 MiB/s)
Timing Summary: 2 step(s) required 65.68457 ms ( 32.84228 ms/step). 5 step(s) remain. Estimated time remaining: 164.21142 ms
...
In each step, warm-up and measurement messages are sent to the communication partner. The communication partner changes from step to step. Each step prints the following:
Parallel PingPong for step: The aggregated measurement results of the current step
Analyse Summary: The aggregated results for all steps up to this point
Timing Summary: Summary of how long the steps took so far, and how much longer the benchmark is estimated to run
After the benchmark is finished, the aggregated results for all steps are printed:
Linktest Timing Results - Iteration 1:
RESULT: Min Time: 433.63310397 ns ( 2.199 GiB/s)
RESULT: Max Time: 4.62629204 us ( 211.090 MiB/s)
RESULT: Avg Time: 2.25120053 us ( 433.796 MiB/s)
At the end, the slowest connections are retested serially, which ensures that LinkTest places no additional stress on the system aside from the stress required to measure the connection. This is useful for determining whether the poor performance of a given connection is due to the load LinkTest places on the system, for example on the interconnects, or whether the connection is genuinely bad, for example due to a badly seated connector.
0: PINGPONG 3 <-> 6: 1st: 4.62629 us ( 211.0897 MiB/s) 2nd: 3.89782 us ( 250.5408 MiB/s)
1: PINGPONG 2 <-> 5: 1st: 4.20862 us ( 232.0387 MiB/s) 2nd: 3.17407 us ( 307.6689 MiB/s)
Linktest Slow-Pairs Results - Iteration 1:
RESULT: Min Time: 3.17407004 us ( 307.669 MiB/s)
RESULT: Max Time: 3.89781850 us ( 250.541 MiB/s)
RESULT: Avg Time: 3.53594427 us ( 276.182 MiB/s)
stderr
The stderr output shows information intended for debugging/monitoring purposes. The following example shows two info messages, the memory usage on each node, and the runtime of non-measurement steps in LinkTest.
[linktest.cc in main:92] info: System string = "generic"
[benchmark.cc in benchmark:902] info: Using PinnedMmapAllocator
timings[000] [first sync] t= 30.69149 ms
task[000000] on jrc0734.jureca ( 0) mem= 145.5898 kiB
task[000001] on jrc0734.jureca ( 1) mem= 145.3633 kiB
task[000002] on jrc0734.jureca ( 2) mem= 145.3398 kiB
task[000003] on jrc0734.jureca ( 3) mem= 145.3477 kiB
task[000004] on jrc0735.jureca ( 4) mem= 145.4297 kiB
task[000005] on jrc0735.jureca ( 5) mem= 145.3516 kiB
task[000006] on jrc0735.jureca ( 6) mem= 147.4375 kiB
task[000007] on jrc0735.jureca ( 7) mem= 145.4062 kiB
timings[000] [mapping] t= 643.33295 us
timings[000] [randvec] t= 339.93274 ns
PE00000: psum=37 pasum=37 do_mix=0
timings[000] [getpart] t= 14.67997 us
timings[000] [search slow] t= 82.80016 us
timings[000] [test slow] t= 14.33950 ms
linktest_output_sion_collect_local_data[0] alloc+init local buffer of size 831 bytes for 8 tasks
timings[000] [sioncollect] t= 82.95011 us
timings[000] [sioncollwr] t= 101.74610 ms
timings[000] [sionclose] t= 403.51134 us
[sionwrite] 3904 B
timings[000] [all] t= 312.74890 ms
SION Files
Unless turned off, LinkTest will by default also generate a binary SION file, whose default name is pingpong_results_bin.sion. This file contains the LinkTest measurements, a list of the involved hosts, as well as the options passed to LinkTest when it was executed.
If --no-sion-file is specified as a command-line option when executing LinkTest, then no SION file is generated. If --parallel-sion-file is specified, then the output SION file, if enabled, is written out in parallel, which speeds up output to file systems that support parallel access. The name of the output SION file can be changed via the command-line argument -o or --output followed by a space and the name of the file.
SION File Defragmentation
The format of these SION files is optimized for parallel access, which causes them to be very sparse. You can compress the SION files as follows:
siondefrag -q 1 input.sion output.sion
where input.sion is the name of the input SION file and output.sion is the name of the output SION file. Note that in-place compression is possible; the names of the input and output SION files can be identical.
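For example, to compress the default output file in place:
# In-place defragmentation/compression of a LinkTest result file
siondefrag -q 1 pingpong_results_bin.sion pingpong_results_bin.sion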