|
|
Linktest has to be started in parallel with an even number of processes, for example using `srun --ntasks 2 linktest`. You can check the usage via `linktest -h` (even without `srun`), which should look similar to this:
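A minimal complete invocation therefore combines `srun` with the four required options; the option values below are purely illustrative, not tuning recommendations:

```
# Minimal parallel run: 2 tasks, MPI transport, 1 KiB messages
srun --ntasks 2 ./linktest \
    --mode mpi \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages 1024
```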
|
|
|
```
|
|
|
Usage: ./linktest options
|
|
|
|
|
|
with the following optional options (default values in parentheses):
|
|
|
|
|
|
-h/--help print help message and exit
|
|
|
-v/--version print version and exit
|
|
|
-w/--num-warmup-messages VAL number of warmup pingpong messages [REQUIRED]
|
|
|
-n/--num-messages VAL number of pingpong messages [REQUIRED]
|
|
|
-s/--size-messages VAL message size in bytes [REQUIRED]
|
|
|
-m/--mode VAL transport Layer to be used [REQUIRED]
|
|
|
--alltoall perform all-to-all mode (e.g. MPI_Alltoall) (0)
|
|
|
--bidir perform bidirectional tests (0)
|
|
|
--use-gpus use GPUs (0)
|
|
|
--bisect perform a bandwidth tests between bisecting halves (0)
|
|
|
--randomize randomize processor numbers (0)
|
|
|
--serial-tests serialize tests (0)
|
|
|
--no-sion-file do not write data to sion file (0)
|
|
|
--parallel-sion-file write data in parallel (sion) (0)
|
|
|
--num-slowest VAL number of slowest pairs to be retested (10)
|
|
|
--min-iterations VAL linktest repeats for at least <min_iterations> (1)
|
|
|
--min-runtime VAL linktest runs for at least <min_runtime> seconds communication time (0.0)
|
|
|
-o/--output VAL output file name (pingpong_results_bin.sion)
|
|
|
|
|
|
```
|
|
|
`exampleRun.sh` showcases the execution, assuming the build procedure from `exampleBuild.sh` was used.
|
|
|
The transport layer is usually selected via the `--mode` option. In rare cases where this does not work, you can fall back to the `linktest.LAYER` executables and/or set the environment variable `LINKTEST_VCLUSTER_IMPL`.
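Such a fallback could be sketched as follows; the layer name `tcp` and the option values are only assumptions for illustration:

```
# Hypothetical fallback: pin the transport layer via the environment
# variable and use the layer-specific executable instead of --mode
export LINKTEST_VCLUSTER_IMPL=tcp
srun --ntasks 2 ./linktest.tcp \
    --mode tcp \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages 1024
```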
|
|
|
|
|
|
Except for the MPI and the node-internal CUDA transport layer, all layers utilize the TCP sockets implementation underneath for setup and exchange of data in non-benchmark code segments. The TCP layer implementation uses a lookup of the hostname of the node to determine the IPs for the initial connection setup. There are currently only limited methods to customize this behavior. The code supports the option to set `LINKTEST_SYSTEM_NODENAME_SUFFIX` as a suffix to be added to the short hostname. For example, on JSC systems, `LINKTEST_SYSTEM_NODENAME_SUFFIX=i` may need to be exported to make sure the out-of-band connection setup is done via the IPoIB network.
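On such a system, the suffix export is done before launching the run; this sketch uses the JSC suffix `i` from above, while the remaining option values are illustrative:

```
# Append "i" to the short hostname so the out-of-band connection
# setup resolves IPs via the IPoIB network (JSC example)
export LINKTEST_SYSTEM_NODENAME_SUFFIX=i
srun --ntasks 2 ./linktest \
    --mode tcp \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages 1024
```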
|
|
|
|
|
|
With any transport layer other than MPI or the node-internal CUDA layer, it is important to make sure that the PMI (not MPI) environment is set up correctly. The easiest way to achieve this with Slurm is to pass one of the following to the `srun` command: `--mpi=pmi2` or `--mpi=pmix`. If this option is not available or not supported by your Slurm installation, please consult the relevant PMI documentation for your system.
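For example, a run of a non-MPI layer under Slurm's PMIx plugin could look like this; the layer name `tcp` and the option values are assumptions for illustration:

```
# Request PMIx so the non-MPI transport layer can bootstrap its
# process environment before the benchmark starts
srun --mpi=pmix --ntasks 2 ./linktest \
    --mode tcp \
    --num-warmup-messages 10 \
    --num-messages 100 \
    --size-messages 1024
```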
|
|
|
|
|
|
# JSC Run Examples
|
|
|
|
|
|
**Linktest on 2048 nodes, 1 task per node, message size 16 MiB, 2 warmup messages and 4 messages for measurement:**
|
|
|
```
|
|
|
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 2048 srun -n 2048 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024))
|
|
|
```
|
|
|
|
|
|
**Linktest on 936 nodes, 4 tasks per node (one per GPU) using device memory:**
|
|
|
```
|
|
|
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 936 srun -n 3744 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024)) --use-gpus
|
|
|
```
|
|
|
|
|
|
**Bidirectional bandwidth test:**
|
|
|
```
|
|
|
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 936 srun -n 3744 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024)) --use-gpus --bidir
|
|
|
```
|
|
|
|
|
|
**Perform exchange only between bisecting halves:**
|
|
|
```
|
|
|
xenv -L GCC -L CUDA -L ParaStationMPI -L SIONlib salloc -N 936 srun -n 3744 ./linktest --mode mpi --num-warmup-messages 2 --num-messages 4 --size-messages $((16*1024*1024)) --use-gpus --bisect
|
|
|
```
|
|
|
|
|
|
**Linktest on JUSUF (MPI through UCP)**
|
|
|
|
|
|
```
|
|
|
$ xenv -L GCC -L CUDA -L ParaStationMPI \
|
|
|
env UCX_TLS=rc_x,self,sm UCX_NET_DEVICES=mlx5_1:1 \
|
|
|
/usr/bin/salloc -A root -N 168 \
|
|
|
srun -n 168 ./linktest --mode mpi \
|
|
|
--num-warmup-messages 4 \
|
|
|
--num-messages 10 \
|
|
|
--size-messages 16777216
|
|
|
```
|
|
|
|
|
|
# Output
|
|
|
Linktest writes measurement results to stdout and monitoring information to stderr. Additionally, by default a binary file in SION format is produced, containing detailed measurement data. These files are often quite sparse and can therefore be compressed very efficiently if needed.
|
|
|
|
|
|
## stdout
|
|
|
The stdout output starts with the settings used for this run:
|
|
|
```
|
|
|
------------------- Linktest Args ------------------------
|
|
|
Virtual-Cluster Implementation: mpi
|
|
|
Message length: 1024 B
|
|
|
Number of Messages: 1000
|
|
|
Number of Messages (Warmup): 10
|
|
|
Communication Pattern: Semidirectional End to End
|
|
|
use gpus: No
|
|
|
mixing pe order: No
|
|
|
serial test only: No
|
|
|
max serial retest: 2
|
|
|
write protocol (SION): Yes, funneled
|
|
|
output file: "linktest_mpi_2nx4c.sion"
|
|
|
----------------------------------------------------------
|
|
|
```
|
|
|
followed by the main benchmark cycle:
|
|
|
```
|
|
|
|
|
|
Starting Test of all connections:
|
|
|
---------------------------------
|
|
|
Parallel PingPong for step 1: avg: 3.41977 us ( 285.5639 MiB/s) min: 3.24080 us ( 301.3333 MiB/s) max: 4.20862 us ( 232.0387 MiB/s)
|
|
|
|
|
|
Analyse Summary: min. 3.2408 us ( 301.333 MiB/s) max. 4.2086 us ( 232.039 MiB/s) avg. 3.4198 us ( 285.564 MiB/s)
|
|
|
Timing Summary: 1 step(s) required 33.05570 ms ( 33.05570 ms/step). 6 step(s) remain. Estimated time remaining: 198.33422 ms
|
|
|
|
|
|
Parallel PingPong for step 2: avg: 2.07276 us ( 471.1417 MiB/s) min: 438.45200 ns ( 2.1751 GiB/s) max: 3.87595 us ( 251.9545 MiB/s)
|
|
|
|
|
|
Analyse Summary: min. 438.4520 ns ( 2.175 GiB/s) max. 4.2086 us ( 232.039 MiB/s) avg. 2.7463 us ( 355.597 MiB/s)
|
|
|
Timing Summary: 2 step(s) required 65.68457 ms ( 32.84228 ms/step). 5 step(s) remain. Estimated time remaining: 164.21142 ms
|
|
|
|
|
|
...
|
|
|
```
|
|
|
In each step, warmup and measurement messages are sent to the current communication partner; the partner changes from step to step. Each step prints the following:
|
|
|
**Parallel PingPong for step:** The aggregated measurement results of the current step
|
|
|
**Analyse Summary:** The aggregated results for all steps until this point
|
|
|
**Timing Summary:** A summary of how long the steps have taken so far and how much longer the benchmark is estimated to run
|
|
|
|
|
|
After the benchmark has finished, the aggregated results for all steps are printed:
|
|
|
```
|
|
|
Linktest Timing Results - Iteration 1:
|
|
|
RESULT: Min Time: 433.63310397 ns ( 2.199 GiB/s)
|
|
|
RESULT: Max Time: 4.62629204 us ( 211.090 MiB/s)
|
|
|
RESULT: Avg Time: 2.25120053 us ( 433.796 MiB/s)
|
|
|
```
|
|
|
Finally, the slowest connections are tested again, this time serially, ensuring there is no stress on the overall system (e.g. the interconnect):
|
|
|
```
|
|
|
0: PINGPONG 3 <-> 6: 1st: 4.62629 us ( 211.0897 MiB/s) 2nd: 3.89782 us ( 250.5408 MiB/s)
|
|
|
1: PINGPONG 2 <-> 5: 1st: 4.20862 us ( 232.0387 MiB/s) 2nd: 3.17407 us ( 307.6689 MiB/s)
|
|
|
Linktest Slow-Pairs Results - Iteration 1:
|
|
|
RESULT: Min Time: 3.17407004 us ( 307.669 MiB/s)
|
|
|
RESULT: Max Time: 3.89781850 us ( 250.541 MiB/s)
|
|
|
RESULT: Avg Time: 3.53594427 us ( 276.182 MiB/s)
|
|
|
```
|
|
|
## stderr
|
|
|
The stderr output shows information intended for debugging and monitoring purposes. The following, for example, shows two info messages, the memory usage on each node, and the runtime of the non-measuring steps in Linktest:
|
|
|
```
|
|
|
[linktest.cc in main:92] info: System string = "generic"
|
|
|
[benchmark.cc in benchmark:902] info: Using PinnedMmapAllocator
|
|
|
timings[000] [first sync] t= 30.69149 ms
|
|
|
task[000000] on jrc0734.jureca ( 0) mem= 145.5898 kiB
|
|
|
task[000001] on jrc0734.jureca ( 1) mem= 145.3633 kiB
|
|
|
task[000002] on jrc0734.jureca ( 2) mem= 145.3398 kiB
|
|
|
task[000003] on jrc0734.jureca ( 3) mem= 145.3477 kiB
|
|
|
task[000004] on jrc0735.jureca ( 4) mem= 145.4297 kiB
|
|
|
task[000005] on jrc0735.jureca ( 5) mem= 145.3516 kiB
|
|
|
task[000006] on jrc0735.jureca ( 6) mem= 147.4375 kiB
|
|
|
task[000007] on jrc0735.jureca ( 7) mem= 145.4062 kiB
|
|
|
timings[000] [mapping] t= 643.33295 us
|
|
|
timings[000] [randvec] t= 339.93274 ns
|
|
|
PE00000: psum=37 pasum=37 do_mix=0
|
|
|
timings[000] [getpart] t= 14.67997 us
|
|
|
timings[000] [search slow] t= 82.80016 us
|
|
|
timings[000] [test slow] t= 14.33950 ms
|
|
|
linktest_output_sion_collect_local_data[0] alloc+init local buffer of size 831 bytes for 8 tasks
|
|
|
timings[000] [sioncollect] t= 82.95011 us
|
|
|
timings[000] [sioncollwr] t= 101.74610 ms
|
|
|
timings[000] [sionclose] t= 403.51134 us
|
|
|
[sionwrite] 3904 B
|
|
|
timings[000] [all] t= 312.74890 ms
|
|
|
```
|
|
|
## sion file
|
|
|
The SION file, with the default name `pingpong_results_bin.sion`, contains the measurement data in binary form. Its format is optimized for parallel access, which causes it to be very sparse. You can compress it for minimal storage usage with `siondefrag -q 1 ...`.
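A compaction run could be sketched as follows; `siondefrag` ships with SIONlib, and the output file name here is only illustrative:

```
# Repack the sparse SION file into a compact copy
siondefrag -q 1 pingpong_results_bin.sion pingpong_results_compact.sion
```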
|
|
\ No newline at end of file |