@@ -104,6 +104,44 @@ Except for the MPI and the node-internal CUDA transport layer, all layers utiliz
...
With any transport layer other than MPI or intra-node CUDA, it is important to make sure that the PMI (not MPI) environment is set up correctly. The easiest way to achieve this with SLURM is `srun --mpi=pmi2` or `srun --mpi=pmix`. If neither option is available or supported by your SLURM installation, please consult the relevant PMI documentation for your system.
## Usage of TCP Communication API Without miniPMI
Linktest can be configured to test MPI or TCP without the miniPMI library. In the case of MPI no additional work is necessary, aside from executing with `mpiexec` or the like, and Linktest can be used as above. When testing TCP communication without the miniPMI library, the cluster configuration needs to be specified explicitly via the following four environment variables: `LINKTEST_TCP_SIZE`, `LINKTEST_TCP_RANK`, `LINKTEST_TCP_IPADDR_<<<RANK>>>` and `LINKTEST_TCP_PORT_<<<RANK>>>`.
`LINKTEST_TCP_SIZE`: An integer indicating the number of tasks to be used for the test.
`LINKTEST_TCP_RANK`: The rank of the current task.
`LINKTEST_TCP_IPADDR_<<<RANK>>>`: The IP address of rank `<<<RANK>>>`, where `<<<RANK>>>` is the eight-digit zero-filled integer rank to which the environment variable corresponds.
`LINKTEST_TCP_PORT_<<<RANK>>>`: The communication port used by rank `<<<RANK>>>`, where `<<<RANK>>>` is the eight-digit zero-filled integer rank to which the environment variable corresponds. Note that it is imperative that these ports are free on the respective machines. Linktest will not check this, nor will it port-scan to find free ports and communicate them to the partners. Setting free ports is the user's responsibility.
For a given task, `LINKTEST_TCP_SIZE` and `LINKTEST_TCP_RANK` must be specified. `LINKTEST_TCP_IPADDR_<<<RANK>>>` and `LINKTEST_TCP_PORT_<<<RANK>>>` must also be specified for all other tasks.
With the cluster environment configured in this way, Linktest can be executed as usual. Below is an example of how to configure this cluster environment given a host-name list, which in this case is queried via a SLURM environment variable under the assumption that this script is submitted via SLURM and that there is one task per node:
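To illustrate the naming scheme only, here is a minimal sketch for a hand-configured two-task run. The addresses `192.0.2.10` and `192.0.2.11` and the port `50000` are hypothetical placeholders, not values from Linktest; the port is merely assumed to be free on both machines. The sketch sets the address and port variables for every rank, including the task's own, which goes slightly beyond the stated requirement.

```BASH
# Hypothetical two-task configuration; replace addresses and port with real values.
rank=0
for addr in 192.0.2.10 192.0.2.11; do
  # Eight-digit zero-filled rank suffix, e.g. 00000000, 00000001
  suffix=$(printf "%08d" "${rank}")
  export "LINKTEST_TCP_IPADDR_${suffix}=${addr}"
  export "LINKTEST_TCP_PORT_${suffix}=50000"   # assumed to be free on that machine
  rank=$((rank + 1))
done

# Total number of tasks, and the rank of this particular task.
export LINKTEST_TCP_SIZE="${rank}"
export LINKTEST_TCP_RANK=0   # set to 1 on the second machine
```

Each task runs the same loop and differs only in the value of `LINKTEST_TCP_RANK`.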
```BASH
# 1. List of Host Names
hosts=($(scontrol show hostnames ${SLURM_JOB_NODELIST} | paste -s -d " "))