# Fiddling with CPU Mask for GPU/non-GPU Processes
This repo contains examples to generate some more advanced CPU masks for JUWELS Booster, plus associated tools. While those are made for JUWELS Booster, they can easily be ported to any other system.
## Associated Tools
* `mask-tools.py`: Maybe the most relevant tool here. Python CLI app with two sub-commands to a) generate a hexadecimal mask for given CPU cores (also taking in ranges), and b) generate a binary representation of a hexadecimal mask to quickly study the mask. See `mask-tools.py --help` for help and usage info; a sketch of the underlying mask arithmetic follows this list.
* `omp_id.c` / `Makefile`: Simple C program to print MPI rank, master thread ID and OMP-parallel thread IDs; compile with the `Makefile`
* `combine-masks.py`: Takes CPU hex masks for individual NUMA domains and combines NUMA domains pairwise.
* `process_info.sh`: Similar to `omp_id.c`, but without OMP; prints GPU ID, MPI rank, CPU affinity mask, and the last CPU core the process ran on
* `get_close_gpu.sh`: Helper script needed by `split_mask.sh` (Example 2); it just holds a list of GPUs with affinity to each NUMA domain
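The arithmetic behind `mask-tools.py` and `combine-masks.py` is plain bit manipulation. Below is only a minimal sketch of that arithmetic, not the repository's code; it assumes that combining NUMA domains pairwise amounts to a bitwise OR of the two masks. For the real tools, `mask-tools.py --help` documents the actual sub-commands and argument format.

```python
# Minimal sketch of the mask arithmetic; not the actual mask-tools.py / combine-masks.py code.

def cores_to_mask(cores):
    """Hexadecimal affinity mask for an iterable of CPU core IDs."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return hex(mask)

def mask_to_binary(mask_hex):
    """Binary representation of a hexadecimal mask, for quick inspection."""
    return bin(int(mask_hex, 16))

def combine_masks(mask_a, mask_b):
    """Pairwise combination of two NUMA-domain masks (assumed here to be a bitwise OR)."""
    return hex(int(mask_a, 16) | int(mask_b, 16))

# Reproduces the first mask used in Example 2 below (cores 0-11 and 48-59):
print(cores_to_mask(list(range(0, 12)) + list(range(48, 60))))  # 0xfff000000000fff
```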
## Example 1
**Task:** 4 MPI ranks on a node, 1 rank spanning 3 NUMA domains. Each rank first launches a GPU kernel before opening up OMP parallel regions to keep the CPU cores busy. The rank's GPU-dispatching master thread needs to be launched from a core with GPU affinity.
**Strategy:** Use `OMP_PLACES` to provide an explicit list of cores to the application. `OMP_PLACES` is the list of cores as retrieved by `numactl -s`, but re-sorted such that the first core has GPU affinity.
**Usage:** Insert `put_to_first_core.sh` as a wrapper between the Slurm launcher (`srun`) and the application.
```bash
OMP_NUM_THREADS=24 srun -n 4 --cpu-bind=verbose,mask_ldoms:0xc,0x3,0xc0,0x30 ./put_to_first_core.sh ./omp_id
```
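`put_to_first_core.sh` is not reproduced here; the sketch below only illustrates the reordering described in the strategy, namely taking the cores the rank may run on, moving a GPU-affine core to the front, and exposing the result as an explicit `OMP_PLACES` list. Which core is GPU-affine is hard-coded purely for illustration; the actual wrapper determines it from the node's topology before launching the application (here `./omp_id`), which then inherits `OMP_PLACES`.

```python
# Illustration of the OMP_PLACES reordering only; assumes core 6 is the rank's
# GPU-affine core, which the real put_to_first_core.sh would determine itself.
import os

allowed = sorted(os.sched_getaffinity(0))   # cores of this rank, as set via --cpu-bind
gpu_affine_core = 6                         # placeholder value for illustration

places = [gpu_affine_core] + [c for c in allowed if c != gpu_affine_core]
os.environ["OMP_PLACES"] = ",".join(f"{{{c}}}" for c in places)
print(os.environ["OMP_PLACES"])             # e.g. {6},{0},{1},{2},...
```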
## Example 2
_Very similar to, but not identical to, the previous example; it also predates Example 1._
**Task:** Utilize most cores of JUWELS Booster by pairing a core with GPU affinity with the remaining cores (without GPU affinity) within a rank. There should be 4 ranks per node; in each rank, 1 GPU process is launched and the remaining cores are given to the CPU-only process.
**Usage:** Launch `split_mask.sh` with 2 arguments. The first argument is the process which should not run on the GPU, the second argument is the process to be run on the GPU. The corresponding CPU masks are set, and `CUDA_VISIBLE_DEVICES` is set as well. If only 1 argument is provided, it is used for both cases. No argument or more than 2 arguments will result in the masks being printed (for debugging).
```bash
srun -n 2 --cpu-bind=mask_cpu:0xfff000000000fff,0xfff000000000fff000 \
    bash split_mask.sh "<CPU-only command>" "<GPU command>"
```
The script makes implicit assumptions about the AMD EPYC CPUs in JUWELS Booster. Handle with care on other systems.
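To make the split concrete: per rank, a single GPU-affine core is carved out of the rank's mask for the GPU process, the CPU-only process gets (roughly) the remainder, and `CUDA_VISIBLE_DEVICES` points the GPU process at a close GPU. The sketch below only shows the bit manipulation involved; the core and GPU numbers are taken from rank 8 in the sample output below, and the exact remainder computed by `split_mask.sh` may differ slightly from this naive version.

```python
# Sketch of the per-rank mask split; not split_mask.sh itself.
rank_mask = int("0xfff000000000fff", 16)  # full mask of the rank (cores 0-11 and 48-59)
gpu_core = 6                              # GPU-affine core, cf. rank 8 in the sample output
gpu_id = 1                                # close GPU, cf. get_close_gpu.sh and the sample output

gpu_mask = 1 << gpu_core                  # single core for the GPU process
cpu_mask = rank_mask & ~gpu_mask          # remainder for the CPU-only process

print(hex(gpu_mask))                      # 0x40
print(hex(cpu_mask))                      # 0xfff000000000fbf
print(f"CUDA_VISIBLE_DEVICES={gpu_id}")   # exported for the GPU process only
```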
**Output:**
```bash
❯ srun -n 9 --cpu-bind=verbose,mask_cpu:0xfff000000000fff,0xfff000000000fff000,0xfff000000000fff000000,0xfff000000000fff000000000 bash split_mask.sh "bash process_info.sh" |& sort
...
MPI Rank: 7;CUDA_VIS_DEV: ;pid 3126's current affinity list: 36-41,43-47,84-95;Last CPU core: ...
MPI Rank: 8;CUDA_VIS_DEV: 1;pid 3152's current affinity list: 6;Last CPU core: 6
MPI Rank: 8;CUDA_VIS_DEV: ;pid 3127's current affinity list: 0-5,7-11,48-55,57-59;Last CPU core: 50
```
-Andreas Herten, 12 December 2020