Fiddling with CPU Mask for GPU/non-GPU Processes

This repo contains examples for generating more advanced CPU masks for JUWELS Booster, together with associated tools. While these are made for JUWELS Booster, they can easily be ported to other systems.

Associated Tools

  • mask-tools.py: Maybe the most relevant tool here. Python CLI app with two sub-commands to a) generate a hexadecimal mask for given CPU cores (also accepting ranges), and b) generate a binary representation of a hexadecimal mask to quickly study the mask. See mask-tools.py --help for help and usage info. A minimal sketch of the underlying bit arithmetic follows after this list.
  • omp_id.c / Makefile: Simple C program to print MPI rank, master thread ID and OMP-parallel thread IDs; compile with the Makefile
  • combine-masks.py: Takes CPU hex masks for individual NUMA domains and combines NUMA domains pairwise.
  • process_info.sh: Similar to omp_id.c, but prints the GPU ID instead of OMP thread IDs
  • get_close_gpu.sh: Helper script needed by Example 2; it merely holds a list of GPUs with affinity to a NUMA domain
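
For illustration, the bit arithmetic behind such masks is simple: bit n of the mask corresponds to CPU core n. The following Python sketch mirrors this idea; it is not the actual implementation of mask-tools.py, and the function names are made up.

  def cores_to_mask(cores):
      """Combine a list of core IDs into a hexadecimal CPU mask (bit n = core n)."""
      mask = 0
      for core in cores:
          mask |= 1 << core
      return hex(mask)

  def mask_to_cores(mask):
      """Expand a hexadecimal CPU mask back into the list of core IDs it selects."""
      value = int(mask, 16)
      return [bit for bit in range(value.bit_length()) if value >> bit & 1]

  print(cores_to_mask(range(12)))            # 0xfff
  print(mask_to_cores("0xfff000000000fff"))  # [0, ..., 11, 48, ..., 59]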

Example 1

Task: 4 MPI ranks on a node, 1 rank spanning 3 NUMA domains. Each process first launches a GPU kernel before opening up OMP parallel regions to keep the CPU cores busy. The GPU-dispatching master thread needs to be launched from a core with GPU affinity.

Strategy: Use OMP_PLACES to provide an explicit list of cores to the application. OMP_PLACES is the list of cores as retrieved by numactl -s, but reordered such that the first core has GPU affinity.
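
As a sketch of this reordering (hypothetical core numbers and GPU-affinity mapping, not taken from put_to_first_core.sh):

  # Reorder a rank's core list so that a GPU-affine core comes first,
  # then emit it in explicit OMP_PLACES syntax ({c1},{c2},...).
  cores = list(range(12, 24))  # cores of this rank, e.g. as reported by `numactl -s` (assumed)
  gpu_affine = {18}            # core(s) with affinity to this rank's GPU (assumed)

  reordered = sorted(cores, key=lambda c: c not in gpu_affine)  # stable sort: GPU-affine core(s) first
  omp_places = ",".join(f"{{{c}}}" for c in reordered)
  print(f"OMP_PLACES={omp_places}")  # the master thread lands on the first (GPU-affine) place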

Usage: Insert put_to_first_core.sh as a wrapper between the Slurm launcher (srun) and the application.

  OMP_NUM_THREADS=24 srun -n 4 --cpu-bind=verbose,mask_ldoms:0xc,0x3,0xc0,0x30 ./put_to_first_core.sh ./omp_id

Example 2

Very similar, but not identical, to the previous example; also, this was created well before Example 1.

Task: Utilize most cores of JUWELS Booster by associating a core with GPU affinity with the remaining cores (without GPU affinity) within a rank. There should be 4 ranks per node; in each rank, 1 GPU process is launched and the remaining cores are given to the CPU-only process.

Usage: Launch split_mask.sh with 2 arguments. The first argument is the process which should not run on the GPU, the second argument is the process to be run on the GPU. The corresponding CPU masks are set, and CUDA_VISIBLE_DEVICES is set as well. If only 1 argument is provided, it is used for both cases. No argument or more than 2 arguments will result in the masks being printed (for debugging).

  srun -n 2 --cpu-bind=mask_cpu:0xfff000000000fff,0xfff000000000fff000 \
    bash split_mask.sh ./app1 ./app2

(see the output below)

The script makes implicit assumptions about the AMD EPYC CPUs in JUWELS Booster. Handle with care on other systems.
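
For illustration only, the idea behind the split can be sketched as follows; which core within the mask is GPU-affine is an assumption here, not taken from split_mask.sh:

  # Take the rank's full CPU mask, carve out one GPU-affine core for the
  # GPU process, and leave the rest of the mask to the CPU-only process.
  full_mask = int("0xfff000000000fff", 16)  # mask of the first rank in the srun line above
  gpu_core = 6                              # assumed GPU-affine core within that mask

  gpu_mask = 1 << gpu_core
  cpu_only_mask = full_mask & ~gpu_mask

  print(f"GPU process mask:      {hex(gpu_mask)}")
  print(f"CPU-only process mask: {hex(cpu_only_mask)}")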

Output:

❯ srun -n 9 --cpu-bind=verbose,mask_cpu:0xfff000000000fff,0xfff000000000fff000,0xfff000000000fff000000,0xfff000000000fff000000000 bash split_mask.sh "bash process_info.sh" |& sort
cpu_bind=MASK - jwb0001, task  0  0 [2983]: mask 0xfff000000000fff set
cpu_bind=MASK - jwb0001, task  1  1 [2984]: mask 0xfff000000000fff000 set
cpu_bind=MASK - jwb0001, task  2  2 [2987]: mask 0xfff000000000fff000000 set
cpu_bind=MASK - jwb0001, task  3  3 [2991]: mask 0xfff000000000fff000000000 set
cpu_bind=MASK - jwb0001, task  4  4 [2994]: mask 0xfff000000000fff set
cpu_bind=MASK - jwb0001, task  5  5 [2997]: mask 0xfff000000000fff000 set
cpu_bind=MASK - jwb0001, task  6  6 [3000]: mask 0xfff000000000fff000000 set
cpu_bind=MASK - jwb0001, task  7  7 [3002]: mask 0xfff000000000fff000000000 set
cpu_bind=MASK - jwb0001, task  8  8 [3005]: mask 0xfff000000000fff set
MPI Rank: 0;CUDA_VIS_DEV: 1;pid 3146's current affinity list: 6;Last CPU core: 6
MPI Rank: 0;CUDA_VIS_DEV: ;pid 3112's current affinity list: 0-5,7-11,48-55,57-59;Last CPU core: 2
MPI Rank: 1;CUDA_VIS_DEV: 0;pid 3147's current affinity list: 18;Last CPU core: 18
MPI Rank: 1;CUDA_VIS_DEV: ;pid 3120's current affinity list: 12-17,19-23,60-71;Last CPU core: 14
MPI Rank: 2;CUDA_VIS_DEV: 3;pid 3151's current affinity list: 30;Last CPU core: 30
MPI Rank: 2;CUDA_VIS_DEV: ;pid 3122's current affinity list: 24-29,31-35,72-83;Last CPU core: 28
MPI Rank: 3;CUDA_VIS_DEV: 2;pid 3158's current affinity list: 42;Last CPU core: 42
MPI Rank: 3;CUDA_VIS_DEV: ;pid 3125's current affinity list: 36-41,43-47,84-95;Last CPU core: 88
MPI Rank: 4;CUDA_VIS_DEV: 1;pid 3150's current affinity list: 6;Last CPU core: 6
MPI Rank: 4;CUDA_VIS_DEV: ;pid 3123's current affinity list: 0-5,7-11,48-55,57-59;Last CPU core: 3
MPI Rank: 5;CUDA_VIS_DEV: 0;pid 3148's current affinity list: 18;Last CPU core: 18
MPI Rank: 5;CUDA_VIS_DEV: ;pid 3121's current affinity list: 12-17,19-23,60-71;Last CPU core: 65
MPI Rank: 6;CUDA_VIS_DEV: 3;pid 3154's current affinity list: 30;Last CPU core: 30
MPI Rank: 6;CUDA_VIS_DEV: ;pid 3124's current affinity list: 24-29,31-35,72-83;Last CPU core: 72
MPI Rank: 7;CUDA_VIS_DEV: 2;pid 3155's current affinity list: 42;Last CPU core: 42
MPI Rank: 7;CUDA_VIS_DEV: ;pid 3126's current affinity list: 36-41,43-47,84-95;Last CPU core: 94
MPI Rank: 8;CUDA_VIS_DEV: 1;pid 3152's current affinity list: 6;Last CPU core: 6
MPI Rank: 8;CUDA_VIS_DEV: ;pid 3127's current affinity list: 0-5,7-11,48-55,57-59;Last CPU core: 50