Tags: c++, task, openmp, numa

How can I realize data local spawning or scheduling of tasks in OpenMP on NUMA CPUs?


I have a simple, self-contained example of a very rudimentary 2-dimensional stencil application that uses OpenMP tasks on dynamically allocated arrays. It reproduces an issue I am having in a larger, non-toy problem.
There are two update steps; in each one, every point of an array receives the sum of three values from another array: the value at the corresponding location plus the values at the upper and lower neighbour locations. The program is executed on a NUMA machine with two nodes, each having 8 cores with 2 hardware threads per core. The array initializations are parallelized, and with the environment variables OMP_PLACES=threads and OMP_PROC_BIND=spread the data is distributed evenly among the nodes' memories. To avoid data races I have set up dependencies so that, for every section, a task of the second update step can only be scheduled once the relevant tasks for the sections from the first update step have executed. The computation is correct, but not NUMA aware. The affinity clause does not seem to be enough to change the scheduling, as it is just a hint. I am also not sure whether using single for the task creation is efficient, but as far as I know it is the only way to make all tasks sibling tasks, and thus the only way to make the dependencies applicable.

Is there a way in OpenMP to parallelize the task creation under these constraints, or to guide the runtime system towards a more NUMA-aware task scheduling? If not, that is also fine; I am just trying to see whether there are options that use OpenMP the way it is intended, rather than trying to break it. I already have a version that only uses worksharing loops. This is for research.

NUMA node 0: PUs {0-7,16-23}
NUMA node 1: PUs {8-15,24-31}

Environment Variables

export OMP_PLACES=threads
export OMP_PROC_BIND=spread

#define _GNU_SOURCE // sched_getcpu(3) is glibc-specific (see the man page)
#include <sched.h>
#include <iostream>
#include <omp.h>
#include <math.h>
#include <stdlib.h> // malloc
#include <vector>
#include <string>

typedef double value_type;
int main(int argc, char * argv[]){

    std::size_t rows = 8192;
    std::size_t cols = 8192;
    std::size_t part_rows = 32;
    std::size_t num_threads = 16;

    std::size_t parts = ceil(float(rows)/part_rows);

    value_type * A = (value_type *) malloc(sizeof(value_type)*rows*cols);
    value_type * B = (value_type *) malloc(sizeof(value_type)*rows*cols);
    value_type * C = (value_type *) malloc(sizeof(value_type)*rows*cols);

// Parallel first-touch initialization: with schedule(static) and the spread
// binding, each thread initializes a contiguous block of rows, so (under a
// first-touch policy) the pages are distributed across both NUMA nodes.
#pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; ++i)
        for(int j = 0; j<cols; ++j){
            A[i*cols+j] = 1;
            B[i*cols+j] = 1;
            C[i*cols+j] = 0;
        }

    // Per-PU counters: [0] counts tasks whose part lies in the first half of
    // the rows (first-touched from NUMA node 0), [1] counts tasks whose part
    // lies in the second half (node 1).
    std::vector<std::vector<std::size_t>> putasks(32, std::vector<std::size_t>(2,0));
    std::cout << std::endl;


#pragma omp parallel num_threads(num_threads)
#pragma omp single
{
        
    for(int part=0; part<parts; part++){
        std::size_t row = part * part_rows;
        std::size_t current_first_loc = row * cols;
        //first index of the upper part in the array
        std::size_t upper_part_first_loc = part != 0 ? (part-1)*part_rows*cols : current_first_loc;
        //first index of the lower part in the array
        std::size_t lower_part_first_loc = part != parts-1 ? (part+1)*part_rows * cols : current_first_loc;
        std::size_t start = row;
        std::size_t end = part == parts-1 ? rows-1 : start+part_rows;
        if(part==0) start = 1;

        #pragma omp task depend(in: A[current_first_loc], A[upper_part_first_loc], A[lower_part_first_loc])\
             depend(out: B[current_first_loc]) affinity(A[current_first_loc], B[current_first_loc])
        {
            if(end <= ceil(rows/2.0))
                putasks[sched_getcpu()][0]++;
            else putasks[sched_getcpu()][1]++;
            for(std::size_t i=start; i<end; ++i){
                for(std::size_t j = 0; j < cols; ++j)
                    B[i*cols+j] += A[i*cols+j] + A[(i-1)*cols+j] + A[(i+1)*cols+j];
            }
        }
    }

    for(int part=0; part<parts; part++){
        std::size_t row = part * part_rows;
        std::size_t current_first_loc = row * cols;
        std::size_t upper_part_first_loc = part != 0 ? (part-1)*part_rows*cols : current_first_loc;
        std::size_t lower_part_first_loc = part != parts-1 ? (part+1)*part_rows * cols : current_first_loc;
        std::size_t start = row;
        std::size_t end = part == parts-1 ? rows-1 : start+part_rows;
        if(part==0) start = 1;

        #pragma omp task depend(in: B[current_first_loc], B[upper_part_first_loc], B[lower_part_first_loc])\
             depend(out: C[current_first_loc]) affinity(B[current_first_loc], C[current_first_loc])
        {
            if(end <= ceil(rows/2.0))
                putasks[sched_getcpu()][0]++;
            else putasks[sched_getcpu()][1]++;
            for(std::size_t i=start; i<end; ++i){
                for(std::size_t j = 0; j < cols; ++j)
                    C[i*cols+j] += B[i*cols+j] + B[(i-1)*cols+j] + B[(i+1)*cols+j];
            }
        }
    }
}

    if(rows <= 16 && cols <= 16){ // only print the result for tiny problem sizes
        for(std::size_t i = 0; i < rows; ++i){
            for(std::size_t j = 0; j < cols; ++j){
                std::cout << C[i*cols+j] << " ";
            }
            std::cout << std::endl;
        }
    }


    for(std::size_t i = 0; i<putasks.size(); ++i){
        if(putasks[i][0]!=0 && putasks[i][1]!=0){
            for(std::size_t node = 0; node < putasks[i].size(); ++node){
                std::cout << "pu: " << i << " worked on ";
                std::cout << putasks[i][node]<< " NODE " << node << " tasks" << std::endl;
            }
            std::cout << std::endl;
        }
    }
   
    return 0;
}

Task Distribution Output Excerpt

pu: 1 worked on 26 NODE 0 tasks
pu: 1 worked on 12 NODE 1 tasks

pu: 2 worked on 27 NODE 0 tasks
pu: 2 worked on 13 NODE 1 tasks

...

pu: 7 worked on 26 NODE 0 tasks
pu: 7 worked on 13 NODE 1 tasks

pu: 8 worked on 10 NODE 0 tasks
pu: 8 worked on 11 NODE 1 tasks

pu: 9 worked on 8 NODE 0 tasks
pu: 9 worked on 14 NODE 1 tasks

...

pu: 15 worked on 8 NODE 0 tasks
pu: 15 worked on 12 NODE 1 tasks

Solution

  • First of all, the state of OpenMP task scheduling on NUMA systems is far from great in practice. It has been the subject of many research projects in the past and there are still ongoing projects working on it. Some research runtimes consider the affinity hint properly and schedule tasks according to the NUMA node of the in/out/inout dependencies. However, AFAIK mainstream runtimes do not do much to schedule tasks well on NUMA systems, especially if you create all the tasks from a single NUMA node. Indeed, AFAIK GOMP (GCC) simply ignores the hint and actually exhibits behaviour that makes it inefficient on NUMA systems (e.g. task creation is temporarily stopped when there are too many pending tasks, and tasks are executed on all NUMA nodes regardless of the source/target NUMA node). IOMP (Clang/ICC) takes locality into account, but AFAIK the scheduling should not be great in your case either. The affinity hint for tasks is not available upstream yet. Thus, GOMP and IOMP will clearly not behave well in your case, as tasks of different steps will often be distributed in a way that produces many remote NUMA node accesses, which are known to be inefficient. In fact, this is critical in your case, as stencils are generally memory bound.

    If you work with IOMP, be aware that its task scheduler tends to execute tasks on the same NUMA node where they were created. Thus, a good solution is to create the tasks in parallel: the tasks can be created by many threads bound to the NUMA nodes. The scheduler will first try to execute a task on the thread that created it; workers on the same NUMA node will then try to steal tasks from threads of that node and, only if there are not enough tasks, from any thread. While this work-stealing strategy works relatively well in practice, there is a huge catch: tasks of different parent tasks cannot share dependencies. This limitation of the current OpenMP specification is a big issue for stencil codes (at least for the ones that create tasks working on different time steps). An alternative solution is to create tasks with dependencies from one thread and then create smaller tasks from within these tasks, but due to the often bad scheduling of the big tasks, this approach is generally inefficient in practice on NUMA systems. In practice, on mainstream runtimes, basic statically scheduled loops behave relatively well on NUMA systems for stencils, although they are clearly sub-optimal for large stencils. This is sad and I hope the situation will improve in the current decade. A rough sketch of the parallel task creation idea is given right below.
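
    As an illustration only, here is a minimal sketch of what such parallel task creation could look like for the code of the question; it reuses rows, cols, parts, part_rows, num_threads and the arrays A, B and C from above. Since tasks created by different threads are not siblings, the per-part depend clauses are dropped and replaced by a coarser barrier between the two update steps, so this is a sketch of the locality idea, not a drop-in replacement:

    #pragma omp parallel num_threads(num_threads)
    {
        // Each thread creates (and, with IOMP, will preferably execute) the tasks
        // for a contiguous block of parts, ideally the same block of rows it
        // first-touched during the initialization.
        std::size_t tid = (std::size_t) omp_get_thread_num();
        std::size_t parts_per_thread = (parts + num_threads - 1) / num_threads;
        std::size_t first = tid * parts_per_thread;
        std::size_t last = first + parts_per_thread > parts ? parts : first + parts_per_thread;

        // First update step: B += stencil(A)
        for(std::size_t part = first; part < last; ++part){
            #pragma omp task firstprivate(part)
            {
                std::size_t start = part == 0 ? 1 : part * part_rows;
                std::size_t end = part == parts-1 ? rows-1 : part * part_rows + part_rows;
                for(std::size_t i = start; i < end; ++i)
                    for(std::size_t j = 0; j < cols; ++j)
                        B[i*cols+j] += A[i*cols+j] + A[(i-1)*cols+j] + A[(i+1)*cols+j];
            }
        }

        // The barrier guarantees that all tasks of the first step are complete,
        // including those created by other threads (a taskwait would not).
        #pragma omp barrier

        // Second update step: C += stencil(B)
        for(std::size_t part = first; part < last; ++part){
            #pragma omp task firstprivate(part)
            {
                std::size_t start = part == 0 ? 1 : part * part_rows;
                std::size_t end = part == parts-1 ? rows-1 : part * part_rows + part_rows;
                for(std::size_t i = start; i < end; ++i)
                    for(std::size_t j = 0; j < cols; ++j)
                        C[i*cols+j] += B[i*cols+j] + B[(i-1)*cols+j] + B[(i+1)*cols+j];
            }
        }
    }

    Whether this actually improves locality depends on the runtime keeping tasks close to the thread that created them (which is what IOMP tends to do) and on the creator threads being bound so that their block of parts matches the pages they first-touched.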

    Be aware that data initialization matters a lot on NUMA systems, as many platforms actually allocate pages on the NUMA node performing the first touch. The initialization therefore has to be parallel (otherwise all the pages could end up on the same NUMA node, causing a saturation of that node during the stencil steps). The default policy is not the same on all platforms, and some can move pages between NUMA nodes depending on their use. You can tweak this behaviour with numactl. You can also fetch very useful information from the hwloc tool. I strongly advise you to manually control the location of all OpenMP threads using OMP_PROC_BIND=true and OMP_PLACES="{0},{1},...,{n}", where the OMP_PLACES string can be generated with hwloc for the actual platform (an example for your machine is shown right below).
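
    For instance, on the machine from the question (node 0: PUs 0-7 and 16-23, node 1: PUs 8-15 and 24-31), binding one OpenMP thread to the first hardware thread of each core could look like the lines below; the exact PU numbers are an assumption derived from the listed topology, so verify them with hwloc (e.g. lstopo) before relying on them:

    export OMP_PROC_BIND=true
    export OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}"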

    For more information, you can read this research paper (disclaimer: I am one of the authors). You can certainly find other similar research papers from the IWOMP conference and the Supercomputing conference too. You could also try research runtimes, though most of them are not designed to be used in production (e.g. KOMP, which is not actively developed anymore; StarPU, which mainly focuses on GPUs and on optimizing the critical path; OmpSs, which is not fully compatible with OpenMP but tries to extend it; PaRSEC, which is mainly designed for linear algebra applications).