Tags: c++, multithreading, openmp, affinity, numa

Problem of sorting OpenMP threads into NUMA nodes by experiment


I'm attempting to create a std::vector<std::set<int>> with one set per NUMA node, each containing the thread IDs obtained using omp_get_thread_num().

Topo: (image: system topology)

Idea:

  1. Create data which is larger than L3 cache,
  2. perform the first touch using thread 0,
  3. perform multiple experiments to determine the minimum access time of each thread,
  4. extract the threads into nodes based on sorted access times and information about the topology.

Code: (Intel compiler, OpenMP)

    // create data which will be shared by multiple threads
    const auto part_size = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
    const auto size = 2 * part_size;
    auto container = std::unique_ptr<double[]>(new double[size]); // array form so delete[] is used on destruction
    
    // open a parallel section
    auto thread_count = 0;
    auto thread_id_min_duration = std::multimap<double, int>{};
    #ifdef DECIDE_THREAD_COUNT
    #pragma omp parallel num_threads(std::thread::hardware_concurrency())
    #else
    #pragma omp parallel
    #endif
    {
        // perform first touch using thread 0
        const auto thread_id = omp_get_thread_num();
        if (thread_id == 0)
        {
            thread_count = omp_get_num_threads();
            for (auto index = std::size_t{}; index < size; ++index)
            {
                container.get()[index] = static_cast<double>(std::rand() % 10 + 1);
            }
        }
        #pragma omp barrier
        
        // access the data using all threads individually
        #pragma omp for schedule(static, 1)
        for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
        {
            // calculate the minimum access time of this thread
            auto this_thread_min_duration = std::numeric_limits<double>::max();
            for (auto experiment_counter = std::size_t{}; experiment_counter < 250; ++experiment_counter)
            {
                const auto* data = experiment_counter % 2 == 0 ? container.get() : container.get() + part_size;
                const auto start_timestamp = omp_get_wtime();
                for (auto index = std::size_t{}; index < part_size; ++index)
                {
                    volatile auto exceedingly_interesting_value_wink_wink = data[index]; // non-static, so the read happens on every iteration
                }
                const auto end_timestamp = omp_get_wtime();
                const auto duration = end_timestamp - start_timestamp;
                if (duration < this_thread_min_duration)
                {
                    this_thread_min_duration = duration;
                }
            }
            #pragma omp critical
            {
                thread_id_min_duration.insert(std::make_pair(this_thread_min_duration, thread_id));
            }
        }
    } // #pragma omp parallel

Not shown here is code which outputs the minimum access times sorted into the multimap.
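For completeness, here is a minimal sketch of what that output step could look like (illustrative only, not my exact code; it assumes <iostream> is included):

    // illustrative sketch: the multimap is keyed by duration, so iterating it
    // prints the (min access time, thread id) pairs in ascending order of time
    for (const auto& entry : thread_id_min_duration)
    {
        std::cout << "thread " << entry.second
                  << " : min access time = " << entry.first << " s\n";
    }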

Environment and Output

  1. How do OMP_PLACES and OMP_PROC_BIND work?

I am attempting to avoid SMT by using export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=24. However, I'm getting this output:

(image: benchmark output; access times are nearly identical across all threads)

What puzzles me is that I'm seeing the same access times on all threads. Since I'm trying to spread them across the 2 NUMA nodes, I expect to see roughly 12 threads with some access time x and another 12 with access time ~2x.
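To sanity-check the binding itself, a minimal sketch using the OpenMP affinity API (requires an OpenMP 4.0+ runtime; output ordering is arbitrary) can have each thread report the place it was bound to:

    #include <cstdio>
    #include <omp.h>

    int main()
    {
        #pragma omp parallel
        {
            #pragma omp critical
            {
                // omp_get_place_num() returns -1 if the thread is not bound to a place
                std::printf("thread %2d -> place %d of %d (proc_bind policy = %d)\n",
                            omp_get_thread_num(), omp_get_place_num(),
                            omp_get_num_places(), static_cast<int>(omp_get_proc_bind()));
            }
        }
        return 0;
    }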

  2. Why is the above happening?

Additional Information

Even more puzzling are the following environments and their outputs:

  1. export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=26 (output image)

  2. export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=48 (output image)

Any help in understanding this phenomenon would be much appreciated.


Solution

  • After more investigation, I note the following:

    1. work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
    2. memory page migration is a thing on modern NUMA systems.

    Following this, I started using the work-load manager's own thread binding/pinning system, and adapted my benchmark to lock the memory page(s) on which my data lay. Furthermore, giving in to my programmer's paranoia, I ditched the std::unique_ptr for fear that it might perform its own first touch after allocating the memory.

        // create data which will be shared by multiple threads
        const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
        const auto total_size = thread_count * size_per_thread;
        // posix_memalign (from <stdlib.h>) reports failure via its return value and leaves
        // the pointer unspecified on error, so check the return code rather than the pointer
        double* data = nullptr;
        const auto alloc_status = posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), total_size * sizeof(double));
        if (alloc_status != 0 || data == nullptr)
        {
            throw std::runtime_error("could_not_allocate_memory_error");
        }
        
        // perform first touch using thread 0
        #pragma omp parallel num_threads(thread_count)
        {
            if (omp_get_thread_num() == 0)
            {
                #pragma omp simd safelen(8)
                for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
                {
                    data[d_index] = -1.0;
                }
            }
        } // #pragma omp parallel
        mlock(data, total_size * sizeof(double)); // mlock takes a size in bytes; page migration is a real thing...
        
        // open a parallel section
        auto thread_id_avg_latency = std::multimap<double, int>{};
        auto generator = std::mt19937(); // private(generator) below gives each thread its own default-constructed copy, re-seeded per thread
        #pragma omp parallel num_threads(thread_count) private(generator)
        {
            // access the data using all threads individually
            #pragma omp for schedule(static, 1)
            for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
            {
                // seed each thread's generator
                generator.seed(thread_counter + 1);
                
                // calculate the minimum access latency of this thread
                auto this_thread_avg_latency = 0.0;
                const auto experiment_count = 250;
                for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
                {
                    const auto start_timestamp = omp_get_wtime() * 1E+6;
                    for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
                    {
                        const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread-1)(generator);
                        auto& datapoint = data[thread_counter * size_per_thread + index];
                        datapoint += index;
                    }
                    const auto end_timestamp = omp_get_wtime() * 1E+6;
                    this_thread_avg_latency += end_timestamp - start_timestamp;
                }
                this_thread_avg_latency /= experiment_count;
                #pragma omp critical
                {
                    thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
                }
            }
        } // #pragma omp parallel
        std::free(data);
    

    With these changes, I am noticing the difference I expected.

    (image: benchmark output showing distinct local vs. remote access latencies)

    Further notes:

    1. this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
    2. there is no reliable cross-platform way of doing this (it requires kernel APIs; see the sketch after this list),
    3. OpenMP seems to number the threads exactly as hwloc/lstopo, numactl and lscpu do (logical IDs?)
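    Regarding point 2, the sketch below shows the Linux-only route I mean: move_pages(2) (declared in libnuma's <numaif.h>, link with -lnuma) can report which NUMA node a page currently resides on when called with a null target-node list. The helper name numa_node_of_address is just illustrative.

        #include <cstddef>
        #include <cstdint>
        #include <numaif.h>   // move_pages
        #include <unistd.h>   // sysconf

        // returns the NUMA node currently holding the page that contains 'address',
        // or -1 if the query fails (e.g. the page has not been touched yet)
        int numa_node_of_address(void* address)
        {
            const auto page_size = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
            // move_pages expects page-aligned pointers, so round down to the page start
            void* page = reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(address) & ~(page_size - 1));
            int status = -1;
            // with a null 'nodes' array, move_pages only queries and writes the node into 'status'
            if (move_pages(0, 1, &page, nullptr, &status, 0) != 0 || status < 0)
            {
                return -1;
            }
            return status;
        }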

    The most astonishing things are that the difference in latencies is quite small and that memory page migration can still happen, which raises the question: why should we care about first touch and all the rest of the NUMA concerns at all?