c++, oop, asynchronous, openmp, gpgpu

What is the correct way to use OpenMP Target Enter/Exit/Update for unstructured, asynchronous device-side computations?


Objective

I want to use OpenMP target in C++ in the way that I am currently using CUDA:

  1. Allocate any number of device-side arrays (mirror of host array is fine), initialize if needed.
  2. Execute an arbitrary series of device-side operations using these arrays without copying to host.
  3. Once finished, copy the desired array (e.g., output) to host.

In the end, I want to be able to use this code as a library via e.g. PyBind11, but that is not so important now. To do this, my thinking would be:

Create a class containing this array and whatever metadata and utility methods are needed. Then add #pragma omp target enter/exit/update statements to methods like the constructor, destructor, to_host and set_data, for example.

Then, write functions that take these array objects as input and execute e.g. #pragma omp target parallel for operations on them.

Then simply call said functions as desired, and once done, call my to_host method and use my data on host.

Problem

I run into several issues doing this. First, I am not completely sure what the correct way to use target directives after enter data is, but the IBM documentation suggests one simply uses map clauses as normal, and community docs suggest that using enter and exit in methods is fine.

So what I end up with for my asynchronous operation on arrays is, for example,

void async_op(AsyncArray2D result, AsyncArray2D a, AsyncArray2D b) {
    #pragma omp target map(to: a.data_ptr[:a.size], b.data_ptr[:b.size], result.data_ptr[:result.size])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < result.size; i++) {
        result.data_ptr[i] = a.data_ptr[i] * b.data_ptr[i];
        result.data_ptr[i] -= 1 / b.data_ptr[i];
        result.data_ptr[i] += b.data_ptr[i] / (a.data_ptr[i] + i);
    }
}

NB: I know that I would need to use nowait here for asynchronous operation, but this is not my issue at the moment.

The problem is that I only get my array back on the host if I put #pragma omp target update from(result.data_ptr[:result.size]) in this same function. I can't invoke it via a method, and I can't put it in e.g. main(). If I do, the copying does not work (I just get zeros as output). I can of course also use to/from or implicit mapping, but as far as I can tell that just means the copying is done automatically, and my update still does nothing.

Thus, I'm not even sure my enter data statement does anything, because the behaviour is almost indistinguishable from not having such a statement in the first place (except maybe that the memory transfer is more efficient).

I am guessing the underlying issue is that OpenMP is somehow limited to a given context, but I need to understand what my options are and what the exact problem is in order to devise a solution that will work for my case. Can I only use methods within the same class to do asynchronous operations, or something like that?

Code

Here is a not-quite-minimal example. I could compile it with g++-12 -O2 -fopenmp -fno-stack-protector -fcf-protection=none -foffload=nvptx-none example.cpp -o example, but I suppose results will vary depending on the system.

#include <cstring>
#include <iostream>
#include <omp.h>


class AsyncArray2D {
public:
    float* data_ptr;
    size_t size;
    bool is_on_device;
    int shape[2], strides[2];

    AsyncArray2D(size_t rows, size_t cols) {
        size = rows * cols;
        shape[0] = rows;
        shape[1] = cols;
        strides[0] = cols;
        strides[1] = 0;
        data_ptr = new float[size];
        #pragma omp target enter data map(alloc: data_ptr[:size])
    }

    void to_host() {
        // This method seems to accomplish nothing
        #pragma omp target update from(data_ptr[0:size])
    }

    void set_data(float * buf_ptr) {
        std::memcpy(data_ptr, buf_ptr, size * sizeof(float));
        // unclear if this accomplishes anything
        #pragma omp target update to(data_ptr[:size])
    }

    ~AsyncArray2D() {
        #pragma omp target exit data map(delete: data_ptr[:size])
    }
};

void async_op(AsyncArray2D result, AsyncArray2D a, AsyncArray2D b) {
    #pragma omp target map(to: a.data_ptr[:a.size], b.data_ptr[:b.size], result.data_ptr[:result.size])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < result.size; i++) {
        result.data_ptr[i] = a.data_ptr[i] * b.data_ptr[i];
        result.data_ptr[i] -= 1 / b.data_ptr[i];
        result.data_ptr[i] += b.data_ptr[i] / (a.data_ptr[i] + i);
    }
    // uncomment the below line and we get the correct result.
    // #pragma omp target update from(result.data_ptr[:result.size])
}

int main() {
    size_t dim_size = 8;
    
    float * buffer_a = new float[dim_size * dim_size];
    float * buffer_b = new float[dim_size * dim_size];
    float * buffer_result = new float[dim_size * dim_size];
    std::fill_n(buffer_a, dim_size * dim_size, 1.2);
    std::fill_n(buffer_b, dim_size * dim_size, 2.7);
    std::fill_n(buffer_result, dim_size * dim_size, 0);
    std::cout << "Created buffer" << std::endl;
    AsyncArray2D a(dim_size, dim_size);
    AsyncArray2D b(dim_size, dim_size);
    AsyncArray2D result(dim_size, dim_size);
    std::cout << "Created arrays" << std::endl;
    a.set_data(buffer_a);
    b.set_data(buffer_b);
    result.set_data(buffer_result);
    std::cout << "Set data" << std::endl;
    // run for a while so that we can see if the GPU is doing anything...
    for (int i = 0; i < 100; i++) {
        async_op(result, a, b);
    }
    std::cout << "Added results" << std::endl;
    result.to_host();
    std::cout << "Moved to host" << std::endl;
    for (int j = 0; j < dim_size; j++) {
        for (int i = 0; i < dim_size; i++) {
            std::cout << result.data_ptr[j * dim_size + i] << ' ' ;
        }
        std::cout << std::endl;
    }
    return 0;
}

So with this code, I get the result

0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0

but if I uncomment the update line, or change the mapping of result to from, I get the expected result:

5.11963 4.0969 3.71338 3.51249 3.38886 3.30511 3.24463 3.1989 
3.16311 3.13434 3.1107 3.09094 3.07418 3.05977 3.04726 3.0363 
3.02661 3.01798 3.01025 3.00329 2.99699 2.99125 2.98601 2.9812 
2.97677 2.97268 2.96889 2.96537 2.9621 2.95903 2.95617 2.95348 
2.95096 2.94858 2.94633 2.94422 2.94221 2.94031 2.93851 2.93679 
2.93516 2.93361 2.93213 2.93072 2.92936 2.92807 2.92683 2.92565 
2.92451 2.92341 2.92236 2.92135 2.92038 2.91945 2.91854 2.91767 
2.91683 2.91602 2.91524 2.91448 2.91375 2.91304 2.91235 2.91169

As far as I can tell, the to_host method never does anything; the pragma simply has no effect.

Does anyone have guidance on how I can proceed with this? I would rather not switch to another framework, as I want something as general as possible, and I don't want to be locked into something like Intel's ecosystem or use anything else that would complicate code distribution too much.

EDIT: Just to add, I tested making async_op a method of AsyncArray2D invoked by the result array, and this works as expected (to_host does what it should). However, having to make all operations into methods is a bit limiting for me (and will lead to very ugly code with massive lists of methods...), so it's not my ideal choice...

Thanks!


Solution

  • OK, as indicated by Joachim in a comment, the answer is that I simply made a stupid mistake in this code by not declaring the function parameters as references. This meant that my objects were passed by value rather than by reference, so the destructor of each temporary copy ran target exit data and tore down the device mappings. This code will work:

    void async_op(AsyncArray2D& result, AsyncArray2D& a, AsyncArray2D& b) {
        #pragma omp target map(to: a.data_ptr[:a.size], b.data_ptr[:b.size], result.data_ptr[:result.size])
        #pragma omp teams distribute parallel for
        for (int i = 0; i < result.size; i++) {
            result.data_ptr[i] = a.data_ptr[i] * b.data_ptr[i];
            result.data_ptr[i] -= 1 / b.data_ptr[i];
            result.data_ptr[i] += b.data_ptr[i] / (a.data_ptr[i] + i);
        }
    }
    

    Serves me right for writing this in C++ rather than C, which I know better, to begin with... but anyway, the immediate problem is solved.