How can I correctly use unified memory on an OpenCL device without copying data?
OpenCL defines CL_DEVICE_HOST_UNIFIED_MEMORY as:
Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise.
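For context, this property can be queried through the C++ bindings with a snippet like the following (a minimal sketch; the platform/device selection mirrors the full program below, and the query requires an installed OpenCL runtime and device):

```cpp
#include <CL/opencl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  std::vector<cl::Device> devices;
  platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
  cl::Device device = devices[0];

  // CL_DEVICE_HOST_UNIFIED_MEMORY reports whether the host and the
  // device share a single memory subsystem. (The property is marked
  // deprecated as of OpenCL 2.0, but implementations still answer it.)
  cl_bool unified = device.getInfo<CL_DEVICE_HOST_UNIFIED_MEMORY>();
  std::cout << "Unified memory: " << (unified ? "yes" : "no") << "\n";
}
```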
The ARM Mali docs suggest that, to use unified memory, the flag CL_MEM_ALLOC_HOST_PTR
should be used when creating a buffer. The flag is described as:
This is a hint to the driver indicating that the buffer is accessed on the host side. To use the buffer on the application processor side, you must map this buffer and write the data into it. This is the only method that does not involve copying data. If you must fill in an image that is processed by the GPU, this is the best way to avoid a copy.
However, there are no examples of how to use buffers created with the CL_MEM_ALLOC_HOST_PTR flag, and the example I wrote does not seem to do it properly.
Consider the following code snippet to use such buffers:
// Create Buffers
constexpr size_t n_bytes = sizeof(int) * SZ_ARR;
cl::Buffer buffer_A(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
n_bytes);
cl_int error{0};
int* A = static_cast<int*>(
queue.enqueueMapBuffer(buffer_A, CL_FALSE, CL_MAP_WRITE, 0, n_bytes,
nullptr, nullptr, &error));
gpuErrchk(error);
for (size_t i = 0; i < SZ_ARR; ++i) {
A[i] = i;
}
gpuErrchk(queue.enqueueUnmapMemObject(buffer_A, A));
cl::Kernel add(program, "add_and_print");
add.setArg(0, buffer_A);
gpuErrchk(queue.enqueueNDRangeKernel(add, cl::NullRange,
cl::NDRange(SZ_ARR), cl::NullRange));
queue.finish();
I am creating the buffer buffer_A to contain ten ints. I expect to write these ten ints after mapping the buffer with enqueueMapBuffer and then unmapping it with enqueueUnmapMemObject. The kernel add_and_print adds 1 to each array element and prints the resulting value. In particular, the kernel is:
const std::string kernel_code =
" void kernel add_and_print(global int* A) {"
" int i = get_global_id(0);"
" A[i] = A[i] + 1;"
" printf(\"%d\", A[i]);"
" }";
However, the program prints 1 for each array element.
How can I use a unified buffer properly?
For reference, the full program to reproduce the problem is below:
#include <CL/opencl.hpp>
#include <iostream>
#define gpuErrchk(ans) \
{ gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cl_int code, const char* file, int line,
bool abort = true) {
if (code != CL_SUCCESS) {
fprintf(stderr, "GPUassert, error code: %d %s %d\n", code, file, line);
if (abort) exit(code);
}
}
constexpr size_t SZ_ARR = 10;
cl::Device getDevice() {
std::vector<cl::Platform> all_platforms;
gpuErrchk(cl::Platform::get(&all_platforms));
cl::Platform default_platform = all_platforms[0];
std::vector<cl::Device> all_devices;
gpuErrchk(default_platform.getDevices(CL_DEVICE_TYPE_GPU, &all_devices));
cl::Device default_device = all_devices[0];
std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>()
<< "\n";
return default_device;
}
cl::Program buildProgram(cl::Context& context, cl::Device& device) {
const std::string kernel_code =
" void kernel add_and_print(global int* A) {"
" int i = get_global_id(0);"
" A[i] = A[i] + 1;"
" printf(\"%d\", A[i]);"
" }";
cl::Program::Sources sources{{kernel_code.c_str(), kernel_code.length()}};
cl::Program program(context, sources);
if (program.build({device}) != CL_SUCCESS) {
std::cout << "Error building: "
<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
<< std::endl;
exit(1);
}
return program;
}
int main() {
// Prologue: Get device, context, and build program
cl::Device default_device = getDevice();
cl::Context context({default_device});
cl::Program program = buildProgram(context, default_device);
cl::CommandQueue queue(context, default_device);
// Create Buffers
constexpr size_t n_bytes = sizeof(int) * SZ_ARR;
cl::Buffer buffer_A(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
n_bytes);
cl_int error{0};
int* A = static_cast<int*>(
queue.enqueueMapBuffer(buffer_A, CL_FALSE, CL_MAP_WRITE, 0, n_bytes,
nullptr, nullptr, &error));
gpuErrchk(error);
for (size_t i = 0; i < SZ_ARR; ++i) {
A[i] = i;
}
gpuErrchk(queue.enqueueUnmapMemObject(buffer_A, A));
cl::Kernel add(program, "add_and_print");
add.setArg(0, buffer_A);
gpuErrchk(queue.enqueueNDRangeKernel(add, cl::NullRange,
cl::NDRange(SZ_ARR), cl::NullRange));
queue.finish();
}
This is indeed the correct way to use OpenCL unified memory. There is only a minor mistake in the program: it does not wait for enqueueMapBuffer to complete before writing through the mapped pointer. The second argument (blocking_map) should be changed to CL_TRUE, so that the call blocks until the region is mapped:
int* A = static_cast<int*>(
queue.enqueueMapBuffer(buffer_A, CL_TRUE, CL_MAP_WRITE, 0, n_bytes,
nullptr, nullptr, &error));
This should produce the following expected output from the OpenCL kernel:
$ clang++-15 -Iexternal/OpenCL-CLHPP/include -Iexternal/OpenCL-Headers/ -DCL_HPP_TARGET_OPENCL_VERSION=300 -DCL_TARGET_OPENCL_VERSION=300 -lOpenCL opencl_unified_memory.cpp -o opencl_um
$ ./opencl_um
Using device: NVIDIA T1200 Laptop GPU
12345678910
In a typical OpenCL program with many device kernels, application code may enqueue multiple non-blocking map operations without waiting for the OpenCL runtime to complete them. The application thread can then continue doing meaningful work and only later return to the mapped buffer, using the event argument to wait until the mapping has finished. Note that synchronization, whether introduced via blocking map calls or via events, can reduce host<->device throughput, so it should ideally be done only toward the end of a compute cycle on the OpenCL device.
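For illustration, the non-blocking variant of the map in the program above could look like this sketch (it reuses queue, buffer_A, n_bytes, SZ_ARR, and gpuErrchk from the question; the event name is mine, and it needs an actual OpenCL device to run):

```cpp
// Non-blocking map: the call returns immediately, and map_event
// signals when the mapping is complete.
cl::Event map_event;
cl_int error{0};
int* A = static_cast<int*>(
    queue.enqueueMapBuffer(buffer_A, CL_FALSE, CL_MAP_WRITE, 0, n_bytes,
                           nullptr, &map_event, &error));
gpuErrchk(error);

// ... do unrelated host-side work here while the map is in flight ...

// Wait for the map to complete before touching the pointer.
gpuErrchk(map_event.wait());
for (size_t i = 0; i < SZ_ARR; ++i) {
  A[i] = i;
}
gpuErrchk(queue.enqueueUnmapMemObject(buffer_A, A));
```

Writing through A before map_event has completed is exactly the race the blocking CL_TRUE fix avoids; the event form just lets you overlap other work with the map instead of blocking immediately.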