How can I correctly use unified memory on an OpenCL device without copying data?
OpenCL defines CL_DEVICE_HOST_UNIFIED_MEMORY as:
Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise.
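For context, this property can be queried through the C++ bindings with a snippet like the following (a minimal sketch; the platform/device selection mirrors the full program below, and the query requires an installed OpenCL runtime and device):

```cpp
#include <CL/opencl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  std::vector<cl::Device> devices;
  platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
  cl::Device device = devices[0];

  // CL_DEVICE_HOST_UNIFIED_MEMORY reports whether the host and the
  // device share a single memory subsystem. (The property is marked
  // deprecated as of OpenCL 2.0, but implementations still answer it.)
  cl_bool unified = device.getInfo<CL_DEVICE_HOST_UNIFIED_MEMORY>();
  std::cout << "Unified memory: " << (unified ? "yes" : "no") << "\n";
}
```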
The ARM Mali docs suggest that, to use unified memory, the flag CL_MEM_ALLOC_HOST_PTR
should be used when creating a buffer. The flag is described as:
This is a hint to the driver indicating that the buffer is accessed on the host side. To use the buffer on the application processor side, you must map this buffer and write the data into it. This is the only method that does not involve copying data. If you must fill in an image that is processed by the GPU, this is the best way to avoid a copy.
However, there are no examples of how to use buffers created with the CL_MEM_ALLOC_HOST_PTR flag, and the example I wrote does not seem to do it properly.
Consider the following code snippet to use such buffers:
// Create Buffers
constexpr size_t n_bytes = sizeof(int) * SZ_ARR;
cl::Buffer buffer_A(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
n_bytes);
cl_int error{0};
int* A = static_cast<int*>(
queue.enqueueMapBuffer(buffer_A, CL_FALSE, CL_MAP_WRITE, 0, n_bytes,
nullptr, nullptr, &error));
gpuErrchk(error);
for (size_t i = 0; i < SZ_ARR; ++i) {
A[i] = i;
}
gpuErrchk(queue.enqueueUnmapMemObject(buffer_A, A));
cl::Kernel add(program, "add_and_print");
add.setArg(0, buffer_A);
gpuErrchk(queue.enqueueNDRangeKernel(add, cl::NullRange,
cl::NDRange(SZ_ARR), cl::NullRange));
queue.finish();
I am creating the buffer buffer_A to contain ten ints. I expect to write these ten ints after mapping the buffer with enqueueMapBuffer and then unmapping it with enqueueUnmapMemObject. The kernel add_and_print adds 1 to each array element and prints the resulting value. In particular, the kernel is:
const std::string kernel_code =
" void kernel add_and_print(global int* A) {"
" int i = get_global_id(0);"
" A[i] = A[i] + 1;"
" printf(\"%d\", A[i]);"
" }";
However, the program prints 1 for each array element.
How can I use a unified buffer properly?
For reference, the full program to reproduce the problem is below:
#include <CL/opencl.hpp>
#include <iostream>
#define gpuErrchk(ans) \
{ gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cl_int code, const char* file, int line,
bool abort = true) {
if (code != CL_SUCCESS) {
fprintf(stderr, "GPUassert, error code: %d %s %d\n", code, file, line);
if (abort) exit(code);
}
}
constexpr size_t SZ_ARR = 10;
cl::Device getDevice() {
std::vector<cl::Platform> all_platforms;
gpuErrchk(cl::Platform::get(&all_platforms));
cl::Platform default_platform = all_platforms[0];
std::vector<cl::Device> all_devices;
gpuErrchk(default_platform.getDevices(CL_DEVICE_TYPE_GPU, &all_devices));
cl::Device default_device = all_devices[0];
std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>()
<< "\n";
return default_device;
}
cl::Program buildProgram(cl::Context& context, cl::Device& device) {
const std::string kernel_code =
" void kernel add_and_print(global int* A) {"
" int i = get_global_id(0);"
" A[i] = A[i] + 1;"
" printf(\"%d\", A[i]);"
" }";
cl::Program::Sources sources{{kernel_code.c_str(), kernel_code.length()}};
cl::Program program(context, sources);
if (program.build({device}) != CL_SUCCESS) {
std::cout << "Error building: "
<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
<< std::endl;
exit(1);
}
return program;
}
int main() {
// Prologue: Get device, context, and build program
cl::Device default_device = getDevice();
cl::Context context({default_device});
cl::Program program = buildProgram(context, default_device);
cl::CommandQueue queue(context, default_device);
// Create Buffers
constexpr size_t n_bytes = sizeof(int) * SZ_ARR;
cl::Buffer buffer_A(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
n_bytes);
cl_int error{0};
int* A = static_cast<int*>(
queue.enqueueMapBuffer(buffer_A, CL_FALSE, CL_MAP_WRITE, 0, n_bytes,
nullptr, nullptr, &error));
gpuErrchk(error);
for (size_t i = 0; i < SZ_ARR; ++i) {
A[i] = i;
}
gpuErrchk(queue.enqueueUnmapMemObject(buffer_A, A));
cl::Kernel add(program, "add_and_print");
add.setArg(0, buffer_A);
gpuErrchk(queue.enqueueNDRangeKernel(add, cl::NullRange,
cl::NDRange(SZ_ARR), cl::NullRange));
queue.finish();
}
This is indeed the correct way to use OpenCL unified memory. There is only a minor mistake in the program: it does not wait for enqueueMapBuffer to complete before writing through the mapped pointer. The second argument (blocking_map) should be changed to CL_TRUE, so that the call blocks until the region is mapped:
int* A = static_cast<int*>(
queue.enqueueMapBuffer(buffer_A, CL_TRUE, CL_MAP_WRITE, 0, n_bytes,
nullptr, nullptr, &error));
This should produce the following expected output from the OpenCL kernel:
$ clang++-15 -Iexternal/OpenCL-CLHPP/include -Iexternal/OpenCL-Headers/ -DCL_HPP_TARGET_OPENCL_VERSION=300 -DCL_TARGET_OPENCL_VERSION=300 -lOpenCL opencl_unified_memory.cpp -o opencl_um
$ ./opencl_um
Using device: NVIDIA T1200 Laptop GPU
12345678910
In a typical OpenCL program with many device kernels, application code may enqueue multiple non-blocking map operations without waiting for the OpenCL runtime to complete them. The application thread can then continue doing meaningful work and only later return to the mapped buffer, using the event argument to wait until the mapping has finished. Note that synchronization, whether introduced via blocking map calls or via events, can reduce host<->device throughput, so it should ideally be done only toward the end of a compute cycle on the OpenCL device.
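For illustration, the non-blocking variant of the map in the program above could look like this sketch (it reuses queue, buffer_A, n_bytes, SZ_ARR, and gpuErrchk from the question; the event name is mine, and it needs an actual OpenCL device to run):

```cpp
// Non-blocking map: the call returns immediately, and map_event
// signals when the mapping is complete.
cl::Event map_event;
cl_int error{0};
int* A = static_cast<int*>(
    queue.enqueueMapBuffer(buffer_A, CL_FALSE, CL_MAP_WRITE, 0, n_bytes,
                           nullptr, &map_event, &error));
gpuErrchk(error);

// ... do unrelated host-side work here while the map is in flight ...

// Wait for the map to complete before touching the pointer.
gpuErrchk(map_event.wait());
for (size_t i = 0; i < SZ_ARR; ++i) {
  A[i] = i;
}
gpuErrchk(queue.enqueueUnmapMemObject(buffer_A, A));
```

Writing through A before map_event has completed is exactly the race the blocking CL_TRUE fix avoids; the event form just lets you overlap other work with the map instead of blocking immediately.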