Tags: opencl, pyopencl

Does CL_DEVICE_HOST_UNIFIED_MEMORY change how buffers are handled?


I'm learning OpenCL, using PyOpenCL and Numpy. So far I've learned that when passing a Numpy array to a kernel, you need to create a pyopencl.Buffer object that represents an array on the device, and then you enqueue a copy from the Numpy array to the Buffer onto the CommandQueue.
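For reference, the copy-based pattern I'm describing looks roughly like this in PyOpenCL (a minimal sketch; the array size and flags are arbitrary):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

host_array = np.arange(16, dtype=np.float32)

# device-side buffer the same size as the host array
dev_buf = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, size=host_array.nbytes)

# enqueue the host -> device copy on the command queue
cl.enqueue_copy(queue, dev_buf, host_array)
```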

This makes sense when the device is a graphics card with its own memory, but what happens when the device shares memory with the host? For example, the M1 chip in my MacBook Pro has a unified memory architecture, and the device shows host_unified_memory is 1. In the OpenCL 1.2 specification, this device property is listed as CL_DEVICE_HOST_UNIFIED_MEMORY and it says:

Is CL_TRUE if the device and the host have a unified memory subsystem and is CL_FALSE otherwise
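(For reference, I'm checking the flag from PyOpenCL roughly like this:)

```python
import pyopencl as cl

# print the unified-memory flag for every device on the system
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name, device.host_unified_memory)
```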

Does this mean I can change my code to avoid copying buffers on unified memory systems? Or does OpenCL detect that the memory is unified and thus make the enqueued copy a no-op?


Solution

  • I would. I tend to use clEnqueueMapBuffer uniformly on all systems for most cases, though there are tradeoffs. Here's what'll happen (see the PyOpenCL sketch after this list).

    1.a) Ask the driver to allocate the buffer for you via CL_MEM_ALLOC_HOST_PTR. On integrated/unified systems it'll probably create pinned memory shared by GPU and CPU. The driver also deals with the alignment rules and whatnot.

    OR

    1.b) Use a sufficiently aligned pointer with CL_MEM_USE_HOST_PTR. E.g. older Intel GPUs required even-cache-line alignment (128 B); I always use system page alignment, though I'm not sure what your system requires. If you don't need to own the memory, go with 1.a), as it'll be more robust.

    Either approach works and one of two things happens.

    2.a) On systems without unified memory (e.g. discrete parts), the buffer will cause the driver to create a shadow/staging copy for you on the host and upload it when you are done with it, somewhere between the clEnqueueUnmapMemObject call and the clEnqueueNDRangeKernel that uses the buffer. You can be blissfully unaware of this; it'll just work. Of course, very large buffers may not be efficient, since you're keeping an extra copy around.

    2.b) On systems with unified memory, the map operation will be a zero-copy operation, which is pretty cheap. Note: even though it's pinned memory (the GPU is accessing the same physical pages as the CPU), make sure you obey the OpenCL rules and only access the buffer between the enqueue-map and enqueue-unmap calls. The driver can change its mind on you for any number of reasons and create a staging copy as in 2.a). E.g. if you pass a misaligned CL_MEM_USE_HOST_PTR base for the driver to use, it'll make a staging copy behind your back.

    In both cases things will "just work".
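For concreteness, here's a rough PyOpenCL version of approach 1.a) with the map/unmap pattern (an untested sketch; enqueue_map_buffer and the MemoryMap release() call are PyOpenCL's counterparts to clEnqueueMapBuffer and clEnqueueUnmapMemObject, and the buffer size and dtype are just placeholders):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 1024
nbytes = n * np.dtype(np.float32).itemsize

# 1.a) let the driver allocate (possibly pinned) host-accessible memory
buf = cl.Buffer(ctx,
                cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR,
                size=nbytes)

# map the buffer; on unified-memory systems this should be (near) zero-copy
host_view, _ = cl.enqueue_map_buffer(
    queue, buf, cl.map_flags.WRITE, 0, (n,), np.float32)

host_view[:] = np.arange(n, dtype=np.float32)  # touch it only while mapped

# unmap before any kernel uses the buffer
host_view.base.release(queue)
```

On a unified-memory device like the M1, host_view should alias the same physical pages the GPU sees; on a discrete GPU the same code still works, the driver just inserts the staging copy described in 2.a).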