Tags: concurrency, opencl, objective-c-blocks, grand-central-dispatch, libdispatch

Concurrent programming in OpenCL vs Grand Central Dispatch


With the introduction of OpenCL 2.0, OpenCL seems to have gained many of the features of Grand Central Dispatch (GCD), such as Clang/Apple-style blocks and queues. Looking at their respective feature sets, I am wondering whether OpenCL can do everything GCD/libdispatch can do, but with the added ability of directing the computation to the GPU as well as the CPU, or whether GCD has more to offer that stands apart from OpenCL.

Specifically my questions are:

  1. What differentiates the concurrency features of GCD and OpenCL?

  2. If there is value in using them together (assuming GCD offers added features), can C blocks be routed to either GCD queues or OpenCL queues? And if targeting the CPU, would there be a reason to go through OpenCL rather than running directly?

  3. Does OpenCL 2.0 offer GCD style load balancing that can fill threads spanning both the CPU and GPU?


Solution

  • In its current form, OpenCL is able to address both data-parallel and task-parallel problems, as evidenced by the different OpenCL API primitives to enqueue work: clEnqueueNDRangeKernel for data-parallel kernels over an N-dimensional index space, and clEnqueueTask for single work-item, task-style execution.
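
    For illustration, here is a minimal host-side sketch of the two enqueue styles (the queue and the two kernels are assumed to have been created elsewhere; error checking is omitted):

        #include <OpenCL/opencl.h>   /* on OS X; <CL/cl.h> elsewhere */

        /* Sketch only: 'queue', 'data_parallel_kernel' and 'task_kernel'
           are assumed to exist already; error checking is omitted. */
        void enqueue_examples(cl_command_queue queue,
                              cl_kernel data_parallel_kernel,
                              cl_kernel task_kernel)
        {
            /* Data-parallel: one work-item per element of a 1D grid. */
            size_t global_size = 1024 * 1024;
            clEnqueueNDRangeKernel(queue, data_parallel_kernel,
                                   1, NULL, &global_size, NULL, 0, NULL, NULL);

            /* Task-parallel: a single work-item, comparable to one GCD task. */
            clEnqueueTask(queue, task_kernel, 0, NULL, NULL);

            clFinish(queue);
        }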

    So while OpenCL clearly has its origins in data-parallel processing on GPUs and is still best suited to processing data that can somehow be mapped onto a one-, two- or three-dimensional grid, task-oriented processing is also possible. The benefits become more apparent once one starts targeting heterogeneous systems with multiple CPUs, GPUs, DSPs and accelerators, as OpenCL can target all those devices with one formalism.

    GCD, on the other hand, provides a convenient library that relieves the developer of much of the burden of managing and scheduling tasks by building on the concept of queues (of different types and priorities). Using GCD can therefore result in less error-prone and more compact code on symmetric multiprocessing systems.
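
    For example, a typical GCD pattern (a minimal sketch; the block body is a placeholder for real work) fans tasks out over a global concurrent queue and waits for them with a group:

        #include <dispatch/dispatch.h>
        #include <stdio.h>

        int main(void)
        {
            dispatch_queue_t q =
                dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
            dispatch_group_t group = dispatch_group_create();

            for (int i = 0; i < 8; i++) {
                dispatch_group_async(group, q, ^{
                    /* Placeholder for real work; 'i' is captured by the block. */
                    printf("task %d done\n", i);
                });
            }

            /* Wait for all tasks in the group to finish. */
            dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
            dispatch_release(group);
            return 0;
        }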

    So while OpenCL and GCD have different origins (besides the fact that they both came from Apple), they both use queues at their foundation to manage the distribution of work items, and they both have a concept of "context" to define data access.

    Since OS X version 10.7 it has been possible to use GCD to dispatch OpenCL kernels (exposed as block-like functions) to devices that support OpenCL, opening the potential for combining the benefits/strengths of both OpenCL and GCD, as sketched below.
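
    A rough sketch of what that integration looks like, assuming a kernel named square compiled by Xcode's openclc (which generates a square_kernel block declaration in mykernel.cl.h); the names and the fallback logic are illustrative only:

        #include <OpenCL/opencl.h>
        #include "mykernel.cl.h"        /* generated by openclc from mykernel.cl */

        void run_square(const float *in, float *out, size_t n)
        {
            /* Ask for a dispatch queue bound to an OpenCL-capable GPU,
               falling back to the CPU device if no GPU is available. */
            dispatch_queue_t dq = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
            if (!dq)
                dq = gcl_create_dispatch_queue(CL_DEVICE_TYPE_CPU, NULL);

            void *d_in  = gcl_malloc(sizeof(float) * n, (void *)in,
                                     CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR);
            void *d_out = gcl_malloc(sizeof(float) * n, NULL, CL_MEM_WRITE_ONLY);

            dispatch_sync(dq, ^{
                cl_ndrange range = {
                    1,          /* work_dim            */
                    { 0 },      /* global_work_offset  */
                    { n },      /* global_work_size    */
                    { 0 }       /* local_work_size: 0 lets the runtime choose */
                };
                /* The kernel is invoked like a block. */
                square_kernel(&range, (cl_float *)d_in, (cl_float *)d_out);
                gcl_memcpy(out, d_out, sizeof(float) * n);
            });

            gcl_free(d_in);
            gcl_free(d_out);
            dispatch_release(dq);
        }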

    Here is an attempt at some answers/insights w.r.t. your specific questions:

    1 - What differentiates the concurrency features of GCD and OpenCL?

    As pointed out by @Dithermaster and @sharpneli, GCD originally targeted task-oriented (symmetric multi-) processing while OpenCL initially was meant for data-parallel processing on heterogeneous architectures.

    One main difference between OpenCL and GCD is in the queuing mechanism. For example, while both OpenCL and GCD support synchronous and asynchronous execution, GCD offers several priority levels for its global concurrent queues (high, default, low and background). The OpenCL runtime has no equivalent (it supports out-of-order command queues, but it is not defined which enqueued commands the runtime will execute first).
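
    For instance (a sketch; the block bodies are placeholders), work can be steered to the global queues by priority:

        #include <dispatch/dispatch.h>

        void submit_by_priority(void)
        {
            dispatch_queue_t high =
                dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
            dispatch_queue_t low =
                dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0);

            dispatch_async(high, ^{ /* latency-sensitive work */ });
            dispatch_async(low,  ^{ /* background bookkeeping  */ });
        }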

    The GCD documentation also indicates that a task is more lightweight than a traditional thread, making it possible to spawn a GCD task with far fewer instructions than a thread.

    Another difference is in the memory consistency model used. OpenCL uses a relaxed consistency model for kernels, with distinct global, local, private and constant address spaces. GCD does not have this.
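
    To illustrate, an OpenCL C sketch (the kernel and argument names are made up) in which the address spaces appear explicitly:

        // OpenCL C: each buffer/variable lives in an explicit address space.
        __kernel void blur_row(__global const float *src,   // device-wide memory
                               __global float *dst,
                               __constant float *weights,   // read-only, cached
                               __local float *tile)         // shared per work-group
        {
            __private int gid = get_global_id(0);           // per work-item
            tile[get_local_id(0)] = src[gid];
            barrier(CLK_LOCAL_MEM_FENCE);                   // local consistency point
            dst[gid] = tile[get_local_id(0)] * weights[0];
        }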

    On the other hand, OpenCL has vector data types and vector intrinsics that allow one to tap directly into the SIMD potential of an architecture without relying on the compiler. On some architectures this is beneficial, while others (like MIC) recommend not vectorizing manually.
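
    For example (an OpenCL C sketch), a float4 addition maps one work-item directly onto four SIMD lanes:

        // OpenCL C: explicit 4-wide vector arithmetic.
        __kernel void add4(__global const float4 *a,
                           __global const float4 *b,
                           __global float4 *c)
        {
            int i = get_global_id(0);
            c[i] = a[i] + b[i];   // one vector add per work-item
        }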

    Finally (not really a concurrency feature, though) OpenCL has built-in functions to read and write image types, essentially giving you direct access to texture memory. This can often be used to obtain significant speedups, even for algorithms unrelated to image processing.
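
    A kernel-side sketch of such an image access (illustrative only):

        // OpenCL C: sampling an image (texture memory) with hardware interpolation.
        __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                                   CLK_ADDRESS_CLAMP_TO_EDGE   |
                                   CLK_FILTER_LINEAR;

        __kernel void copy_image(__read_only image2d_t src,
                                 __write_only image2d_t dst)
        {
            int2 pos = (int2)(get_global_id(0), get_global_id(1));
            float4 px = read_imagef(src, smp, convert_float2(pos) + 0.5f);
            write_imagef(dst, pos, px);
        }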

    2 - If there is value in using them together (assuming GCD offers added features), can C blocks be routed to either GCD queues or OpenCL queues? And if targeting the CPU, would there be a reason to go through OpenCL rather than running directly?

    By using GCD and OpenCL together you are able to address any device that supports OpenCL, so you can exploit the potentially heterogeneous nature of your platform while still benefiting from the higher-level mechanisms that GCD provides to make multi-threading easier. Writing everything against the plain OpenCL host C API (or even the C++ wrapper API) would likely result in slightly more code.

    Additionally, Apple's GCD/OpenCL integration provides primitives like gcl_get_kernel_block_workgroup_info, which can recommend an optimal work-group size for your kernel.
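
    A sketch of how that query could be used, reusing the hypothetical dq, square_kernel, buffers and size n from the earlier sketch:

        /* Inside a block dispatched to the gcl queue from the earlier sketch: */
        dispatch_sync(dq, ^{
            size_t wgs;   /* work-group size suggested for this device */
            gcl_get_kernel_block_workgroup_info(square_kernel,
                                                CL_KERNEL_WORK_GROUP_SIZE,
                                                sizeof(wgs), &wgs, NULL);

            cl_ndrange range = {
                1,           /* work_dim           */
                { 0 },       /* global_work_offset */
                { n },       /* global_work_size   */
                { wgs }      /* local_work_size    */
            };
            square_kernel(&range, (cl_float *)d_in, (cl_float *)d_out);
        });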

    However, in my understanding it is not possible to route arbitrary C blocks to either kind of queue interchangeably. Ordinary C blocks can only go to regular (non-OpenCL) GCD queues, and OpenCL kernels can only be dispatched (from the host side) to queues created for devices that support OpenCL. From the device side (i.e. from within an OpenCL kernel), a block can only be enqueued to the same device.
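
    For completeness, a device-side sketch of OpenCL 2.0's block-based nested enqueue (assuming an OpenCL 2.0 capable device and driver; kernel names are made up):

        // OpenCL C 2.0: a kernel enqueues a child block to the same device's
        // default queue; the block syntax mirrors Clang/Apple blocks.
        __kernel void parent(__global int *data, int n)
        {
            if (get_global_id(0) == 0) {
                enqueue_kernel(get_default_queue(),
                               CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                               ndrange_1D(n),
                               ^{ data[get_global_id(0)] *= 2; });
            }
        }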

    3 - Does OpenCL 2.0 offer GCD style load balancing that can fill threads spanning both the CPU and GPU?

    No, OpenCL 2.0 does not really define how to do load balancing, neither from the point of view of the host nor from that of the device.

    However, on the host side one could easily split up a computation and run part of it on, say, the CPU and another part on one or more GPUs. One way to do this is by playing with the global work sizes and global work offsets, and only copying the minimum data needed by each device. Autotuning mechanisms can then be used to figure out the best load balancing between the devices used.
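
    A host-side sketch of such a static split (assuming one command queue per device in a shared context and a 1D kernel; the 60/40 ratio is arbitrary and would normally come from autotuning):

        /* Sketch: split a 1D range between a GPU queue and a CPU queue.
           'kernel' is assumed to be built for a context containing both
           devices, with its arguments (and any per-device data copies)
           set up elsewhere; error checking is omitted. */
        void split_launch(cl_command_queue gpu_q, cl_command_queue cpu_q,
                          cl_kernel kernel, size_t total)
        {
            size_t gpu_share = (total * 6) / 10;   /* arbitrary 60/40 split */
            size_t cpu_share = total - gpu_share;

            size_t gpu_offset = 0;
            size_t cpu_offset = gpu_share;

            clEnqueueNDRangeKernel(gpu_q, kernel, 1, &gpu_offset, &gpu_share,
                                   NULL, 0, NULL, NULL);
            clEnqueueNDRangeKernel(cpu_q, kernel, 1, &cpu_offset, &cpu_share,
                                   NULL, 0, NULL, NULL);

            clFinish(gpu_q);
            clFinish(cpu_q);
        }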

    Once OpenCL 2.0 drivers become available, the newly introduced pipes, dynamic parallelism and shared virtual memory will give more possibilities for efficiently partitioning work between devices. It is not clear if and how these features will become available through GCD.
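
    As a glimpse of the shared virtual memory part, a host-side sketch against the OpenCL 2.0 API (the context, queue and kernel are assumed to exist; error checking is omitted):

        /* Sketch: coarse-grained SVM lets host and devices share one pointer. */
        void svm_example(cl_context ctx, cl_command_queue q, cl_kernel k, size_t n)
        {
            float *buf = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                             n * sizeof(float), 0);

            /* Map for host writes, fill, then unmap before kernel use. */
            clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, buf, n * sizeof(float),
                            0, NULL, NULL);
            for (size_t i = 0; i < n; i++) buf[i] = (float)i;
            clEnqueueSVMUnmap(q, buf, 0, NULL, NULL);

            clSetKernelArgSVMPointer(k, 0, buf);
            clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
            clFinish(q);

            clSVMFree(ctx, buf);
        }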