The function signature of my kernel is as follows:
template<size_t S, typename Field, typename Type1, typename Type2>
void kernel(const Type1 arg1, const Type2 arg2, Field *results) {
// S is known at compile time
// Field might be float or double
// Type1 is an object holding data and also methods
// Type2 is an object holding data and also methods
// The computation starts here
}
I know that it is possible to use a subset of the features of C++ to write the kernel, using an extension to AMD's OpenCL implementation, but the resulting code is restricted to run on AMD cards only.
The standard OpenCL specification for versions prior to 2.0 constrains the programmer to C99 for writing kernels, and I believe that versions 2.1 and 2.2 are not widely available in Linux distros yet. However, I found here that Boost::compute allows, to some extent, the use of a subset of C++ features when writing kernels. It is not clear, though, whether it is possible to implement a kernel signature like the one in the snippet above using Boost::compute. To what extent is it possible to implement such a kernel? Code examples would be greatly appreciated.
TL;DR: yes and no. It is indeed possible, to some extent, to write templated kernels, but those aren't nearly as powerful as their CUDA counterparts.
I know that it is possible to use a subset of the features of C++ to write the kernel, using an extension to AMD's OpenCL implementation, but the resulting code is restricted to run on AMD cards only.
It isn't restricted to running on AMD cards only. It is restricted to being compiled with AMD's OpenCL implementation only. For example, it should run on Intel CPUs just fine, as long as it was compiled with AMD's implementation.
I found here that Boost::compute allows, to some extent, the use of a subset of C++ features when writing kernels. It is not clear, though, whether it is possible to implement a kernel signature like the one in the snippet above using Boost::compute.
Boost.Compute is essentially a fancy abstraction layer above the OpenCL C API to make it more palatable and less tedious to work with, but it still gives you full access to the underlying C API. This means that if something is feasible from the C API, it should in theory also be feasible from Boost.Compute.
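For instance, every Boost.Compute wrapper exposes its underlying handle through a get() member, so you can always drop down to the raw C API when the abstraction falls short. A minimal sketch, assuming only a working OpenCL platform:

#include <boost/compute/core.hpp>

namespace compute = boost::compute;

int main() {
    auto device = compute::system::default_device();
    auto context = compute::context { device };

    // The wrappers are thin: the raw OpenCL C handles are one get() away,
    // so anything the C API can do remains reachable.
    cl_device_id raw_device = device.get();
    cl_context raw_context = context.get();

    // From here you could call any clXxx() function directly,
    // e.g. clGetDeviceInfo(raw_device, ...).
    (void)raw_device;
    (void)raw_context;
}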
Since OpenCL code is compiled at runtime, in a separate pass, you won't be able to automatically do template instantiation the way CUDA does it at compile time. The CUDA compiler sees both host and device code and can do proper template instantiation across the entire call graph, as if it were a single translation unit. This is impossible in OpenCL, by design. As a consequence:
1. You will have to manually instantiate all the template instantiations you need, mangle their names, and dispatch to the proper instantiation yourself.
2. All types used in template instantiations must be defined in OpenCL code too.
These restrictions make OpenCL templated kernels not entirely useless, but also not very practical compared to CUDA ones. Their main purpose is to avoid code duplication.
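To give an idea of what the dispatch side might look like, here is a minimal host-side sketch. The naming scheme (kernel_<size>_<field>_thing1_thing2) is hypothetical and must match whatever mangled names you declare in the OpenCL source, as shown further down:

#include <cstddef>
#include <string>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

// Map runtime parameters onto the mangled entry-point name and fetch
// the matching instantiation from the already-built program.
compute::kernel dispatch(const compute::program& program,
                         std::size_t size, const std::string& field) {
    auto name = "kernel_" + std::to_string(size) + "_" + field
              + "_thing1_thing2";
    // create_kernel() throws if no instantiation with that name was compiled in.
    return program.create_kernel(name);
}

For example, dispatch(program, 1024, "double") would retrieve the kernel_1024_double_thing1_thing2 instantiation declared below.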
Another consequence of this design is that non-type template parameters aren't allowed in kernel template argument lists (at least as far as I know, but I would really like to be wrong on this one!). This means you'll have to lower the non-type template parameter of the kernel template into a non-type template parameter of the type of one of its arguments. In other words, transform something that looks like this:
template<std::size_t Size, typename Thing>
void kernel(Thing t);
Into something like this:
template<typename Size, typename Thing>
void kernel(Size* s, Thing t);
Then distinguish between instantiations by using something similar in spirit to std::integral_constant<std::size_t, 512>
(or any other type that can be templated on an integer constant) as the first argument. The pointer here is just a trick to avoid requiring a host-side definition of the size type (because we don't care about its value on the host).
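As a plain C++ illustration of this lowering (host-side only, nothing OpenCL-specific; all names here are made up for the example):

#include <cstddef>
#include <type_traits>

struct Thing {};

template<typename Size, typename T>
void kernel(Size*, T) {
    // The size comes back as a compile-time constant through the type.
    constexpr std::size_t size = Size::value;
    static_assert(size > 0, "Size is known at compile time");
}

int main() {
    // Each distinct Size type selects a distinct instantiation; the null
    // pointer carries no data, only its type matters.
    kernel(static_cast<std::integral_constant<std::size_t, 512>*>(nullptr), Thing{});
    kernel(static_cast<std::integral_constant<std::size_t, 1024>*>(nullptr), Thing{});
}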
Disclaimer: my system doesn't support OpenCL, so I could not test the below code. It probably requires some tweaking to work as expected. It does compile, however.
#include <cstddef>
#include <vector>
#include <boost/compute/core.hpp>

auto source = R"_cl_source_(
// Type that holds a compile-time size.
template<std::size_t Size>
struct size_constant {
static const std::size_t value = Size;
};
// Those should probably be defined somewhere else since
// the host needs to know about them too.
struct Thing1 {};
struct Thing2 {};
// Primary template, this is where you write your general code.
template<typename Size, typename Field, typename Type1, typename Type2>
kernel void generic_kernel(Size*, const Type1 arg1, const Type2 arg2, Field *results) {
// S is known at compile time
// Field might be float or double
// Type1 is an object holding data and also methods
// Type2 is an object holding data and also methods
// The computation starts here
// for (std::size_t s = 0; s < Size::value; ++s)
// ...
}
// Instantiate the template as many times as needed.
// As you can see, this can very quickly become explosive in number of combinations.
template __attribute__((mangled_name(kernel_512_float_thing1_thing2)))
kernel void generic_kernel(size_constant<512>*, const Thing1, const Thing2, float*);
template __attribute__((mangled_name(kernel_1024_float_thing1_thing2)))
kernel void generic_kernel(size_constant<1024>*, const Thing1, const Thing2, float*);
template __attribute__((mangled_name(kernel_1024_double_thing1_thing2)))
kernel void generic_kernel(size_constant<1024>*, const Thing1, const Thing2, double*);
)_cl_source_";
namespace compute = boost::compute;
auto device = compute::system::default_device();
auto context = compute::context { device };
auto queue = compute::command_queue { context, device };
// Build the program.
auto program = compute::program::build_with_source(source, context, "-x clc++");
// Retrieve the kernel entry points.
auto kernel_512_float_thing1_thing2 = program.create_kernel("kernel_512_float_thing1_thing2");
auto kernel_1024_float_thing1_thing2 = program.create_kernel("kernel_1024_float_thing1_thing2");
auto kernel_1024_double_thing1_thing2 = program.create_kernel("kernel_1024_double_thing1_thing2");
// Now you can call these kernels like any other kernel.
// Remember: the first argument is just a dummy.
kernel_512_float_thing1_thing2.set_arg(0, sizeof(std::nullptr_t), nullptr);
// TODO: Set other arguments (not done in this example)
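// As an illustration (hypothetical: assumes the 512/float instantiation
// and shows only the results buffer; arg1 and arg2 must be set similarly).
auto results = compute::buffer { context, 512 * sizeof(float) };
kernel_512_float_thing1_thing2.set_arg(3, results);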
// Finally submit the kernel to the command queue.
auto global_work_size = 512;
auto local_work_size = 64;
queue.enqueue_1d_range_kernel(kernel_512_float_thing1_thing2, 0, global_work_size, local_work_size);
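To retrieve the output you would then read the buffer back to the host, e.g. (continuing the hypothetical results buffer from above):
// enqueue_read_buffer() blocks until the copy has completed.
std::vector<float> host_results(512);
queue.enqueue_read_buffer(results, 0, 512 * sizeof(float), host_results.data());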
Good luck and feel free to edit this post with your changes so that others may benefit from it!