c++cudavariadic-templatesgpu-cooperative-groups

Can I launch a cooperative kernel without passing an array of pointers?


The CUDA runtime API allows us to launch kernels using the variable-number-of-arguments triple-chevron syntax:

my_kernel<<<grid_dims, block_dims, shared_mem_size>>>(
    first_arg, second_arg, and_as_many, as_we, want_to, etc, etc);

but as regards "collaborative" kernels, the CUDA Programming Guide says (section C.3):

To enable grid synchronization, when launching the kernel it is necessary to use, instead of the <<<...>>> execution configuration syntax, the cuLaunchCooperativeKernel CUDA runtime launch API:

cudaLaunchCooperativeKernel(
  const T *func,
  dim3 gridDim,
  dim3 blockDim,
  void **args,
  size_t sharedMem = 0,
  cudaStream_t stream = 0
)      

(or the CUDA driver equivalent).

I would rather not have to write my own wrapper code for building an array of pointers... is there really no facility in the runtime API to avoid that?


Solution

  • We can use something like the following workaround (requires --std=c++11 or a more recent C++ language standard):

    namespace detail {
    
    template <typename F, typename... Args>
    void for_each_argument_address(F f, Args&&... args) {
        [](...){}((void)(f( (void*) &std::forward<Args>(args) ), 0)...);
    }
    
    } // namespace detail
    
    template<typename KernelFunction, typename... KernelParameters>
    inline void cooperative_launch(
        const KernelFunction&       kernel_function,
        stream::id_t                stream_id,
        launch_configuration_t      launch_configuration,
        KernelParameters...         parameters)
    {
        void* arguments_ptrs[sizeof...(KernelParameters)];
        auto arg_index = 0;
        detail::for_each_argument_address(
            [&](void * x) {arguments_ptrs[arg_index++] = x;},
            parameters...);
        cudaLaunchCooperativeKernel<KernelFunction>(
            &kernel_function,
            launch_configuration.grid_dimensions,
            launch_configuration.block_dimensions,
            arguments_ptrs,
            launch_configuration.dynamic_shared_memory_size,
            stream_id);
    }
    

    Note: This uses Sean Parent's classic for_each_arg() one-liner. See also this post about it at FluentCPP.