Suppose I compile the following with NVIDIA CUDA's nvcc compiler:
template<typename T, typename Operator>
__global__ void fooKernel(T t1, T t2) {
    Operator op;
    doSomethingWith(t1, t2);
}

template<typename T>
__device__ __host__ T bar(T t1, T t2) {
    return t1 + t2;
}

template<typename T, typename Operator>
void foo(T t1, T t2) {
    fooKernel<<<2, 2>>>(t1, t2);
}

// explicit instantiation
template decltype(foo<int, bar<int>>) foo<int, bar<int>>;
Now I want my g++-compiled, non-nvcc code to call foo():

...
template<typename T, typename Operator> void foo(T t1, T t2);
foo<int, bar<int>>(123, 456);
...
I have the appropriate (?) instantiation in the .o/.a/.so file I compile with CUDA.
Can I make that happen?
The problem here is that templated code is typically instantiated at the point of use, which doesn't work in this case because foo() contains a kernel launch, which g++ cannot parse. Your approach of explicitly instantiating the template and forward-declaring it for the host compiler is the right one. Here's how to do it. I slightly fixed up your code and split it into three files:
This header, gpu.cuh, contains the templated code for use by gpu.cu. I added some purpose to your foo() function to make sure it works. Note that bar is now a functor (a struct with an operator()) rather than a function template, so it can be passed as a template template parameter.
#pragma once

#include <cuda_runtime.h>

template <typename T>
struct bar {
    __device__ __host__ T operator()(T t1, T t2)
    {
        return t1 + t2;
    }
};

template <template <typename> class Operator, typename T>
__global__ void fooKernel(T t1, T t2, T* t3)
{
    Operator<T> op;
    *t3 = op(t1, t2);
}

template <template <typename> class Operator, typename T>
T foo(T t1, T t2)
{
    T* t3_d;
    T t3_h;
    cudaMalloc(&t3_d, sizeof(*t3_d));
    fooKernel<Operator><<<1, 1>>>(t1, t2, t3_d);
    cudaMemcpy(&t3_h, t3_d, sizeof(*t3_d), cudaMemcpyDeviceToHost);
    cudaFree(t3_d);
    return t3_h;
}
This file, gpu.cu, only instantiates the foo() function, to make sure the instantiation is available for linking:
#include "gpu.cuh"
template int foo<bar>(int, int);
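If you later need foo() for other type/operator combinations, each one needs its own explicit instantiation line here in gpu.cu. For instance (the float line is just an illustration, not part of the build above):

```cpp
template int   foo<bar>(int, int);
template float foo<bar>(float, float);
```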
In this plain C++ source file, cpu.cpp, we need to make sure we do not trigger the template instantiations ourselves, as that would be a compile error (g++ cannot parse the kernel launch). Instead, we only forward-declare the struct bar and the function foo. The code looks like this:
#include <cstdio>

template <template <typename> class Operator, typename T>
T foo(T t1, T t2);

template <typename T>
struct bar;

int main()
{
    printf("%d \n", foo<bar>(3, 4));
}
This Makefile puts it all together into an executable:
.PHONY: clean all

all: main

clean:
	rm -f *.o main

main: gpu.o cpu.o
	g++ -L/usr/local/cuda/lib64 $^ -lcudart -o $@

gpu.o: gpu.cu
	nvcc -c -arch=sm_20 $< -o $@

cpu.o: cpu.cpp
	g++ -c $< -o $@
Device code is compiled by nvcc, host code by g++, and everything is linked together by g++. Upon running, you see the beautiful result:

7
The key thing to remember here is that kernel launches and kernel definitions have to be in .cu files that are compiled by nvcc. For future reference, I will also leave this link here, on the separation of linking and compilation with CUDA.
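As a rough illustration of what that separate compilation looks like when device code is spread over several .cu files that call each other: each file is compiled with -dc (relocatable device code), the device objects are linked with -dlink, and the result is handed to the host linker. A sketch (a.cu and b.cu are placeholders; adapt -arch and paths to your setup):

```shell
# Hypothetical multi-file device build with relocatable device code.
nvcc -arch=sm_20 -dc a.cu -o a.o
nvcc -arch=sm_20 -dc b.cu -o b.o
nvcc -arch=sm_20 -dlink a.o b.o -o device_link.o
g++ a.o b.o device_link.o -L/usr/local/cuda/lib64 -lcudart -o main
```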