I compile axpy.cu with the following command.
nvcc --cuda axpy.cu -o axpy.cu.cpp.ii
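For reference, a minimal kernel with the signature that the mangled name _Z4axpyfPfS_ encodes would look like the sketch below (the exact body is not important for my question, only the signature):

// Illustrative kernel only: the point is the signature axpy(float, float*, float*),
// which mangles to _Z4axpyfPfS_ in the generated host code.
__global__ void axpy(float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];
}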
Within axpy.cu.cpp.ii, I observe that the function cudaLaunchKernel, nested in __device_stub__Z4axpyfPfS_, accepts a function pointer to axpy, which is defined in axpy.cu.cpp.ii. My confusion is: shouldn't cudaLaunchKernel have been passed a function pointer to the kernel function axpy? Why is there a function definition with the same name as the kernel function? Any help would be appreciated! Thanks in advance.
Both functions are shown below.
void __device_stub__Z4axpyfPfS_(float __par0, float *__par1, float *__par2){
    void * __args_arr[3];
    int __args_idx = 0;
    __args_arr[__args_idx] = (void *)(char *)&__par0;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par1;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par2;
    ++__args_idx;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void ( *)(float, float *, float *))axpy));
        dim3 __gridDim, __blockDim;
        size_t __sharedMem;
        cudaStream_t __stream;
        if (__cudaPopCallConfiguration(&__gridDim, &__blockDim, &__sharedMem, &__stream) != cudaSuccess)
            return;
        if (__args_idx == 0) {
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[__args_idx], __sharedMem, __stream);
        } else {
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[0], __sharedMem, __stream);
        }
    };
}
void axpy( float __cuda_0,float *__cuda_1,float *__cuda_2)
# 3 "axpy.cu"
{
    __device_stub__Z4axpyfPfS_( __cuda_0,__cuda_1,__cuda_2);
}
My confusion is: shouldn't cudaLaunchKernel have been passed a function pointer to the kernel function axpy?
No, because that isn't the design NVIDIA chose, and your assumptions about how this function works are probably not correct. As I understand it, the first argument to cudaLaunchKernel is treated as a key, not as a function pointer that is called directly. Elsewhere in the nvcc-emitted code you will find boilerplate something like this:
static void __nv_cudaEntityRegisterCallback( void **__T0)
{
    __nv_dummy_param_ref(__T0);
    __nv_save_fatbinhandle_for_managed_rt(__T0);
    __cudaRegisterEntry(__T0, ((void ( *)(float, float *, float *))axpy), _Z4axpyfPfS_, (-1));
}
You can see that the __cudaRegisterEntry call takes both a static pointer to axpy and the mangled symbol of the compiled GPU kernel. __cudaRegisterEntry is an internal, completely undocumented function of the CUDA runtime API. Many years ago, by reverse engineering an earlier version of the runtime API, I satisfied myself that there is an internal lookup mechanism which allows the correct instance of a host kernel stub to be matched to the correct instance of the compiled GPU code at runtime. The compiler emits a large amount of boilerplate and statically defined objects holding all the definitions necessary to make the runtime API work seamlessly, without the additional API overhead that you need in the CUDA driver API or comparable compute APIs such as OpenCL.
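To illustrate the idea only (the names and data structures below are made up for the sketch, not the real runtime internals), the effect is as if the runtime maintained a lookup table keyed on the host stub address:

#include <cstdio>
#include <map>
#include <string>

// Hypothetical illustration: map from the address of the host stub (used as an
// opaque key) to the mangled name of the device code registered for it. The
// real runtime structures and function names are undocumented and different.
static std::map<const void *, std::string> kernel_registry;

// Roughly what the __cudaRegisterEntry boilerplate achieves: associate the
// host-side axpy stub with the device symbol _Z4axpyfPfS_ from the fatbinary.
void register_entry(const void *host_stub, const char *device_symbol)
{
    kernel_registry[host_stub] = device_symbol;
}

// Roughly what cudaLaunchKernel does with its first argument: use it as a
// lookup key to find the matching compiled GPU code, then launch that code.
void launch_by_key(const void *host_stub)
{
    auto it = kernel_registry.find(host_stub);
    if (it != kernel_registry.end())
        std::printf("launching device code for %s\n", it->second.c_str());
}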
Why is there a function definition with the same name as the kernel function?
Because that is how NVIDIA decided to implement the runtime API. What you are asking about are undocumented, internal implementation details of the runtime API. You as a programmer are not supposed to see or use them, and they are not guaranteed to stay the same from version to version.
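The host-side definition of axpy exists so that an ordinary-looking call has something to resolve to. If I remember the transformed code correctly (the exact form varies between CUDA versions, and the grid/block values here are placeholders), a launch in your source is rewritten into plain host code roughly like this:

// Approximate paraphrase, not verbatim nvcc output, of what is emitted for
//     axpy<<<1, n>>>(a, x, y);
// The execution configuration is pushed, then the ordinary host function axpy
// is called. That host axpy forwards to __device_stub__Z4axpyfPfS_, which pops
// the configuration again and finally calls cudaLaunchKernel.
(__cudaPushCallConfiguration(dim3(1), dim3(n), 0, 0)) ? (void)0 : axpy(a, x, y);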
If, as indicated in comments, you want to implement some additional compiler infrastructure in the CUDA compilation process, use the CUDA driver API, not the runtime API.
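For comparison, a minimal driver API launch of the same kernel looks something like the sketch below (error checking omitted; the axpy.ptx file name and the use of the mangled symbol _Z4axpyfPfS_ are assumptions about how you would package and name the device code):

#include <cuda.h>

int main()
{
    cuInit(0);

    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load the device code explicitly (from a PTX/cubin you built yourself)
    // and look the kernel up by its mangled name: no hidden registration.
    CUmodule mod;
    CUfunction axpy;
    cuModuleLoad(&mod, "axpy.ptx");
    cuModuleGetFunction(&axpy, mod, "_Z4axpyfPfS_");

    const int n = 1024;
    CUdeviceptr x, y;
    cuMemAlloc(&x, n * sizeof(float));
    cuMemAlloc(&y, n * sizeof(float));

    float a = 2.0f;
    void *args[] = { &a, &x, &y };
    cuLaunchKernel(axpy, 1, 1, 1,   // grid dimensions
                   n, 1, 1,         // block dimensions
                   0, NULL,         // shared memory, stream
                   args, NULL);
    cuCtxSynchronize();

    cuMemFree(x);
    cuMemFree(y);
    cuCtxDestroy(ctx);
    return 0;
}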