cuda linker-errors nvcc gpu-cooperative-groups

CUDA Cooperative Groups: linking error


After reading about Cooperative Groups in CUDA 9, I've been trying to synchronize at the grid level.

I'm using Visual Studio 2017, a GTX 1060 and CUDA 9.1.

I altered my code as follows:

__global__ void ExplicitKernel_American(/* ... */) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    auto grid = cooperative_groups::this_grid();
    if (i < sizeS) {
        //...
        for (int j = 1; j < sizeT; ++j) {
            // ...
            grid.sync(); // __syncthreads();
        }
    }
}

As stated in the documentation, I launch the kernel this way:

void* Explicit_Args[] = { &PDE_Grid, /* ... */, &sizeS, &sizeT };
cudaLaunchCooperativeKernel(
    (void*)ExplicitKernel_American, 
    dim3((sizeS + TPB - 1) / TPB), 
    dim3(TPB),  
    Explicit_Args
); // TPB being 256...

Unfortunately, I get linking errors as soon as I add the "grid" part in the kernel.

error LNK2001: unresolved external symbol __fatbinwrap_38_cuda_device_runtime_compute_70_cpp1_ii_8b1a5d37
fatal error LNK1120: 1 unresolved externals

I've set -rdc=true and sm_61, but I cannot work out why it does not link. Any ideas?

Many thanks!


Solution

  • A cooperative kernel launch (cooperative groups, CG) requires a Pascal or Volta GPU, and either Linux or a Windows device operating in TCC mode. If you test the cooperativeLaunch property in the device properties structure (cudaDeviceProp), I think you will find that it is not supported on your GPU operating in WDDM mode.

    It's good practice to test this property in your code, before attempting to use a cooperative grid launch.

    The issue you are asking about is a compile/link issue, however. For that, my recommendation is to study a cooperative groups (CG) sample code, such as 6_Advanced/reductionMultiBlockCG. For grid sync, it's definitely a requirement to set -rdc=true (i.e. enable relocatable device code linking). Depending on how you set -rdc=true, it may not be applied to your project correctly. The correct methodology is outlined here

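    The proximate issue appears to be that you are not correctly linking against the device runtime library, e.g. -lcudadevrt (cudadevrt.lib on Windows). The unresolved symbol in your error contains cuda_device_runtime, which points to that library.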

    EDIT: More recent CUDA releases may support this on GPUs in WDDM mode. Best practice is still to decide support at run time, based on the cooperativeLaunch property mentioned above.
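The points above can be put together in a small sketch: query cooperativeLaunch before attempting the launch, and build with relocatable device code while linking the device runtime. This is only an illustration under stated assumptions, not the asker's actual code: the kernel stepKernel, its arguments, the sizes, and the file name coop_check.cu are all hypothetical. The sketch also places grid.sync() outside the bounds check, since every thread in the grid has to reach a grid-wide barrier.

```cuda
// Assumed build line (hypothetical file name; flags are the ones discussed above):
//   nvcc -arch=sm_61 -rdc=true coop_check.cu -o coop_check -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical stand-in for the real kernel.
__global__ void stepKernel(float* data, int n, int steps) {
    cg::grid_group grid = cg::this_grid();
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    for (int j = 1; j < steps; ++j) {
        if (i < n) {
            data[i] += 1.0f;  // placeholder for the real per-step update
        }
        grid.sync();  // reached by every thread, including those with i >= n
    }
}

int main() {
    const int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        std::fprintf(stderr, "could not query device %d\n", device);
        return 1;
    }
    // Decide support from the device property, not from the GPU model.
    if (!prop.cooperativeLaunch) {
        std::fprintf(stderr, "%s: cooperative launch not supported\n", prop.name);
        return 1;
    }

    int n = 1024;      // hypothetical problem size
    int steps = 4;     // hypothetical number of time steps
    const int TPB = 256;
    float* data = nullptr;
    if (cudaMalloc(&data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(data, 0, n * sizeof(float));

    // Arguments are passed as an array of pointers, in kernel-parameter order.
    void* args[] = { &data, &n, &steps };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void*)stepKernel, dim3((n + TPB - 1) / TPB), dim3(TPB), args, 0, 0);
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cooperative launch failed: %s\n",
                     cudaGetErrorString(err));
    }
    cudaFree(data);
    return err == cudaSuccess ? 0 : 1;
}
```

A cooperative launch also fails if the grid is too large for all of its blocks to be resident on the GPU at once, so checking the returned cudaError_t is worthwhile even when the property test passes. In Visual Studio, the same two build requirements would correspond to enabling relocatable device code in the CUDA project settings and adding cudadevrt.lib to the linked libraries; the exact property-page names depend on the CUDA toolkit integration in use.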