templates recursion cuda nvcc

Template excessive recursion at instantiation cuda


I have a template function whose template parameter is an integer. This integer is used to create different kernels. Previously, all the possible templates were instantiated by hand in a table (it works, but it is ugly), so I tried to use the solution proposed here. Because I have more than 800 possible kernels, the template recursion method is much more elegant. I have tested the template recursion on the C++ version of my code and it works perfectly, but nvcc seems to limit the recursion depth of my instantiation.

Here is a simplified example of my previous, ugly template instantiation list, which works properly (even with the 800 kernel instantiations):

// the template kernel 
template <int i> __global__ void kernel(int some_data)
{
    switch(i)
    {
    case 0:
        // do something
        break;
    case 1:
        // do some other things
        break;
    //...
    case 799:
        // do some other things
        break;
    }
}

typedef void (*kernel_pointer) (int some_data);

// the ugly huge list
kernel_pointer kernel_list[800] = {
    &kernel <0>,
    &kernel <1>,
    //...
    &kernel <799> };

int main()
{
    int kernel_index = 10;
    int the_data = 0;   // placeholder argument for this simplified example

    // the call
    kernel_pointer my_kernel = kernel_list[kernel_index];
    my_kernel<<<1,1>>>(the_data);
}

Here is the beautiful template recursion that nvcc doesn't like. It replaces the list in the previous piece of code:

#define N_KERNELS 800
template< int i> bool dispatch_init( kernel_pointer* pTable )
{
    pTable[i] = &kernel<i>;    
    return dispatch_init<i-1>( pTable );
}    
// edge case of recursion
template<> bool dispatch_init<-1>(kernel_pointer* pTable) { return true; }

// call the recursive function, starting from the last kernel index
const bool initialized = dispatch_init<N_KERNELS-1>( kernel_list );

In reality I don't have a single template parameter but six, which combine to create all the hundreds of combinations; otherwise, a switch with 800 cases would be really stupid. Does anyone have an idea how to increase the nvcc template recursion limit, or know another automatic way to create my list?

Edit: I found the gcc option -ftemplate-depth, which changes the instantiation recursion limit, but I haven't found an equivalent nvcc option.


Solution

  • Following the idea proposed by Robert Crovella, which consists of building the table in a few pieces, here is an example of how to work around the "error":

    #include <cstdio>   // printf

    #define N_KERNELS 850
    // template kernel 
    template <int i> __global__ void kernel(int a)
    {
        switch(i)
        {
        case 0:
            printf("%d\n", a*i);
            break;
        case 1:
            printf("%d\n", a*i);
            break;
        //...
        case 849:
            printf("%d\n", a*i);
            break;
        }
    }
    
    typedef void (*kernel_pointer) (int);
    
    kernel_pointer kernel_list[N_KERNELS];
    
    // Function that instantiates all the needed kernels using recursion.
    template< int i> bool dispatch_init( kernel_pointer* pTable )
    {
        pTable[i] = &kernel<i>;    
        return dispatch_init<i+1>( pTable );
    }    
    
    // Edge cases of recursion made with a template specialization
    template<> bool dispatch_init<199>(kernel_pointer* pTable)
    {
        pTable[199] = &kernel<199>;
        return true;
    }
    template<> bool dispatch_init<399>(kernel_pointer* pTable) 
    {
        pTable[399] = &kernel<399>;
        return true;
    }
    template<> bool dispatch_init<599>(kernel_pointer* pTable)
    {
        pTable[599] = &kernel<599>;
        return true;
    }
    template<> bool dispatch_init<799>(kernel_pointer* pTable)
    {
        pTable[799] = &kernel<799>;
        return true;
    }
    template<> bool dispatch_init<N_KERNELS>(kernel_pointer* pTable) { return true; }
    
    // Call the recursive function a few times to instantiate all the kernels without reaching the recursive instantiation limit.
    // Each call needs its own variable name; redefining "initialized" would not compile.
    const bool initialized_0   = dispatch_init<0  >( kernel_list );
    const bool initialized_200 = dispatch_init<200>( kernel_list );
    const bool initialized_400 = dispatch_init<400>( kernel_list );
    const bool initialized_600 = dispatch_init<600>( kernel_list );
    const bool initialized_800 = dispatch_init<800>( kernel_list );
    
    
    int main()
    {
        int kernel_index = 10;
    
        kernel_pointer my_kernel = kernel_list[kernel_index];
        my_kernel<<<1,1>>>(6);
        cudaDeviceSynchronize();   // wait for the kernel so the device printf output is flushed
    }
    

    I don't like this fix, but it will do the trick for the moment. If an Nvidia developer comes here some day: it would be a good idea to add an option like gcc's -ftemplate-depth to nvcc, no?