[SOLVED] Weird behaviour of CUDA recursion

Weird behaviour of CUDA recursion

In the following minimal reproducible example, when the recursion in device_func is active, the __synchthreads() barrier is ignored, and when debugged, breakpoint 2 occurs before breakpoint 1. If the recursion is removed, it works as expected. How could this be? The code is compiled with nvcc -arch=sm_61 -G -g example.cu for an NVIDIA Quadro P600, using CUDA Toolkit 12.5.

#include <cstdint>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__device__ void device_func(uint32_t *octants, size_t const level)
{
    if (threadIdx.x == 0)
    {
        octants[0] = 0; //! breakpoint 1
    }
    __syncthreads();
    printf("first octant: %d\n", octants[0]); //! breakpoint 2
    //! if used, breakpoint 2 occurs before breakpoint 1
    if (level < 2)
    {
        device_func(octants, level + 1);
    };
}

__global__ void kernel()
{
    __shared__ uint32_t octants[9];

    if (threadIdx.x == 0)
    {
        octants[0] = 99999;
    }

    __syncthreads();
    device_func(octants, 0);
    return;
}

void checkCudaError(cudaError_t err, const char *msg)
{
    if (err != cudaSuccess)
    {
        fprintf(stderr, "CUDA error at %s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    kernel<<<1, 32>>>();

    // Check for kernel launch errors
    checkCudaError(cudaGetLastError(), "Kernel launch");

    // Wait for the kernel to finish executing
    checkCudaError(cudaDeviceSynchronize(), "Kernel execution");

    return 0;
}

Here is the debug session:

$ nvcc -o example -arch=sm_61 -G -g example.cu 
ptxas warning : Stack size for entry function '_Z6kernelv' cannot be statically determined
$ cuda-gdb ./example
NVIDIA (R) cuda-gdb 12.5
Portions Copyright (C) 2007-2024 NVIDIA Corporation
Based on GNU gdb 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This CUDA-GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://forums.developer.nvidia.com/c/developer-tools/cuda-developer-tools/cuda-gdb>.
Find the CUDA-GDB manual and other documentation resources online at:
    <https://docs.nvidia.com/cuda/cuda-gdb/index.html>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./example...
(cuda-gdb) b 11
Breakpoint 1 at 0xabfc: file /home/marco/phase_2_barnes_hut/example.cu, line 20.
(cuda-gdb) b 14
Note: breakpoint 1 also set at pc 0xabfc.
Breakpoint 2 at 0xabfc: file /home/marco/phase_2_barnes_hut/example.cu, line 20.
(cuda-gdb) r
Starting program: /home/marco/phase_2_barnes_hut/example 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6dff000 (LWP 1519)]
[New Thread 0x7ffff5b21000 (LWP 1520)]
[Detaching after fork from child process 1521]
[New Thread 0x7ffff5320000 (LWP 1529)]
[Thread 0x7ffff5320000 (LWP 1529) exited]
[New Thread 0x7ffff5320000 (LWP 1530)]
[New Thread 0x7fffe1fff000 (LWP 1532)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]

CUDA thread hit Breakpoint 2, device_func (octants=0x7fffe5000000, level=0) at example.cu:14
14          printf("first octant: %d\n", octants[0]); //! breakpoint 2
(cuda-gdb)

Solution

After filing a bug, I was told that this is expected for a Pascal GPU, since divergent threads are not supported. More info here.