Tags: cuda, gpu-shared-memory

Use of a Mixture of Static and Dynamic Shared Memory in Nested Arrays for CUDA Kernels


I sometimes see the following shared memory declaration in CUDA kernels, and I am not sure what it means:

extern __shared__ T shmem[][SZ]

with SZ being a compile-time constant. The kernel is launched as:

kernel<<<grid, block, shared_memory_size>>>()

My questions are:

  • Why could it be useful to have a mixture of dynamic and static shared memory (apart from simplifying the programmer's address computations)?
  • What is the total size of the shared memory buffer shmem? Is it shared_memory_size * SZ?
  • Suppose I can compute shared_memory_size at compile time, how would I have to rewrite the shared memory declaration for it to be static? shared T shmem[SZ*shared_memory_size]?

Solution

  • extern __shared__ T shmem[][SZ]

    The proper syntax is

    extern __shared__ T shmem[][SZ];
    

    Why could it be useful to have a mixture of dynamic and static shared memory (apart from simplifying the programmer's address computations)?

    I believe address computation is the only reason. Of course it also factors into static analysis and the like, but that is really all it is: making your work easier.

    Conversely, if you make SZ a runtime variable even though it could be a compile-time constant, the computational cost also goes up, because the GPU then has to do a full integer multiplication for its address generation instead of optimizing it into cheaper operations such as bit shifts.
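
    As a minimal sketch of how this plays out (the kernel name tile_kernel, the float element type, and SZ = 32 are assumptions for illustration, not from the original post):

    constexpr int SZ = 32; // compile-time inner dimension (a power of two)

    __global__ void tile_kernel(float* out)
    {
        // Dynamic shared memory viewed as a 2D array with a static inner dimension;
        // the outer dimension is fixed at launch via the third <<<...>>> parameter.
        extern __shared__ float shmem[][SZ];

        int row = threadIdx.y; // outer, runtime-sized dimension
        int col = threadIdx.x; // inner, compile-time dimension

        // shmem[row][col] addresses base + (row * SZ + col) * sizeof(float);
        // with SZ a compile-time power of two, row * SZ compiles to a shift.
        shmem[row][col] = static_cast<float>(row * SZ + col);
        __syncthreads();

        out[(blockIdx.x * blockDim.y + row) * SZ + col] = 2.0f * shmem[row][col];
    }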

    What is the total size of the shared memory buffer shmem? Is it shared_memory_size * SZ?

    No, the kernel launch parameter is in bytes, so the size along the outer dimension would be shared_memory_size / (SZ * sizeof(T)). To put it the other way around: if you want shmem[N][SZ] at runtime, then shared_memory_size = N * SZ * sizeof(T).
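
    A hedged host-side sketch of that arithmetic (N, grid, and tile_kernel are illustrative names from the sketch above):

    constexpr int SZ = 32;     // inner dimension, compile-time constant
    int N = 8;                 // outer dimension, chosen at runtime

    // The third launch parameter is in bytes: N * SZ * sizeof(T) for shmem[N][SZ].
    size_t shared_memory_size = static_cast<size_t>(N) * SZ * sizeof(float);

    float* d_out = nullptr;    // assume allocated with cudaMalloc elsewhere
    dim3 grid(1);
    dim3 block(SZ, N);         // one thread per shmem element
    tile_kernel<<<grid, block, shared_memory_size>>>(d_out);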

    Suppose I can compute shared_memory_size at compile time, how would I have to rewrite the shared memory declaration for it to be static? shared T shmem[SZ*shared_memory_size]?

    You mean like this?

    __shared__ T shmem[N][SZ];
    

    Note how it is no longer declared extern.
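
    For completeness, a sketch of the fully static variant (here N is assumed to be a compile-time constant); the third launch parameter is then simply omitted:

    constexpr int N  = 8;   // outer dimension, now known at compile time
    constexpr int SZ = 32;  // inner dimension, as before

    __global__ void tile_kernel_static(float* out)
    {
        // Statically sized shared memory: no extern, no launch parameter needed.
        __shared__ float shmem[N][SZ];
        // ... same body as the dynamic version above ...
    }

    // Launch without a dynamic shared memory size:
    // tile_kernel_static<<<grid, dim3(SZ, N)>>>(d_out);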

    Footnote: In terms of nomenclature, I would not call this mixing static and dynamic shared memory. I believe most people would interpret mixing as using __shared__ T and extern __shared__ T declarations in the same kernel.
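
    For contrast, a sketch of what such actual mixing would look like in one kernel (names are illustrative):

    __global__ void mixed_kernel()
    {
        __shared__ float fixed_buf[SZ];         // static: size fixed at compile time
        extern __shared__ float dynamic_buf[];  // dynamic: size set at launch
        // Both live in the block's shared memory, in distinct regions.
    }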