[SOLVED] CUDA shared memory under the hood questions

CUDA shared memory under the hood questions

I have several questions regarding to CUDA shared memory.

First, as mentioned in this blog post, shared memory may declare in two different ways:

Either dynamically shared memory allocated, like the following

// Lunch the kernel 
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

This may use inside a kernel as mention:

extern __shared__ int s[];

Or static shared memory, which can use in kernel call like the following:

__shared__ int s[64];

Both are use for different reasons, however which one is better and why?

Second, I'm running a multi-block, 256 threads per block kernel. I'm using static shared memory in global and device kernels, both of them use shared memory. An example is given:

__global__ void startKernel(float* p_d_array) 
{
    __shared double matA[3*3];

    float a1 =0 ; 
    float a2 = 0;
    float a3 = 0; 
    float b = p_d_array[threadidx.x]; 
    
    a1 += reduce( b, threadidx.x); 
    a2 += reduce( b, threadidx.x); 
    a3 += reduce( b, threadidx.x); 

    // continue... 
}

 __device__ reduce ( float data , unsigned int tid) 
{
    __shared__ float data[256]; 
    // do reduce ...   
}

I'd like to know how the shared memory is allocated in such case. I presume each block receive its own shared memory.

What's happening when block # 0 goes into the reduce function?

Does the shared memory get allocated in advance to the function call?

I call three different reduce device function, in such case, theoretically in block # 0 , threads # [0,127] may still execute ("delayed due hard word") on the first reduce call, while threads # [128,255] may operate on the second reduce call. In this case, I'd like to know if both reduce function are using the same shared memory?

Even though if they are called from two different function calls ?

On the other hand, Is that possible that a single block may allocated 3*256*sizeof(float) shared memory for both functions calls? That seems superfluous in CUDA manners, but I still want to know how CUDA operates in such case.

Third, is it possible to gain higher performance in shared memory due to compiler optimization using

const float* p_shared ;

or __restrict__ keyword after the data assignment section?

Solution

AFAIR, there is little difference whether you request shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter be it set by your code or by code generated by the compiler.

Re: 2nd, compiler will sum the shared memory requirement from the kernel function and functions called by kernel.