cuda gpu-shared-memory warp-scheduler

CUDA shared memory and block execution scheduling


I would like to clarify how CUDA schedules blocks for execution depending on the amount of shared memory each block uses.

State

I am targeting an NVIDIA GTX 480, which has 15 streaming multiprocessors (SMs) and up to 48 KB of shared memory per block (and per SM). So, if I launch a kernel with 15 blocks, each using 48 KB of shared memory, and no other limit is reached (registers, maximum threads per block, etc.), each block runs on its own SM (one of the 15) until it finishes. In this case, scheduling is only needed between warps of the same block.

Question

So, the scenario I am unsure about is:
I launch a kernel with 30 blocks, so that 2 blocks are assigned to each SM. Now the scheduler on each SM has to deal with warps from different blocks. But because each block uses the SM's entire shared memory (48 KB per SM), the warps of the second block can only execute once the first block has finished. If that were not the case, and warps of different blocks were scheduled on the same SM at the same time, the result could be wrong, because one block could read values that the other block had loaded into shared memory. Am I right?


Solution

  • You don't need to worry about this. As you have correctly said, if only one block fits per SM because of the amount of shared memory used, only one block will be resident on each SM at any one time; the remaining blocks wait until an SM becomes free. So there is no chance of memory corruption caused by overcommitting shared memory.


    BTW for performance reasons it is usually better to have at least two blocks running per SM because

    Of course there may be reasons why more shared memory per block gives a larger speedup than running multiple blocks per SM would.
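    If you want to check how many blocks can be resident on one SM for a given shared memory size, the runtime can report it. The following is a minimal sketch, assuming a toolkit that provides `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (CUDA 6.5 or later). The kernel `scratchKernel`, the block size of 256 threads and the three shared memory sizes are made up for illustration. The kernel is only queried, never launched, and the numbers printed depend on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the question): uses dynamically sized
// shared memory as per-block scratch space.
__global__ void scratchKernel(float *out, int n)
{
    extern __shared__ float scratch[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = static_cast<float>(i);
    __syncthreads();
    if (i < n)
        out[i] = scratch[blockDim.x - 1 - threadIdx.x];
}

int main()
{
    const int threadsPerBlock = 256;  // assumed block size
    // Assumed dynamic shared memory sizes per block, in bytes.
    const size_t sizes[] = {48 * 1024, 24 * 1024, 12 * 1024};

    for (size_t smemBytes : sizes) {
        int blocksPerSM = 0;
        // Ask the runtime how many blocks of this kernel can be resident
        // on one SM with this block size and shared memory request.
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scratchKernel, threadsPerBlock, smemBytes);
        if (err != cudaSuccess) {
            fprintf(stderr, "occupancy query failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        printf("%zu bytes of shared memory per block: %d resident block(s) per SM\n",
               smemBytes, blocksPerSM);
    }
    return 0;
}
```

    On a device whose SMs have 48 KB of shared memory, the 48 KB request should allow only one resident block per SM, which is the situation described in the question. Smaller requests let more blocks share an SM, as long as registers or thread limits do not get in the way first.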