[SOLVED] CUDA shared memory occupancy

If I have 48KB of shared memory per SM and I write a kernel that allocates 32KB of shared memory per block, does that mean only 1 block can be running on one SM at a time?

Solution

Yes, that is correct.

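The shared memory on an SM must cover the combined "footprint" of all threadblocks "resident" on it. A threadblock can only be launched on an SM that has enough free shared memory to support it. If there is not enough, the threadblock waits until a presently executing threadblock has finished and released its allocation. With 48KB per SM and 32KB per threadblock, only 16KB is left once the first threadblock is resident, so a second one cannot start.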

Maxwell GPUs (cc 5.0, 5.2) add some nuance. These GPUs provide either 64KB (cc 5.0) or 96KB (cc 5.2) of shared memory per SM. The maximum shared memory available to a single threadblock is still limited to 48KB, but multiple threadblocks on a single SM may use more than 48KB in aggregate. This means a cc 5.2 SM could support 2 threadblocks, even if both were using 32KB of shared memory.
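
The arithmetic behind both cases can be sketched in a few lines of host-side C++. This is only a rough illustration, not something taken from the CUDA toolkit: the function name `max_blocks_by_shared_memory` and its parameters are made up for this example, and it ignores allocation granularity as well as the other occupancy limits (registers, threads per SM, the hardware cap on resident blocks). For the real figure on a given device, the CUDA runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the thing to ask.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KB = 1024;

// Hypothetical helper: upper bound on resident threadblocks per SM that
// shared memory alone allows. Other limits can only lower this number.
//   smem_per_sm      - shared memory available on one SM, in bytes
//   smem_block_limit - maximum shared memory a single threadblock may use
//   smem_per_block   - shared memory the kernel requests per threadblock
std::size_t max_blocks_by_shared_memory(std::size_t smem_per_sm,
                                        std::size_t smem_block_limit,
                                        std::size_t smem_per_block) {
    // A request of zero means shared memory imposes no limit at all,
    // so this helper is only meaningful for a non-zero request.
    assert(smem_per_block > 0);

    // A threadblock asking for more than the per-block limit cannot launch.
    if (smem_per_block > smem_block_limit) {
        return 0;
    }

    // Resident threadblocks must fit side by side in the SM's shared memory.
    return smem_per_sm / smem_per_block;
}
```

With the numbers from the question, `max_blocks_by_shared_memory(48 * KB, 48 * KB, 32 * KB)` gives 1. With the Maxwell figures above, a 64KB SM gives 2 and a 96KB SM gives 3 by this arithmetic, so the two 32KB threadblocks mentioned for cc 5.2 do fit as far as shared memory is concerned.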