c++ccudaptxas

CUDA: --ptxas-options=-v shared memory and cudaFuncAttributes.sharedSizeBytes do not match


I'm trying to use reqRegs and sharedSizeBytes in the cudaFuncAttributes struct to dynamically optimise a kernel's block size at runtime.

My current implementation skims the stdout text from nvcc --ptxas-options=-v to discover the register and shared memory usage of the kernel. This method is a little hacky, and is dependent on the exact format of output text from --ptxas-options=-v, which could change without warning.

My problem is that I'm seeing a discrepancy between the 'smem' shared memory value reported in the --ptxas-options=-v output, and sharedSizeBytes in the cudaFuncAttributes struct, which has me worried that either the shared memory estimation I have been using until now is wrong, or that the sharedSizeBytes variable is unreliable, meaning I cannot use it for the purposes of runtime block size optimisation. Here is the output of nvcc --ptxas-options=-v for one such kernel ...

ptxas info    : Used 14 registers, 2088 bytes smem, 48 bytes cmem[1]

... compared with the value of cudaFuncAttributes.sharedSizeBytes = 296 at runtime, for the exact same kernel. Does anybody know what could be happening here?

Here is another example with a different kernel:

ptxas info    : Used 18 registers, 2132 bytes smem, 48 bytes cmem[1]

where cudaFuncAttributes.sharedSizeBytes = 340 at runtime.

Thanks.


Solution

  • Thank you Robert and Marco for your replies. They helped me to rule out a few cases.

    It turns out that the mismatch in reported shared memory usage resulted from the amount of shared memory used after the first test compile (reported by --ptxas-options=-v) being different from the amount of shared memory being used by the final program with revised block size (reported by cudaFuncAttributes.sharedSizeBytes). (EDIT for clarity)

    The shared memory difference was caused by shared memory array allocations being dependent on block size; for example:

    __shared__ float myArray[BLOCK_SIZE];
    

    The statement above uses a different amount of shared memory, in a program with a block size of 256, than the same source code compiled with an optimised block size of 192. It seems obvious now, but something to watch out for in optimised CUDA code generation.