Tags: cuda, nvcc, ptx, compute-capability

CUDA device properties and compute capability when compiling


Let's assume I have code that lets the user pass the threads_per_block value used to launch the kernel. I then want to check whether the input is valid (e.g. <= 512 for compute capability (CC) < 2.0 and <= 1024 for CC >= 2.0).

Now I wonder what happens if I compile the code with nvcc -arch=sm_13, run it on a graphics card with CC 2.0, and the user passes threads_per_block == 1024. Is this:

Or does nvcc -arch=sm_13 just mean that CC 1.3 is the minimum requirement, so that when the code runs on a device with a higher CC, the higher limits can still be used?


Solution

  • From the nvcc manual:

    -arch

    The architecture specified by this option is the architecture that is assumed by the compilation chain up to the ptx stage, ...

    This means it specifies what PTX features (like special instructions) the compiler can use. The maximum number of threads per block is not specified by the PTX ISA, and thus this compiler parameter is not relevant to the problem you're trying to solve.

    The best way to check whether threads_per_block is valid is to launch the kernel and check whether any errors occur.
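A minimal sketch of "launch and check for errors", assuming the CUDA runtime API: an invalid launch configuration (such as too many threads per block) is reported by `cudaGetLastError` right after the launch, while errors raised during execution only appear after `cudaDeviceSynchronize`. The kernel `probe_kernel` and the helper `try_launch` are hypothetical names used for illustration; in real code you would check the launch of your own kernel the same way.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel used only to probe a launch configuration.
__global__ void probe_kernel() {}

// Hypothetical helper: launches probe_kernel with a single block of
// `threads_per_block` threads and returns the first error reported,
// or cudaSuccess if the launch and execution succeeded.
cudaError_t try_launch(int threads_per_block)
{
    (void)cudaGetLastError();  // clear any earlier, unrelated non-sticky error

    probe_kernel<<<1, threads_per_block>>>();

    // Launch-configuration errors are reported immediately.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        return err;
    }
    // Errors raised while the kernel runs only show up after synchronizing.
    return cudaDeviceSynchronize();
}
```

Passing the result to `cudaGetErrorString` gives a readable message; when the block size exceeds what the device (or kernel) supports, the error is typically `cudaErrorInvalidConfiguration` ("invalid configuration argument").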