
When should I use CUDA's built-in warpSize, as opposed to my own proper constant?


nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant, but if you try to declare an array of length warpSize, the compiler complains that it is non-const (with CUDA 7.5).

So, at least for that purpose, you are motivated to have something like:

enum : unsigned int { warp_size  = 32 };

somewhere in your headers. But which should I prefer, and when: warpSize or warp_size?

Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.


Solution

  • Contrary to talonmies's answer, I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language; on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).

    The day we get a different warp size (e.g. 64), many things will change.

    At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time constant. Also, if the amount of shared memory depends on warpSize, you are forced to use dynamically allocated shared memory (as per talonmies's answer). However, the syntax for that is inconvenient, especially when you have several arrays: you have to do the pointer arithmetic yourself and manually compute the total memory usage.
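To illustrate the bookkeeping that dynamically allocated shared memory entails, here is a minimal sketch with two per-warp arrays packed into one buffer. The names `SharedLayout`, `make_layout`, and `kernel_dynamic` are hypothetical and not from the original answer, and the sketch assumes one warp per block. The kernel is guarded by `__CUDACC__` so that the host-side size arithmetic also compiles with an ordinary C++ compiler.

```cpp
#include <cstddef>

// Hypothetical layout of two arrays packed into one dynamic shared-memory buffer.
struct SharedLayout {
    std::size_t floats_offset;  // byte offset of the float array
    std::size_t ints_offset;    // byte offset of the int array
    std::size_t total_bytes;    // size to pass as the third launch parameter
};

// Computes the offsets by hand. A static declaration such as
// `__shared__ float a[warp_size];` would make this arithmetic unnecessary.
inline SharedLayout make_layout(std::size_t warp_size_runtime) {
    SharedLayout l{};
    l.floats_offset = 0;
    l.ints_offset   = l.floats_offset + warp_size_runtime * sizeof(float);
    l.total_bytes   = l.ints_offset + warp_size_runtime * sizeof(int);
    return l;
}

#ifdef __CUDACC__
// Assumption: launched with exactly one warp per block.
__global__ void kernel_dynamic(std::size_t ints_offset) {
    extern __shared__ unsigned char smem[];  // one untyped buffer for everything
    float* a = reinterpret_cast<float*>(smem);
    int*   b = reinterpret_cast<int*>(smem + ints_offset);
    unsigned lane = threadIdx.x % warpSize;
    a[lane] = 0.0f;
    b[lane] = 0;
}
// Host-side launch, with the warp size queried at run time
// (for example from cudaDeviceProp::warpSize):
//   SharedLayout l = make_layout(32);
//   kernel_dynamic<<<grid, 32, l.total_bytes>>>(l.ints_offset);
#endif
```

Every additional array adds another offset to compute and another cast inside the kernel, which is the inconvenience described above.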

    Passing warp_size as a template parameter is a partial solution, but it adds a layer of syntactic complexity at every function call:

    deviceFunction<warp_size>(params)
    

    This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
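As a rough illustration of that overhead, the sketch below threads a warp-size template parameter through three small helpers. The helper names (`lane_of`, `warp_of`, `slot_of`) and the `HD` macro are hypothetical and exist only for this example. The macro expands to nothing outside nvcc, so the code also compiles as plain C++.

```cpp
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Index of a thread within its warp.
template <unsigned WarpSize>
HD constexpr unsigned lane_of(unsigned thread_index) {
    return thread_index % WarpSize;
}

// Index of the warp a thread belongs to.
template <unsigned WarpSize>
HD constexpr unsigned warp_of(unsigned thread_index) {
    return thread_index / WarpSize;
}

// Mirrors a thread's position within its own warp.
// Every nested call has to repeat the template argument.
template <unsigned WarpSize>
HD constexpr unsigned slot_of(unsigned thread_index) {
    return warp_of<WarpSize>(thread_index) * WarpSize
         + (WarpSize - 1 - lane_of<WarpSize>(thread_index));
}
```

A caller then writes `slot_of<warp_size>(threadIdx.x)` rather than `slot_of(threadIdx.x)`, and the same applies to every function further down the call chain.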


    My suggestion would be to have a single header that controls all the model-specific constants, e.g.

    #if __CUDA_ARCH__ <= 600
    //all devices of compute capability <= 6.0
    static const int warp_size = 32; 
    #endif
    

    Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.
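A minimal sketch of how code might consume such a header follows. The helper `warps_needed` and the kernel `kernel_static` are hypothetical names introduced for this example, and the kernel assumes it is launched with exactly `warp_size` threads per block. During host compilation `__CUDA_ARCH__` is undefined and evaluates to 0 in the `#if`, so the constant is visible to host code as well.

```cpp
// Contents of the hypothetical constants header, as suggested above.
#if __CUDA_ARCH__ <= 600
static const int warp_size = 32;
#endif

// Because warp_size is a true compile-time constant, it can appear in
// static assertions, constant expressions, and array bounds.
static_assert((warp_size & (warp_size - 1)) == 0,
              "warp_size is expected to be a power of two");

// Number of warps needed to cover a given number of threads (host-side helper).
constexpr int warps_needed(int threads) {
    return (threads + warp_size - 1) / warp_size;
}

#ifdef __CUDACC__
// Reverses each block's chunk of `in` into `out`.
// Assumption: launched with warp_size threads per block.
__global__ void kernel_static(const float* in, float* out) {
    __shared__ float scratch[warp_size];  // static size, no launch parameter needed
    const int lane = threadIdx.x % warp_size;
    const int base = blockIdx.x * warp_size;
    scratch[lane] = in[base + lane];
    __syncthreads();
    out[base + lane] = scratch[warp_size - 1 - lane];
}
#endif
```

Compared with the dynamic shared memory approach, the array is declared where it is used, and no size has to be passed at launch.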