Tags: c++, cuda, constants, constexpr

C++ built-in constexpr vs CUDA __constant__ for a higher-dimensional array


I recently needed to implement a Sobol sequence generator in CUDA.

To pre-store the Sobol direction-number table, I can either use a native C++ constexpr array, like below:

constexpr unsigned int sobol_table[NUM_DIMENSIONS][32] = {
            {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
            {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295}
            // skipped for higher dim...
};

Or use a CUDA __constant__ array, like below:

__constant__ unsigned int sobol_table[NUM_DIMENSIONS][32] = {
    {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
    {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
    // skipped for higher dim...
};

I wonder what the tradeoffs are. Is there anything I need to worry about other than the increased binary size with constexpr, GPU constant memory usage, etc.?

N.B. I am aware that cuRAND has native support for this. The reason I need to reinvent the wheel is to reproduce the previous behavior of the tool, which may have relied on a possibly mutated Sobol sequence throughout its life cycle...


Solution

  • I think the key thing not to forget in your question is that in CUDA, __constant__ is not just a language qualifier like const or constexpr in C++; it is a memory space specifier. It means that the data for that variable is stored in a dedicated, reserved bank of DRAM on the GPU, and the device code compiler emits constant-memory load instructions in PTX whenever that variable is read, which access the memory through a small read-only cache. This is very efficient when every thread in a warp needs the same data values in lock-step (for example, flexible finite-difference stencils, or quadrature rules where loops within every thread consume coefficients or weights uniformly). But it still entails DRAM access, which is inherently slower than an inline constant in code, and it can become a performance problem if the access pattern is not uniform across the warp, because non-uniform constant-memory reads are serialized and cache misses add latency. Whether that matters requires use-dependent analysis.
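
    To make the access-pattern point concrete, here is a hedged sketch (the kernel name and the branch-free selection trick are my own, not from the question) of the lock-step read pattern the constant cache is built for, assuming the __constant__ sobol_table from the question is in scope:

    ```cuda
    // Sketch: each thread computes Sobol point number tid for dimension 0 via
    // the Gray-code construction: x is the XOR of the direction numbers
    // selected by the set bits of gray(tid) = tid ^ (tid >> 1). The loop
    // index 'bit' is identical across the warp, so every read of
    // sobol_table[0][bit] is a warp-uniform address that the constant cache
    // can broadcast in a single transaction.
    __global__ void sobol_dim0_kernel(unsigned int *out, unsigned int n_points)
    {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n_points) return;

        unsigned int gray = tid ^ (tid >> 1);
        unsigned int x = 0;
        for (int bit = 0; bit < 32; ++bit) {
            unsigned int v = sobol_table[0][bit];   // warp-uniform read
            x ^= v & (0u - ((gray >> bit) & 1u));   // branch-free select
        }
        out[tid] = x;
    }
    ```

    If the second index depended on threadIdx instead, each distinct address within a warp would be serviced by a separate, serialized constant-cache transaction, which is exactly where the performance of __constant__ falls off.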

    On the other hand, in the constexpr case things are somewhat simpler: you are relying on all the C++ language-lawyer goodness to signal to the compiler that only compile-time constant expressions are involved, and letting it go ahead and (potentially) fold the values directly into the emitted code.
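
    As a host-side illustration of that folding (a minimal sketch; sobolNth and the two-row table excerpt are mine, not from the question), a constexpr table can be consumed by a constexpr function, so whole prefixes of the sequence can be evaluated at compile time and checked with static_assert:

    ```cpp
    #include <cassert>

    // Two-dimension excerpt of the question's table (real code would carry all
    // NUM_DIMENSIONS rows). Row 0 holds v[k] = 2^k, row 1 holds v[k] = 2^(k+1)-1.
    constexpr unsigned int kSobolTable[2][32] = {
        {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
        {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
    };

    // Gray-code Sobol recurrence: x_{n+1} = x_n ^ v[c], where c is the index
    // of the lowest zero bit of n. Being constexpr, the whole loop can fold
    // away when n is a compile-time constant.
    constexpr unsigned int sobolNth(int dim, unsigned int n) {
        unsigned int x = 0;
        for (unsigned int i = 0; i < n; ++i) {
            unsigned int c = 0, m = i;
            while (m & 1u) { m >>= 1; ++c; }  // count trailing one bits of i
            x ^= kSobolTable[dim][c];
        }
        return x;
    }

    // Evaluated entirely at compile time: with v[k] = 2^k, x_n is gray(n).
    static_assert(sobolNth(0, 5) == (5u ^ (5u >> 1)), "Gray-code identity");

    int main() {
        for (unsigned int n = 0; n < 256; ++n)
            assert(sobolNth(0, n) == (n ^ (n >> 1)));  // dimension 0: Gray code
        return 0;
    }
    ```

    This is only a sketch of the mechanism, not of a production Sobol generator, but it shows what "constexpr all the way down" buys you: the compiler is entitled to replace any call with a compile-time constant and reject any table access that is not a constant expression.
    
    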

    For the specific use case of Sobol quasi-random (low-discrepancy) sequence generators, I can't comment on which approach is "better" because I have never bothered to study their internals and just use known "good" libraries (paraphrasing Von Neumann, random number generation is too important to be left to chance).

    In the end, the only way to be sure is to look at the PTX the compiler emits for each version, benchmark both, and then make a choice based on whatever purist or pragmatist criteria you wish to apply.


    The final triviality is that (traditionally), this:

    __constant__ unsigned int sobol_table[NUM_DIMENSIONS][32] = {
        {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
        {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
        // skipped for higher dim...
    };
    

    was actually illegal in CUDA, because you are initializing device memory (in the case of __constant__) from host code. The usual way to do this would be:

    __constant__ unsigned int sobol_table[NUM_DIMENSIONS][32];
    const unsigned int sobol_table_data[NUM_DIMENSIONS][32] = {
        {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, 2147483648},
        {1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823, 2147483647, 4294967295},
        // skipped for higher dim...
    };
    
    cudaMemcpyToSymbol(sobol_table, sobol_table_data, sizeof(sobol_table_data));
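
    Note that cudaMemcpyToSymbol must execute in host code, not at namespace scope. A hedged sketch of a wrapper (the function name initSobolTable is made up) with error checking, assuming the two declarations above:

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical host-side initializer: copy the host table into the
    // __constant__ symbol once at startup, before any kernel reads it.
    void initSobolTable()
    {
        cudaError_t err = cudaMemcpyToSymbol(sobol_table, sobol_table_data,
                                             sizeof(sobol_table_data));
        if (err != cudaSuccess)
            std::fprintf(stderr, "cudaMemcpyToSymbol failed: %s\n",
                         cudaGetErrorString(err));
    }
    ```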
    

    It seems that at some point (perhaps as part of the CUDA 10 release cycle, although I couldn't find any mention of it in the documentation or changelogs), the invisible host-side boilerplate generated by the toolchain was improved: there is now an internal function, __nv_cudaEntityRegisterCallback, which calls the undocumented __cudaRegisterVar runtime API function for each device variable, so that the variable is automagically set up and initialized from the embedded fat-binary data when the runtime API context is lazily established. For roughly the first ten years of CUDA, the directly initialized version would emit a compiler error, or a warning, and simply not work if it did compile.