Tags: c++, cuda

Using vector types vs custom structures for 256-bit numbers in CUDA


I’m working on implementing 256-bit number arithmetic in CUDA for operations like addition with carry propagation, subtraction with borrow, and comparison. Initially, I defined a custom structure to represent the 256-bit number:

struct big_number_t {
    unsigned long long s0; // Least significant limb
    unsigned long long s1;
    unsigned long long s2;
    unsigned long long s3; // Most significant limb
};

However, I came across references to vector types in CUDA and found there are built-in types like ulong4 and ulonglong4. Now I'm wondering whether it's better to use ulonglong4 or to stick with my custom structure for representing 256-bit numbers. My primary concerns are:

  1. ulong4 or ulonglong4: I looked at the CUDA 12.8 documentation (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types-alignment-requirements-in-device-code) and was unable to determine which of these types is the correct one to use.
  2. Memory access efficiency: Will using vector types lead to better coalesced memory access or more optimized loads/stores compared to the custom structure?
  3. Computational performance: For operations that require carry propagation (like addition and subtraction across the four limbs), is there any advantage to using a vector type over the custom structure? Since carry propagation is sequential, I'm not sure vector types offer any benefit here.

Below is an example of how I might utilize the custom structure in code:

struct big_number_t {
    unsigned long long s0;
    unsigned long long s1;
    unsigned long long s2;
    unsigned long long s3;
};

__device__ big_number_t bn_add(const big_number_t& a, const big_number_t& b, unsigned long long& carry_out) {
    big_number_t result;
    unsigned long long carry = 0;

    // Limb 0: on unsigned wraparound, the sum is smaller than either operand.
    result.s0 = a.s0 + b.s0;
    carry = (result.s0 < a.s0) ? 1 : 0;

    // Limbs 1..3: with an incoming carry, sum == a.sN also indicates overflow
    // (b.sN == ULLONG_MAX plus carry == 1 wraps back to exactly a.sN).
    unsigned long long sum = a.s1 + b.s1 + carry;
    carry = (sum < a.s1 || (carry && sum == a.s1)) ? 1 : 0;
    result.s1 = sum;

    sum = a.s2 + b.s2 + carry;
    carry = (sum < a.s2 || (carry && sum == a.s2)) ? 1 : 0;
    result.s2 = sum;

    sum = a.s3 + b.s3 + carry;
    carry = (sum < a.s3 || (carry && sum == a.s3)) ? 1 : 0;
    result.s3 = sum;

    carry_out = carry;
    return result;
}
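
For completeness, the subtraction with borrow would follow the same limb-by-limb pattern. This is just a sketch of what I have in mind, not tested code:

__device__ big_number_t bn_sub(const big_number_t& a, const big_number_t& b, unsigned long long& borrow_out) {
    big_number_t result;
    unsigned long long borrow = 0;

    // Limb 0: a borrow occurs iff the subtrahend exceeds the minuend.
    result.s0 = a.s0 - b.s0;
    borrow = (a.s0 < b.s0) ? 1 : 0;

    // Limbs 1..3: also borrow when the limbs are equal and a borrow is pending.
    unsigned long long diff = a.s1 - b.s1 - borrow;
    borrow = (a.s1 < b.s1 || (borrow && a.s1 == b.s1)) ? 1 : 0;
    result.s1 = diff;

    diff = a.s2 - b.s2 - borrow;
    borrow = (a.s2 < b.s2 || (borrow && a.s2 == b.s2)) ? 1 : 0;
    result.s2 = diff;

    diff = a.s3 - b.s3 - borrow;
    borrow = (a.s3 < b.s3 || (borrow && a.s3 == b.s3)) ? 1 : 0;
    result.s3 = diff;

    borrow_out = borrow;
    return result;
}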

Would replacing big_number_t with ulonglong4 or ulong4 offer any performance benefits, or is it mostly a matter of preference?

I appreciate any insights or experiences from those who have worked with large integers in CUDA. Thank you!


Solution

  • Some of this is already covered in comments below the question.

    ulong4 or ulonglong4

    The first thing I would suggest is to use your C++ knowledge and note that these vector types are not entirely opaque. On a standard Linux CUDA install, I find the file vector_types.h in /usr/local/cuda/include. It contains, for example, the following definitions:

    struct __device_builtin__ __builtin_align__(16) ulonglong4
    {
        unsigned long long int x, y, z, w;
    };
    
    struct __device_builtin__ __builtin_align__(16) ulong4
    {
        unsigned long int x, y, z, w;
    };
    

    They are basically identical except for the underlying base type: unsigned long long int vs. unsigned long int. On the 64-bit Linux platforms supported by CUDA, both of those types are 64 bits, so there is no practical difference between them. On Windows, however, unsigned long long int is a 64-bit unsigned integer whereas unsigned long int is a 32-bit unsigned integer. This is a peculiarity of Windows relative to 64-bit Linux; it is not unique or specific to CUDA, but CUDA does not deviate from it. So my guess would be that you want unsigned long long int, i.e. ulonglong4.
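
    If you want to convince yourself, a quick compile-time sanity check is easy to write. This snippet is purely illustrative:

    #include <vector_types.h>

    // ulonglong4 is four 64-bit limbs (256 bits) on every CUDA platform.
    static_assert(sizeof(ulonglong4) == 32, "expected 256 bits");
    // sizeof(ulong4) is also 32 on 64-bit Linux, but only 16 on Windows,
    // because unsigned long is 32 bits there.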

    Memory access efficiency

    I suggest reading this section of the programming guide. It covers several key concepts. The important thing to note about the above struct definitions is the __builtin_align__(16) directive. If you build your custom struct with a similar directive, e.g.:

    struct __align__(16) big_number_t { unsigned long long s0, s1, s2, s3; };
    

    there should be no functional or behavioral difference in terms of memory efficiency. I'm not going to cover all the concepts behind the CUDA requirement for natural alignment; they are covered in the section I linked. But from a memory access efficiency perspective, the align directive allows a properly written load of the structure to take place as two 128-bit loads rather than a larger number of, e.g., 64- or 32-bit loads.

    A bigger memory efficiency issue with the structure load is that loading 256 bits per thread (broken into two adjacent 128-bit loads) will possibly be less efficient than issuing adjacent 128-bit loads in adjacent threads. I have no doubt that this sort of efficiency is one of the things the CGBN designers were shooting for, and one of the reasons they chose to do the work with multiple threads per 256-bit element rather than a single thread per element. However, on a modern GPU I think the efficiency difference here may be small.

    Another reason to prefer multiple threads per element is if the scope of the work is too small. A GPU works best when many threads are engaged. If you worked on, say, 1024 of your big integers at once, in 1024 threads, that exposes only a small amount of parallelism to the GPU. If instead you can have, say, 8 threads work on each number (doing the work in 32-bit chunks), that works out much better from an exposed-parallelism perspective.

    That isn't so much a concern if you have a large amount of exposed parallelism (say, 1 million numbers to work on at once). If that is the case, I would say proceed with whatever path seems best, and only consider CGBN if performance is still a concern.
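
    As a concrete illustration of the alignment point, here is a minimal sketch (my own, not from the question) of loading an aligned struct through 128-bit vector accesses. big_number_t and bn_load128 are illustrative names, and you would want to verify that the generated SASS (e.g. via cuobjdump) actually shows 128-bit loads:

    struct __align__(16) big_number_t {
        unsigned long long s0, s1, s2, s3;
    };

    __device__ big_number_t bn_load128(const big_number_t* p) {
        // With 16-byte alignment, each ulonglong2 access can compile to a
        // single 128-bit load (LDG.E.128), so the 256-bit value takes two.
        const ulonglong2* q = reinterpret_cast<const ulonglong2*>(p);
        ulonglong2 lo = q[0];  // s0, s1
        ulonglong2 hi = q[1];  // s2, s3
        return big_number_t{lo.x, lo.y, hi.x, hi.y};
    }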

    Computational performance

    As discussed in the comments, there isn't any GPU that does integer arithmetic wider than 32 bits per SASS instruction, so I would suggest just proceeding with what you have in mind. The structure design isn't going to affect the math: NVIDIA GPUs compute integer operations on 32-bit quantities, and wider integer operations are built up from 32-bit instructions under the hood. By the time you actually get to doing arithmetic, you will not be operating on the structure directly, but on its components.

    If you want examples of well-crafted routines, these two answers (1, 2) may be of interest, but they deal with 128-bit integers, not 256-bit integers, so they would require extension of some sort. If that seems complicated: were it me, I would simply start out with a C++ realization of the arithmetic that makes sense to you, and only consider more exotic approaches if performance is a concern and the profiler steers you in a particular direction.
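
    For completeness, one idiomatic way to express the carry chain on the GPU is through PTX's extended-precision add instructions (add.cc / addc), which let the assembler track the carry flag rather than reconstructing it with comparisons. This is a sketch of my own, assuming the big_number_t layout from the question; verify the output of your compiler before relying on it:

    __device__ big_number_t bn_add_ptx(const big_number_t& a, const big_number_t& b,
                                       unsigned long long& carry_out) {
        big_number_t r;
        // add.cc sets the carry flag; addc consumes it (and .cc re-sets it).
        asm("add.cc.u64  %0, %5, %9;\n\t"
            "addc.cc.u64 %1, %6, %10;\n\t"
            "addc.cc.u64 %2, %7, %11;\n\t"
            "addc.cc.u64 %3, %8, %12;\n\t"
            "addc.u64    %4, 0, 0;"        // capture the final carry-out
            : "=l"(r.s0), "=l"(r.s1), "=l"(r.s2), "=l"(r.s3), "=l"(carry_out)
            : "l"(a.s0), "l"(a.s1), "l"(a.s2), "l"(a.s3),
              "l"(b.s0), "l"(b.s1), "l"(b.s2), "l"(b.s3));
        return r;
    }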