c++gccintrinsicsiccrdrand

RDRAND and RDSEED intrinsics on various compilers?


Does Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC does since 2012 / 2013?

#include <immintrin.h>  // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);

And if these intrinsics are supported, since which version are they supported (with compile-time-constant please)?


Solution

  • All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.
    Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library1 wrapper function instead of the intrinsic is a good choice. (Inline asm is not necessary, I wouldn't recommend it unless you want to play with it.)

    #include <immintrin.h>
    #include <stdint.h>
    
    // gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
    uint64_t rdrand64(){
        unsigned long long ret;   // not uint64_t, GCC/clang wouldn't compile.
        do{}while( !_rdrand64_step(&ret) );  // retry until success.
        return ret;
    }
    
    // and equivalent for _rdseed64_step
    // and 32 and 16-bit sizes with unsigned and unsigned short.
    

    Some compilers define __RDRND__ when the instruction is enabled at compile-time. GCC/clang since they supported the intrinsic at all, but only much later ICC (19.0). And with ICC, -march=ivybridge doesn't imply -mrdrnd or define __RDRND__ until 2021.1.
    ICX is LLVM-based and behaves like clang.
    MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.

    Why do{}while() instead of while(){}? Turns out ICC compiles to a less-dumb loop with do{}while(), not uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC.

    Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux for example, uint64_t is unsigned long, but GCC/clang's immintrin.h define int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
    But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.


    Compiler versions to support intrinsics

    Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. Support for _rdrand16_step / 32 / 64 were all introduced at the same time in any given compiler. 64 requires 64-bit mode.

    CPU gcc clang MSVC ICC
    rdrand Ivy Bridge / Excavator 4.6 3.2 before 2015 (19.10) before 13.0.1, but 19.0 for -mrdrnd defining __RDRND__. 2021.1 for -march=ivybridge to enable -mrdrnd
    rdseed Broadwell / Zen 1 9.1 7.0 before 2015 (19.10) before(?) 13.0.1, but 19.0 also added -mrdrnd and -mrdseed options)

    The earliest GCC and clang versions don't recognize -march=ivybridge only -mrdrnd. (GCC 4.9 and clang 3.6 for Ivy Bridge, not that you specifically want to use IvyBridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for CPUs you actually care about, or at least a -mtune= with a more recent CPU.)

    Intel's new oneAPI / ICX compilers all support rdrand/rdseed, and are based on LLVM internals so they work similarly to clang for CPU options. (It doesn't define __INTEL_COMPILER, which is good because it's different from ICC.)

    GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use -march=native if compiling for your own machine, or use -march=skylake or something to enable all the ISA extensions for the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only those functions need __attribute__((target("rdrnd"))) or rdseed, and won't be able to inline into functions with different target options. Or using a separately-compiled library would be easier1.

    Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.

    But if you are only going to run your code on some specific kind of CPU or later, gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)

    Wrappers that compile everywhere

    This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)

    The 64-bit versions are only defined for x86-64 targets, because asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds at all, not cluttering the ifdefs more than necessary. It does only define the rdseed wrappers if that's enabled at compile time, or for MSVC where there's no way to enable them or to detect it.

    There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options. rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.

    #include <immintrin.h>
    #include <stdint.h>
    
    #if defined(__x86_64__) || defined (_M_X64)
    // Figure out which 64-bit type the output arg uses
    #ifdef __INTEL_COMPILER       // Intel declares the output arg type differently from everyone(?) else
    // ICC for Linux declares rdrand's output as unsigned long, but must be long long for a Windows ABI
    typedef uint64_t intrin_u64;
    #else
    // GCC/clang headers declare it as unsigned long long even for Linux where long is 64-bit, but uint64_t is unsigned long and not compatible
    typedef unsigned long long intrin_u64;
    #endif
    
    //#if defined(__RDRND__) || defined(_MSC_VER)  // conditional definition if you want
    inline
    uint64_t rdrand64(){
        intrin_u64 ret;
        do{}while( !_rdrand64_step(&ret) );  // retry until success.
        return ret;
    }
    //#endif
    
    #if defined(__RDSEED__) || defined(_MSC_VER)
    inline
    uint64_t rdseed64(){
        intrin_u64 ret;
        do{}while( !_rdseed64_step(&ret) );   // retry until success.
        return ret;
    }
    #endif  // RDSEED
    #endif  // x86-64
    
    //__attribute__((target("rdrnd")))
    inline
    uint32_t rdrand32(){
        unsigned ret;      // Intel documents this as unsigned int, not necessarily uint32_t
        do{}while( !_rdrand32_step(&ret) );   // retry until success.
        return ret;
    }
    
    #if defined(__RDSEED__) || defined(_MSC_VER)
    //__attribute__((target("rdseed")))
    inline
    uint32_t rdseed32(){
        unsigned ret;      // Intel documents this as unsigned int, not necessarily uint32_t
        do{}while( !_rdseed32_step(&ret) );   // retry until success.
        return ret;
    }
    #endif
    

    The fact that Intel's intrinsics API is supported at all implies that unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or unsigned long if any compilers do that.

    On the Godbolt compiler explorer we can see how these compile. Clang and MSVC do what we'd expect, just a 2-instruction loop until rdrand leaves CF=1

    # clang 7.0 -O3 -march=broadwell    MSVC -O2 does the same.
    rdrand64():
    .LBB0_1:                                # =>This Inner Loop Header: Depth=1
            rdrand  rax
            jae     .LBB0_1      # synonym for jnc - jump if Not Carry
            ret
    
    # same for other functions.
    

    Unfortunately GCC is not so good, even current GCC12.1 makes weird asm:

    # gcc 12.1 -O3 -march=broadwell
    rdrand64():
            mov     edx, 1
    .L2:
            rdrand  rax
            mov     QWORD PTR [rsp-8], rax    # store into the red-zone where retval is allocated
            cmovc   eax, edx                  # materialize a 0 or 1  from CF. (rdrand zeros EAX when it clears CF=0, otherwise copy the 1)
            test    eax, eax                  # then test+branch on it
            je      .L2                       # could have just been jnc after rdrand
            mov     rax, QWORD PTR [rsp-8]     # reload retval
            ret
    
    rdseed64():
    .L7:
            rdseed  rax
            mov     QWORD PTR [rsp-8], rax   # dead store into the red-zone
            jnc     .L7
            ret
    

    ICC makes the same asm as long as we use a do{}while() retry loop; with a while() {} it's even worse, doing an rdrand and checking before entering the loop for the first time.


    Footnote 1: rdrand/rdseed library wrappers

    librdrand or Intel's libdrng have wrapper functions with retry loops like I showed, and ones that fill a buffer of bytes or array of uint32_t* or uint64_t*. (Consistently taking uint64_t*, no unsigned long long* on some targets).

    A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with __attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.

    libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.

    But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.


    Performance / internals of rdrand / rdseed

    For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.

    For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.

    Infinite retry count?

    The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when it put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop, that's (I assume) why the intrinsic has the word step in the name; it's one step of generating a single random number.

    In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.

    Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.

    But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.

    If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.

    Ivy Bridge can't pull data from the DRNG faster than it can keep up, according to the designer, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.

    @jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that was a thing.)

    Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.

    An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!

    If you did want a finite retry count, I'd suggest very high. Like maybe 1 million, so it takes maybe have a second or a second of spinning to give up on a broken CPU, with negligible chance of having one thread starve that long if it's repeatedly unlucky in contending for access to the internal queue.

    https://uops.info/ measured a throughput on Skylake of one per 3554 cycles on Skylake, one per 1352 on Alder Lake P-cores, 1230 on E-cores. One per 1809 cycles on Zen2. The Skylake version ran thousands of uops, the others were in the low double digits. Ivy Bridge had 110 cycle throughput, but in Haswell it was already up to 2436 cycles, but still a double-digit number of uops.

    These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:

    As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.

    Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.

    (Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing is running rdrand on multiple cores of the same hardware, since Coffee Lake is twice the uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)