I've been digging into the idea of "true" randomness, and I've noticed that modern CPUs have instructions for generating random numbers: x86-64 has the RDRAND instruction, while ARM has RNDR (I'm not interested in other archs at the moment).
On the other hand, many modern operating systems provide a syscall for generating randomness, e.g. getentropy() on POSIX, or RtlGenRandom/ProcessPrng on Windows, and those work regardless of the CPU.
I assume those syscalls exist as fallbacks, in case the CPU does not support randomness out of the box (in which case the operating system can approximate it by gathering noise from various sources). Or is there some other reason I should use syscalls even when CPU instructions exist for that purpose?
I'm mostly familiar with x86 rdrand, but most of what I wrote could apply to similar instructions on other ISAs, except for the performance details. The fact that old CPUs exist without rdrand support is certainly a reason not to use it directly in a lot of programs, as is the fact that it's x86-specific. But a C++ standard library implementation could use ISA-specific features, if available, for std::random_device.
One major reason to avoid direct use of rdrand is to not put all your eggs in one basket, a basket which can't be fully audited or verified to be free of patterns or other weaknesses. Even if we trust the vendors, industrial espionage is a hypothetical possibility.
And that's assuming CPUs are working as designed. It's also presumably possible for a CPU's HWRNG to break and always return failure status, so you need a fallback. (Or maybe not: the carry-flag indication of success/failure might only be used for temporary exhaustion in existing x86 designs.)
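The retry-then-fall-back pattern could look like the following minimal sketch, assuming x86-64 Linux with glibc 2.25+ for getrandom(); the function names are made up for illustration, and the retry count of 10 follows Intel's published guidance for transient RDRAND failure:

```c
#include <stdint.h>
#include <immintrin.h>
#include <cpuid.h>
#include <sys/types.h>
#include <sys/random.h>

// CPUID.1:ECX bit 30 indicates RDRAND support.
static int have_rdrand(void) {
    unsigned a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & (1u << 30));
}

__attribute__((target("rdrnd")))
static int rdrand64_bounded(uint64_t *out) {
    // Retry a few times on transient failure (CF=0), then give up
    // and treat the HWRNG as broken rather than spinning forever.
    for (int i = 0; i < 10; i++) {
        unsigned long long v;
        if (_rdrand64_step(&v)) { *out = v; return 1; }  // CF=1: success
    }
    return 0;
}

int get_random_u64(uint64_t *out) {
    if (have_rdrand() && rdrand64_bounded(out))
        return 0;
    // Fallback: the kernel's pool, which mixes several entropy sources.
    return getrandom(out, sizeof *out, 0) == (ssize_t)sizeof *out ? 0 : -1;
}
```

This way a missing or persistently-failing HWRNG degrades into a syscall instead of a hang.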
Worse, there have been CPUs with RDRAND bugs: the AMD Ryzen 3xxx series (Zen 2) always returned 0xFFFFFFFF until a microcode update fixed it! https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend/ - systemd using rdrand directly on Linux (to generate UUIDs very early in boot) caused lockups on systems with those CPUs. That wouldn't have happened if it had gotten its entropy through the kernel, which uses its own mixing functions and mixes in other sources of entropy (like the low bits of the TSC when interrupts happen).
AMD also recently had an RDSEED bug on Zen 5, where 10% of the time it would return 0 (while still indicating success with CF=1) if used with an operand-size other than 64-bit, rather than the expected 2^-32 or 2^-16 chance of a 0 output.
That was at least under certain microarchitectural conditions, so maybe not something that could have been caught as easily as their Zen 2 always-0xFFFFFFFF bug with a simple statistical-quality test.
You'd hope that AMD won't keep having these bugs after two such incidents, but you never know when another will happen, from any vendor. Counting non-x86 ISAs, there are quite a few vendors. And in a VM, if such an instruction involves a VM-exit on any ISA, buggy VM software could cause a problem.
x86 rdrand is also very slow on some CPUs whose microcode was updated to work around side channels that might let other processes see what randomness is in the queue. It was never fast, but on those systems it's almost as slow as a system call (thousands of cycles). See Phoronix: "RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation".
See also the performance section at the bottom of my answer on "RDRAND and RDSEED intrinsics on various compilers?", which summarizes some cycle counts from https://uops.info/ (which tested the instructions in long-running loops; one-off calls might behave differently because of buffering).
Still, on some CPUs rdrand is faster than a syscall, so it can be the better choice if you need lots of true randomness, hopefully including newer CPUs that mitigate these problems in hardware instead of with microcode workarounds that flush buffers.
But even with fast rdrand, it may be about equal to or slower than getting a larger buffer from the kernel and working through that, to amortize the syscall overhead.
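The buffer-amortization idea can be sketched like this, assuming Linux getrandom(); the 4 KiB pool size and the function name are arbitrary choices for illustration, and it's not thread-safe as written:

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/random.h>

static unsigned char pool[4096];
static size_t pool_left = 0;   // unread bytes remaining in pool

int buffered_random(void *out, size_t n) {
    unsigned char *dst = out;
    while (n > 0) {
        if (pool_left == 0) {
            // One syscall refills enough bytes for many future requests.
            ssize_t got = getrandom(pool, sizeof pool, 0);
            if (got <= 0)
                return -1;     // e.g. interrupted by a signal
            pool_left = (size_t)got;
        }
        size_t take = n < pool_left ? n : pool_left;
        // Consume from the end of the pool; never hand out bytes twice.
        memcpy(dst, pool + pool_left - take, take);
        pool_left -= take;
        dst += take;
        n -= take;
    }
    return 0;
}
```

With a 4 KiB pool, a program pulling 8 bytes at a time pays the syscall cost once per ~512 requests instead of every time.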
If you need to seed your own PRNG, you should be using rdseed instead, which is somewhat slower than rdrand, at least in sustained throughput.
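A sketch of rdseed-based seeding might look like this, with CPUID detection and a bounded retry loop, since rdseed fails transiently much more often than rdrand; the function names are made up, and a caller would still want a fallback (e.g. the kernel) when it returns 0:

```c
#include <stdint.h>
#include <immintrin.h>
#include <cpuid.h>

__attribute__((target("rdseed")))
static int try_rdseed64(uint64_t *out) {
    unsigned long long v;
    if (_rdseed64_step(&v)) { *out = v; return 1; }  // CF=1: success
    return 0;
}

// Returns 1 on success, 0 if unsupported or persistently failing.
int seed_u64(uint64_t *out) {
    unsigned a, b, c, d;
    // RDSEED support: CPUID.(EAX=7,ECX=0):EBX bit 18
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(b & (1u << 18)))
        return 0;
    for (int i = 0; i < 1000; i++) {
        if (try_rdseed64(out))
            return 1;
        _mm_pause();  // brief backoff before retrying
    }
    return 0;
}
```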
A trivial rdrand infinite-retry loop is good for code-size, being very compact compared to a library function call: just `1: rdrand rax` / `jnc 1b`, which is 6 bytes on x86-64.
No arg setup, and the compiler can keep values live in call-clobbered registers, unlike across a function call. (Inlining a syscall instruction for Linux getrandom would give the same lack-of-clobber benefit, but that's not the normal way to do things on GNU/Linux, and definitely not on Windows. And getrandom takes 3 args, puts the result in memory, and you should check its return value...)
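That 6-byte loop could be written as GNU C inline asm like the sketch below; it assumes RDRAND support has already been verified (hence the helper) and that failure is only ever transient, which is exactly where the Zen 2 caveat above would turn it into a hang:

```c
#include <stdint.h>
#include <cpuid.h>

// CPUID.1:ECX bit 30 = RDRAND; check before using the loop below.
static int rdrand_supported(void) {
    unsigned a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & (1u << 30));
}

static inline uint64_t rdrand64_loop(void) {
    uint64_t v;
    __asm__ volatile("1: rdrand %0\n\t"
                     "jnc 1b"            // CF=0 means failure: retry forever
                     : "=r"(v)
                     : /* no inputs */
                     : "cc");
    return v;
}
```

Being `static inline` with no memory operands, this keeps the compiler free to allocate surrounding values in any register, unlike a real call.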