RDRAND and RDSEED intrinsics on various compilers?

Do the Intel C++ compiler and/or GCC support the following Intel intrinsics, like MSVC has since 2012 / 2013?

#include <immintrin.h>  // for the following intrinsics
int _rdrand16_step(uint16_t*);
int _rdrand32_step(uint32_t*);
int _rdrand64_step(uint64_t*);
int _rdseed16_step(uint16_t*);
int _rdseed32_step(uint32_t*);
int _rdseed64_step(uint64_t*);

And if these intrinsics are supported, since which version are they supported (with a compile-time constant to check, please)?

Wetzell answered 31/3, 2015 at 15:49 Comment(1)
Clang may have tied RDSEED to AVX2. Also see Add RDSEED intrinsic support defined in AVX2 extension. I can't seem to get RDSEED to engage with -mrdseed in Clang 6.0...Packard

All the major compilers support Intel's intrinsics for rdrand and rdseed via <immintrin.h>.
Somewhat recent versions of some compilers are needed for rdseed, e.g. GCC9 (2019) or clang7 (2018), although those have been stable for a good while by now. If you'd rather use an older compiler, or not enable ISA-extension options like -march=skylake, a library1 wrapper function instead of the intrinsic is a good choice. (Inline asm is not necessary, I wouldn't recommend it unless you want to play with it.)

#include <immintrin.h>
#include <stdint.h>

// gcc -march=native or haswell or znver1 or whatever, or manually enable -mrdrnd
uint64_t rdrand64(){
    unsigned long long ret;   // not uint64_t; GCC/clang declare the intrinsic with unsigned long long*, so uint64_t wouldn't compile on Linux.
    do{}while( !_rdrand64_step(&ret) );  // retry until success.
    return ret;
}

// and equivalent for _rdseed64_step
// and 32 and 16-bit sizes with unsigned and unsigned short.

Some compilers define __RDRND__ when the instruction is enabled at compile time. GCC/clang have defined it ever since they first supported the intrinsic; ICC only did so much later (19.0). And with ICC, -march=ivybridge didn't imply -mrdrnd or define __RDRND__ until 2021.1.
ICX is LLVM-based and behaves like clang.
MSVC doesn't define any macros; its handling of intrinsics is designed around runtime feature detection only, unlike gcc/clang where the easy way is compile-time CPU feature options.
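If you do want runtime detection (the approach MSVC's design assumes), a minimal CPUID sketch like the following works. It assumes the architectural feature bits (CPUID.01H:ECX[30] for RDRAND, CPUID.(EAX=07H,ECX=0):EBX[18] for RDSEED) and the compilers' usual CPUID helpers (__get_cpuid / __get_cpuid_count from GCC/clang's <cpuid.h>, __cpuid / __cpuidex from MSVC's <intrin.h>); a hardened version would also verify the maximum supported CPUID leaf before querying leaf 7 on MSVC (the GNU helpers do that check for you).

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// CPUID.01H:ECX bit 30 = RDRAND;  CPUID.(EAX=07H,ECX=0):EBX bit 18 = RDSEED
static int cpu_has_rdrand(void)
{
#ifdef _MSC_VER
    int regs[4];
    __cpuid(regs, 1);               // regs = {EAX, EBX, ECX, EDX}
    return (regs[2] >> 30) & 1;
#else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return (ecx >> 30) & 1;
#endif
}

static int cpu_has_rdseed(void)
{
#ifdef _MSC_VER
    int regs[4];
    __cpuidex(regs, 7, 0);          // should first confirm max leaf >= 7 via __cpuid(regs, 0)
    return (regs[1] >> 18) & 1;
#else
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (ebx >> 18) & 1;
#endif
}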

Why do{}while() instead of while(){}? It turns out ICC compiles do{}while() to a less-dumb loop, without uselessly peeling a first iteration. Other compilers don't benefit from that hand-holding, and it's not a correctness problem for ICC either way.

Why unsigned long long instead of uint64_t? The type has to agree with the pointer type expected by the intrinsic, or C and especially C++ compilers will complain, regardless of the object-representations being identical (64-bit unsigned). On Linux, for example, uint64_t is unsigned long, but GCC/clang's immintrin.h declares int _rdrand64_step(unsigned long long*), same as on Windows. So you always need unsigned long long ret with GCC/clang. MSVC is a non-problem as it can (AFAIK) only target Windows, where unsigned long long is the only 64-bit unsigned type.
But ICC defines the intrinsic as taking unsigned long* when compiling for GNU/Linux, according to my testing on https://godbolt.org/. So to be portable to ICC, you actually need #ifdef __INTEL_COMPILER; even in C++ I don't know a way to use auto or other type-deduction to declare a variable that matches it.


Compiler versions to support intrinsics

Tested on Godbolt; its earliest version of MSVC is 2015, and ICC 2013, so I can't go back any further. The _rdrand16_step / 32 / 64 intrinsics were all introduced at the same time in any given compiler; the 64-bit one requires 64-bit mode.

  • rdrand (new in Ivy Bridge / Excavator): gcc 4.6, clang 3.2, MSVC before 2015 (19.10), ICC before 13.0.1 (but only 19.0 made -mrdrnd define __RDRND__, and only 2021.1 made -march=ivybridge enable -mrdrnd)
  • rdseed (new in Broadwell / Zen 1): gcc 9.1, clang 7.0, MSVC before 2015 (19.10), ICC before(?) 13.0.1 (19.0 also added the -mrdrnd and -mrdseed options)

The earliest GCC and clang versions don't recognize -march=ivybridge, only -mrdrnd. (GCC 4.9 and clang 3.6 added -march=ivybridge; not that you specifically want to target Ivy Bridge if modern CPUs are more relevant. So use a non-ancient compiler and set a CPU option appropriate for the CPUs you actually care about, or at least a -mtune= for a more recent CPU.)

Intel's newer oneAPI / ICX compilers all support rdrand/rdseed, and are based on LLVM internals, so they work like clang for CPU options. (ICX doesn't define __INTEL_COMPILER, which is good because it behaves differently from ICC.)

GCC and clang only let you use intrinsics for instructions you've told the compiler the target supports. Use -march=native if compiling for your own machine, or -march=skylake or similar to enable all the ISA extensions of the CPU you're targeting. But if you need your program to run on old CPUs and only use RDRAND or RDSEED after runtime detection, only the functions that use the intrinsics need __attribute__((target("rdrnd"))) or ("rdseed")), and they won't be able to inline into functions built with different target options (see the sketch below). Or using a separately-compiled library would be easier1.
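A minimal sketch of that per-function approach, assuming GNU C and a compiler new enough that immintrin.h declares the intrinsic even without -mrdrnd on the command line (the function name here is made up):

#include <immintrin.h>

// rdrnd is enabled just for this function; the rest of the translation unit
// can be built for a generic baseline.  Only call it after a runtime CPUID check.
__attribute__((target("rdrnd")))
static unsigned long long rdrand64_after_runtime_check(void)
{
    unsigned long long ret;
    do{}while( !_rdrand64_step(&ret) );   // retry until success
    return ret;
}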

  • -mrdrnd: enabled by -march=ivybridge or -march=znver1 (or bdver4 Excavator APUs) and later
  • -mrdseed: enabled by -march=broadwell or -march=znver1 or later

Normally if you're going to enable one CPU feature, it makes sense to enable others that CPUs of that generation will have, and to set tuning options. But rdrand isn't something the compiler will use on its own (unlike BMI2 shlx for more efficient variable-count shifts, or AVX/SSE for auto-vectorization and array/struct copying and init). So enabling -mrdrnd globally likely won't make your program crash on pre-Ivy Bridge CPUs, if you check CPU features and don't actually run code that uses _rdrand64_step on CPUs without the feature.

But if you are only going to run your code on some specific kind of CPU or later, gcc -O3 -march=haswell is a good choice. (-march also implies -mtune=haswell, and tuning for Ivy Bridge specifically is not what you want for modern CPUs. You could -march=ivybridge -mtune=skylake to set an older baseline of CPU features, but still tune for newer CPUs.)

Wrappers that compile everywhere

This is valid C++ and C. For C, you probably want static inline instead of inline so you don't need to manually instantiate an extern inline version in a .c in case a debug build decided not to inline. (Or use __attribute__((always_inline)) in GNU C.)

The 64-bit versions are only defined for x86-64 targets, because the asm instructions can only use 64-bit operand-size in 64-bit mode. I didn't #ifdef __RDRND__ or #if defined(__i386__)||defined(__x86_64__), on the assumption that you'd only include this for x86(-64) builds in the first place, to avoid cluttering the ifdefs more than necessary. It does, however, only define the rdseed wrappers if that's enabled at compile time, or for MSVC, where there's no way to enable it or to detect it at compile time.

There are some commented __attribute__((target("rdseed"))) examples you can uncomment if you want to do it that way instead of compiler options. rdrand16 / rdseed16 are intentionally omitted as not being normally useful. rdrand runs the same speed for different operand-sizes, and even pulls the same amount of data from the CPU's internal RNG buffer, optionally throwing away part of it for you.

#include <immintrin.h>
#include <stdint.h>

#if defined(__x86_64__) || defined (_M_X64)
// Figure out which 64-bit type the output arg uses
#ifdef __INTEL_COMPILER       // Intel declares the output arg type differently from everyone(?) else
// ICC for Linux declares rdrand's output as unsigned long*; uint64_t matches that, and matches unsigned long long on Windows
typedef uint64_t intrin_u64;
#else
// GCC/clang headers declare it as unsigned long long even for Linux where long is 64-bit, but uint64_t is unsigned long and not compatible
typedef unsigned long long intrin_u64;
#endif

//#if defined(__RDRND__) || defined(_MSC_VER)  // conditional definition if you want
inline
uint64_t rdrand64(){
    intrin_u64 ret;
    do{}while( !_rdrand64_step(&ret) );  // retry until success.
    return ret;
}
//#endif

#if defined(__RDSEED__) || defined(_MSC_VER)
inline
uint64_t rdseed64(){
    intrin_u64 ret;
    do{}while( !_rdseed64_step(&ret) );   // retry until success.
    return ret;
}
#endif  // RDSEED
#endif  // x86-64

//__attribute__((target("rdrnd")))
inline
uint32_t rdrand32(){
    unsigned ret;      // Intel documents this as unsigned int, not necessarily uint32_t
    do{}while( !_rdrand32_step(&ret) );   // retry until success.
    return ret;
}

#if defined(__RDSEED__) || defined(_MSC_VER)
//__attribute__((target("rdseed")))
inline
uint32_t rdseed32(){
    unsigned ret;      // Intel documents this as unsigned int, not necessarily uint32_t
    do{}while( !_rdseed32_step(&ret) );   // retry until success.
    return ret;
}
#endif

The fact that Intel's intrinsics API is supported at all implies that unsigned int is a 32-bit type, regardless of whether uint32_t is defined as unsigned int or as unsigned long (if any compiler does that).
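A quick usage sketch, assuming the wrappers above live in a header (the header name and the printf formatting are just for illustration):

#include <stdio.h>
#include <inttypes.h>
//#include "rdrand_wrappers.h"   // wherever you put the wrappers above

int main(void)
{
    uint64_t r = rdrand64();
    uint32_t s = rdrand32();
    printf("rdrand64: %" PRIu64 "\nrdrand32: %" PRIu32 "\n", r, s);
    return 0;
}

Build with the relevant CPU options (e.g. gcc -O2 -march=haswell or -mrdrnd), or only call the wrappers after a runtime check as discussed above.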

On the Godbolt compiler explorer we can see how these wrappers compile. Clang and MSVC do what we'd expect: just a 2-instruction loop until rdrand leaves CF=1.

# clang 7.0 -O3 -march=broadwell    MSVC -O2 does the same.
rdrand64():
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        rdrand  rax
        jae     .LBB0_1      # synonym for jnc - jump if Not Carry
        ret

# same for other functions.

Unfortunately GCC is not so good, even current GCC12.1 makes weird asm:

# gcc 12.1 -O3 -march=broadwell
rdrand64():
        mov     edx, 1
.L2:
        rdrand  rax
        mov     QWORD PTR [rsp-8], rax    # store into the red-zone where retval is allocated
        cmovc   eax, edx                  # materialize a 0 or 1  from CF. (rdrand zeros EAX when it clears CF=0, otherwise copy the 1)
        test    eax, eax                  # then test+branch on it
        je      .L2                       # could have just been jnc after rdrand
        mov     rax, QWORD PTR [rsp-8]     # reload retval
        ret

rdseed64():
.L7:
        rdseed  rax
        mov     QWORD PTR [rsp-8], rax   # dead store into the red-zone
        jnc     .L7
        ret

ICC makes similarly clunky asm as long as we use a do{}while() retry loop; with a while(){} it's even worse, doing an rdrand and a check before entering the loop for the first time.


Footnote 1: rdrand/rdseed library wrappers

librdrand or Intel's libdrng have wrapper functions with retry loops like the one I showed, and ones that fill a buffer of bytes or an array of uint32_t or uint64_t elements. (They consistently take uint64_t*, not unsigned long long* on some targets.)

A library is also a good choice if you're doing runtime CPU feature detection, so you don't have to mess around with __attribute__((target)) stuff. However you do it, that limits inlining of a function using the intrinsics anyway, so a small static library is equivalent.

libdrng also provides RdRand_isSupported() and RdSeed_isSupported(), so you don't need to do your own CPUID check.

But if you're going to build with -march= something newer than Ivy Bridge / Broadwell or Excavator / Zen1 anyway, inlining a 2-instruction retry loop (like clang compiles it to) is about the same code-size as a function call-site, but doesn't clobber any registers. rdrand is quite slow so that's probably not a big deal, but it also means no extra library dependency.
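If you'd rather roll your own tiny library-style helper than depend on librdrand / libdrng, a buffer-filling wrapper in the same spirit could look like the sketch below (the name and interface are made up here, not libdrng's API; build it in a file compiled with -mrdrnd, for x86-64):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Fill n elements with RDRAND output, retrying each step until it succeeds
// (infinite retry, as recommended later in this answer).
void rdrand_fill_u64(uint64_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned long long val;              // the type GCC/clang's intrinsic expects
        do{}while( !_rdrand64_step(&val) );  // retry until success
        dst[i] = val;
    }
}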


Performance / internals of rdrand / rdseed

For more details about the HW internals on Intel (not AMD's version), see Intel's docs. For the actual TRNG logic, see Understanding Intel's Ivy Bridge Random Number Generator - it's a metastable latch that settles to 0 or 1 due to thermal noise. Or at least Intel says it is; it's basically impossible to truly verify where the rdrand bits actually come from in a CPU you bought. Worst case, still much better than nothing if you're mixing it with other entropy sources, like Linux does for /dev/random.

For more on the fact that there's a buffer that cores pull from, see some SO answers from the engineer who designed the hardware and wrote librdrand, such as this and this about its exhaustion / performance characteristics on Ivy Bridge, the first generation to feature it.

Infinite retry count?

The asm instructions set the carry flag (CF) = 1 in FLAGS on success, when they put a random number in the destination register. Otherwise CF=0 and the output register = 0. You're intended to call it in a retry loop; that's (I assume) why the intrinsic has the word step in the name: it's one step of generating a single random number.

In theory, a microcode update could change things so it always indicates failure, e.g. if a problem is discovered in some CPU model that makes the RNG untrustworthy (by the standards of the CPU vendor). The hardware RNG also has some self-diagnostics, so it's in theory possible for a CPU to decide that the RNG is broken and not produce any outputs. I haven't heard of any CPUs ever doing this, but I haven't gone looking. And a future microcode update is always possible.

Either of these could lead to an infinite retry loop. That's not great, but unless you want to write a bunch of code to report on that situation, it's at least an observable behaviour that users could potentially deal with in the unlikely event it ever happened.

But occasional temporary failure is normal and expected, and must be handled. Preferably by retrying without telling the user about it.

If there wasn't a random number ready in its buffer, the CPU can report failure instead of stalling this core for potentially even longer. That design choice might be related to interrupt latency, or just keeping it simpler without having to build retrying into the microcode.

According to the designer, Ivy Bridge cores can't pull data from the DRNG faster than it can refill, even with all cores looping rdrand, but later CPUs can. Therefore it is important to actually retry.

@jww has had some experience with deploying rdrand in libcrypto++, and found that with a retry count set too low, there were reports of occasional spurious failure. He's had good results from infinite retries, which is why I chose that for this answer. (I suspect he would have heard reports from users with broken CPUs that always fail, if that was a thing.)

Intel's library functions that include a retry loop take a retry count. That's likely to handle the permanent-failure case which, as I said, I don't think happens in any real CPUs yet. Without a limited retry count, you'd loop forever.

An infinite retry count allows a simple API returning the number by value, without silly limitations like OpenSSL's functions that use 0 as an error return: they can't randomly generate a 0!

If you did want a finite retry count, I'd suggest a very high one, like maybe 1 million, so it takes maybe half a second or a second of spinning to give up on a broken CPU, with negligible chance of one thread starving that long just from being repeatedly unlucky in contending for access to the internal queue.
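As a sketch of that suggestion (the cap and the function name are arbitrary choices for illustration):

#include <immintrin.h>
#include <stdint.h>

#define RDRAND_MAX_RETRIES 1000000u   // arbitrary, very high cap

// Returns 1 and writes a random number on success; returns 0 only if RDRAND
// failed RDRAND_MAX_RETRIES times in a row, i.e. the HW RNG is likely broken.
static inline int rdrand64_checked(uint64_t *out)
{
    for (unsigned i = 0; i < RDRAND_MAX_RETRIES; i++) {
        unsigned long long val;
        if (_rdrand64_step(&val)) {
            *out = val;
            return 1;
        }
    }
    return 0;
}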

https://uops.info/ measured rdrand throughput of one per 3554 cycles on Skylake, one per 1352 cycles on Alder Lake P-cores and one per 1230 on its E-cores, and one per 1809 cycles on Zen 2. The Skylake version ran thousands of uops; the others were in the low double digits. Ivy Bridge had one per 110 cycle throughput, but Haswell was already up to one per 2436 cycles, still with a double-digit number of uops.

These abysmal performance numbers on recent Intel CPUs are probably due to microcode updates to work around problems that weren't anticipated when the HW was designed. Agner Fog measured one per 460 cycle throughput for rdrand and rdseed on Skylake when it was new, each costing 16 uops. The thousands of uops are probably extra buffer flushing hooked into the microcode for those instructions by recent updates. Agner measured Haswell at 17 uops, 320 cycles when it was new. See RdRand Performance As Bad As ~3% Original Speed With CrossTalk/SRBDS Mitigation on Phoronix:

As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.

Locking the memory bus sounds like it could hurt performance even of other cores, if it's like cache-line splits for locked instructions.

(Those cycle numbers are core clock cycle counts; if the DRNG doesn't run on the same clock as the core, those might vary by CPU model. I wonder if uops.info's testing ran rdrand on multiple cores of the same hardware, since Coffee Lake measures as twice as many uops as Skylake, and 1.4x as many cycles per random number. Unless that's just higher clocks leading to more microcode retries?)

Hypogenous answered 16/5, 2022 at 21:40 Comment(4)
This could probably use some proof-reading for sentence structure and stray words; it got pretty long and I didn't go back and read through the whole thing. Feel free to edit for such mistakes, or let me know.Hypogenous
what to use for RDRAND on apple silicon?Gambetta
@instantlink: No idea. This is an x86 question. support.apple.com/en-gb/guide/security/seca0c73a75b/web mentions the entropy sources used by apple OSes, and doesn't mention anything that sounds like an rdrand instruction for non-Intel hardware. I wonder if the x86 emulated by Rosetta 2 has rdrand or not, and if so what it uses. Probably it emulates an x86 without the rdrand feature, and the instruction would fault. So probably you should just make a getentropy system call. Or on bare metal, apparently there's a "Secure Enclave hardware TRNG", probably accessible with MMIO.Hypogenous
Thank you. stackoverflow.com/questions/75177059/… - gives more detail tooGambetta

Both GCC and the Intel compiler support them. GCC support was introduced at the end of 2010. They require the <immintrin.h> header.

GCC support has been present since at least version 4.6, but there doesn't seem to be any specific compile-time constant - you can just check __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6).
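For example, a minimal compile-time gate along those lines (also requiring __RDRND__, per the comment below; HAVE_RDRAND_INTRINSICS is a hypothetical macro for this sketch):

#if defined(__RDRND__) && defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6))
#include <immintrin.h>
#define HAVE_RDRAND_INTRINSICS 1
#endif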

Razz answered 31/3, 2015 at 18:3 Comment(3)
thank you for the answer, what you provided was exactly the kind of compile-time-constant I needed. Apparently it looks like you'd require 4.8 for rdseed. Do you know the versioning-compile-constant for Intel XE Composer Studio 2013 Update 1 (introduction of rdseed)?Wetzell
You also need to check for the preprocessor definition __RDRND__. Without it, the intrinsics will not be available (even if the CPU supports it). You may need to compile with -mrdrnd to ensure __RDRND__ (even if the CPU supports it). You can also use -mrdrnd when the CPU does not support the instruction. Also see Clang Bug 25152 - RDRAND intrinsic and "error: clang frontend command failed with exit code 70".Packard
Or better, -march=native if building on the target machine.Hypogenous

The Microsoft compiler does not have intrinsic support for the RDSEED and RDRAND instructions.

But you can implement these instructions using NASM or MASM. Assembly code is available at:

https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
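As an alternative to a separate NASM/MASM file, a minimal GNU C inline-asm sketch of the same idea looks like the following (my own illustration, not code from Intel's guide; GCC/clang only, 64-bit mode):

// Returns 1 on success (CF=1) and writes a random value to *out; 0 on failure.
static inline int rdrand64_asm(unsigned long long *out)
{
    unsigned char ok;
    __asm__ volatile ("rdrand %0; setc %1"
                      : "=r"(*out), "=qm"(ok)
                      :
                      : "cc");
    return ok;
}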

For the Intel compiler, you can use predefined macros to determine the version and sub-version:

__INTEL_COMPILER //Major Version
__INTEL_COMPILER_UPDATE // Minor Update.

For instance, with the ICC 15.0 Update 3 compiler, you will have

__INTEL_COMPILER  = 1500
__INTEL_COMPILER_UPDATE = 3
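So a compile-time gate for RDSEED support in ICC could look like this sketch (the 1300 / Update 2 threshold comes from the comments below, so treat it as an assumption; HAVE_ICC_RDSEED is a hypothetical macro):

#if defined(__INTEL_COMPILER) && \
    (__INTEL_COMPILER > 1300 || \
     (__INTEL_COMPILER == 1300 && __INTEL_COMPILER_UPDATE >= 2))
#define HAVE_ICC_RDSEED 1
#endif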

For further details on pre-defined macros you can go to: https://software.intel.com/en-us/node/524490

Footstool answered 6/7, 2015 at 22:27 Comment(7)
So a __INTEL_COMPILER_BUILD_DATE >= 20121023 would actually include any compiler being newer than Intel XE composer studio 2013 Update 1? Src1 Src2Wetzell
Just found this presentation. Looks like I need 1300 and 2 for the two parameters (RDSEED). And I need 1200 and 2 for RDRAND :) SrcWetzell
... AND: MS compiler does have intrinsic support for RDRAND (VS2012) and RDSEED (2013). I verified this by finding the correct intrinsics in my header files.Wetzell
@SEJM do you have any documentation on MS supporting RDRAND and RDSEED??Footstool
Of course I do :) The official MSDN documentation lists _rdrandXX_step() and _rdseedXX_step() for x86 and x64. Make sure to check the other versions to believe me VS2012 indeed didn't have RDSEED.Wetzell
@Wetzell - According to Intel, their 12.1 compiler supports RDRAND. From the article: "... Intel® Compiler (starting with version 12.1), Microsoft Visual Studio 2012, and GCC* 4.6 support the RDRAND instruction."*Packard
@Wetzell - "The official MSDN documentation..." - be careful here. Microsoft uses Intel intrinsics, and they only support Intel CPUs, and not AMD CPUs. In fact, Microsoft states it somewhere.Packard
