What is the latency and throughput of the RDRAND instruction on Ivy Bridge?

I cannot find any info on agner.org on the latency or throughput of the RDRAND instruction. However, this processor exists, so the information must be out there.

Edit: Actually the newest optimization manual mentions this instruction. It is documented as taking <200 cycles, with a total bandwidth of at least 500 MB/s on Ivy Bridge. But some more in-depth statistics on this instruction would be great, since its latency and throughput are variable.

Nara answered 7/5, 2012 at 14:49 Comment(5)
I don't know the answer without running a benchmark, but as an interested party may I ask "How fast do you want it to be?" I.e. what apps need lots of RDRANDs? By the way, there are two separate questions here: (a) how fast the instruction is, in terms of latency and throughput, but also (b) can it be read faster than the entropy pool accumulates? I.e. can you exhaust the entropy pool and just be running off pseudo-random numbers?Laurentia
The only reason I can think of why anyone would care is to decide whether to use RDRAND directly or through a PRNG. You'll get the same observable behavior in both cases, but one might be significantly faster than the other, and it's not immediately obvious which one that would be. (KrazyGlew: Your b is kind of irrelevant. It's like asking how much Holy water you get before it switches to water. There is no detectable difference between the two, and the distinction is essentially meaningless in this context.)Insatiable
@KrazyGlew A use-case is generating random numbers for statistical sampling on a GPU.Nara
Related: Is there any legitimate use for Intel's RDRAND? has a benchmark against a std::mt19937 PRNG. If anything, RDRAND is probably slower than in that test, because they don't use the result (which is problematic in asm as David's answer explains).Streetlight
Agner's testing includes RDRAND numbers now. IvB throughput: one per 104-117 clocks. SKL throughput: one per ~460 clocks. (But presumably this is dependent on core clock speed, if the DRNG runs at a constant clock. Still, Agner tested on an i7-3770k, so the IvB shouldn't have been clocked extremely low, making RDRAND look fast. Unless it was at idle clock speed? Or maybe his testing didn't use the result either, and IvB squashed the "dead" uops better than SKL.)Streetlight

I wrote librdrand. It's a very basic set of routines to use the RdRand instruction to fill buffers with random numbers.

The performance data we showed at IDF is from test software I wrote that spawns a number of threads using pthreads in Linux. Each thread fills a memory buffer with random numbers using RdRand. The program measures the average speed and can iterate while varying the number of threads.

Since there is a round trip communications latency from each core to the shared DRNG and back that is longer than the time needed to generate a random number at the DRNG, the average performance obviously increases as you add threads, up until the maximum throughput is reached. The physical maximum throughput of the DRNG on IVB is 800MBytes/s. A 4 core IVB with 8 threads manages something of the order of 780Mbytes/s. With fewer threads and cores, lower numbers are achieved. The 500MB/s number is somewhat conservative, but when you're trying to make honest performance claims, you have to be.
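
As an illustration only, here is a minimal sketch of that kind of multi-threaded fill test (not the IDF test software or librdrand itself), assuming GCC on Linux with -mrdrnd and the _rdrand64_step intrinsic:

    /* Sketch of a multi-threaded RdRand fill benchmark.
     * Build: gcc -O2 -mrdrnd -pthread rdrand_bench.c -o rdrand_bench */
    #include <immintrin.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BUF_QWORDS (1 << 20)            /* 8 MiB of random data per thread */

    static void *fill_buffer(void *arg)
    {
        (void)arg;
        uint64_t *buf = malloc(BUF_QWORDS * sizeof *buf);
        if (!buf)
            return NULL;
        for (size_t i = 0; i < BUF_QWORDS; i++) {
            unsigned long long r;
            while (!_rdrand64_step(&r))     /* retry on the rare underflow */
                ;
            buf[i] = r;                     /* store it so the result is actually used */
        }
        free(buf);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = argc > 1 ? atoi(argv[1]) : 4;
        if (nthreads < 1) nthreads = 1;
        if (nthreads > 64) nthreads = 64;

        pthread_t tid[64];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, fill_buffer, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double mb  = (double)nthreads * BUF_QWORDS * 8 / 1e6;
        printf("%d threads: %.1f MB in %.3f s = %.1f MB/s\n", nthreads, mb, sec, mb / sec);
        return 0;
    }

Running it with 1, 2, 4 and 8 threads should reproduce the scaling behaviour described above, up to the DRNG's throughput ceiling.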

Since the DRNG runs at a fixed frequency (800MHz) while the core frequencies may vary, the number of core clock cycles per RdRand varies, depending on the core frequency and the number of other cores simultaneously accessing the DRNG. The curves given in the IDF presentation are a realistic representation of what to expect. The total performance is affected a little by core clock frequency, but not much. The number of threads is what dominates.

One should be careful when measuring RdRand performance to actually 'use' the RdRand result. If you don't, i.e. if you run RdRand R6, RdRand R6, ..., RdRand R6 many times in a row, the performance reads as artificially high. Since the data isn't used before it is overwritten, the CPU pipeline doesn't wait for the data to come back from the DRNG before issuing the next instruction. The tests we wrote write the resulting data to memory that will be in on-chip cache, so the pipeline stalls waiting for the data. That is also why hyperthreading is so much more effective with RdRand than with other sorts of code.
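
For example, a minimal sketch of the two measurement patterns (assuming GCC/Clang extended asm on x86-64; real code should also check the carry flag for success):

    #include <stdint.h>

    /* Pitfall: the result register is overwritten without ever being read, so the
     * CPU can squash the "dead" rdrand uops and the loop looks artificially fast. */
    uint64_t rdrand_discard(long n)
    {
        uint64_t r = 0;
        for (long i = 0; i < n; i++)
            asm volatile("rdrand %0" : "=r"(r) : : "cc");
        return 0;
    }

    /* Correct measurement: each result is XORed into an accumulator, so the
     * pipeline has to wait for the data to come back from the DRNG. */
    uint64_t rdrand_consume(long n)
    {
        uint64_t r, acc = 0;
        for (long i = 0; i < n; i++) {
            asm volatile("rdrand %0" : "=r"(r) : : "cc");
            acc ^= r;
        }
        return acc;
    }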

The details of the specific platform, clock speed, Linux version and GCC version were given in the IDF slides. I don't remember the numbers off the top of my head. There are chips available that are slower and chips available that are faster. The number we gave for <200 cycles per instruction is based on measurements of about 150 core cycles per instruction.

The chips are available now, so anyone well versed in the use of rdtsc can do the same sort of test.
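
For instance, a rough sketch of such a test (assuming GCC with -mrdrnd; note that __rdtsc() counts reference cycles, so on a core whose clock varies this is only an approximation of core cycles):

    #include <immintrin.h>
    #include <x86intrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        enum { N = 1 << 20 };
        unsigned long long r, sink = 0;

        uint64_t t0 = __rdtsc();
        for (int i = 0; i < N; i++) {
            while (!_rdrand64_step(&r))   /* retry on underflow */
                ;
            sink ^= r;                    /* consume the result */
        }
        uint64_t t1 = __rdtsc();

        printf("~%.1f TSC ticks per rdrand (sink=%llu)\n", (double)(t1 - t0) / N, sink);
        return 0;
    }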

Waine answered 14/6, 2012 at 23:32 Comment(8)
Please add a link to the IDF presentation.Marylyn
"I wrote librdrand" - 'nuf said.Sidewinder
So rdrand is like a high-latency load? Agner Fog's numbers indicate a throughput of one per ~110c on IvB, or one per ~460cycles on Skylake. I'm curious how much computation can overlap with rdrand, since most code that uses random numbers actually has lots of work to do other than generating random numbers. So I'm curious how much it would slow down some real code to use RDRAND instead of a super-fast PRNG like xorshift, or even vs. the fastest-possible non-random number generator: xor eax, eax.Streetlight
Do you get better results from sofware-pipelining? Generating the next iteration's random number before some slow calculation that hides the latency? Or does that not help much, because the rdrand itself can't retire, so it's stuck in the ROB?Streetlight
Update: phoronix.com/scan.php?page=news_item&px=RdRand-3-Percent / arstechnica.com/gadgets/2019/10/… - Microcode workarounds for speculative-execution data leaks have crippled rdrand performance, to roughly 3% of the previous speed on some Intel CPUs. See uops.info - e.g. one per 3554 cycles on Skylake. (Summarized at the bottom of this answer on intrinsics for rdrand / rdseed.)Streetlight
@PeterCordes: ouch! Is there a way to turn off these stupid mitigations? With such latency you might be better off calling __rdtsc() and hashing the result with a simple mixing / shuffling function for your PRNG needs.Thrasonical
@VioletGiraffe: Normal use of rdrand or rdseed (or a system call like read or getrandom) is just to seed a PRNG, cryptographically secure or otherwise. Even before this, it was too slow for most use-cases where anything else is acceptable for quality. If your need for quality entropy is so low that the low bits of rdtsc are usable, then sure, that always works without needing a retry loop or any CPU compatibility checking; you wouldn't consider the bother of inlining rdrand for that instead of e.g. C++ std::random_deviceStreetlight
@VioletGiraffe: AFAIK, no, you can't turn off that mitigation. It's in the microcode for rdrand directly, not like Spectre and MDS mitigation stuff the OS can choose to use or not. Most applications get their randomness from the kernel, not rdrand directly anyway, so the extra CPU time running rdrand (and mixing that into an entropy pool) is asynchronous to the need for random numbers.Streetlight

You'll find some relevant information at Intel Digital Random Number Generator (DRNG) Software Implementation Guide.

A verbatim quote follows:

Measured Throughput:

Up to 70 million RDRAND invocations per second
500+ million bytes of random data per second
Throughput ceiling is insensitive to the number of contending parallel threads
Granulation answered 8/6, 2012 at 7:38 Comment(5)
@user434507 - Always good to include the relevant bit. That link could break and this answer would become meaningless. I've done this for you this time :)Luxembourg
Quote: This has the effect of distilling the entropy into more concentrated samples. Awesome, isn't it?Are
@ArjunShankar, you are right and I considered doing that too, but there's also a number of interesting charts in the article.Granulation
70 million invocations per second. At what clock speed? That kinda matters too.Rhines
In case somebody reads the last comment, no it doesn't since the DRNG runs at 800 MHz regardless of the CPU speed (on Ivy Bridge anyways), see David's answerPriestcraft

I have done some preliminary throughput tests on an actual Ivy Bridge i7-3770 using Intel's "librdrand" wrapper, and it generates 33-35 million 32-bit numbers per second on a single core.

The 70M number from Intel is for about 8 cores; for one core they report only about 10M, so my test is over 3x better :-/
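
For reference, a minimal single-threaded sketch of this kind of measurement (not librdrand itself; assuming GCC with -mrdrnd), which XORs each value into a sink so the results are actually consumed:

    #include <immintrin.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 50 * 1000 * 1000 };
        unsigned int r, sink = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            while (!_rdrand32_step(&r))   /* retry on underflow */
                ;
            sink ^= r;                    /* use the result so the rdrand uops can't be squashed */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f million 32-bit rdrands/s (sink=%u)\n", N / sec / 1e6, sink);
        return 0;
    }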

Dedicated answered 10/6, 2012 at 16:21 Comment(1)
Did you actually use the result? David's answer says that the CPU discards incomplete rdrand uops if the result register is simply overwritten. (So e.g. store to memory or XOR it into something.)Streetlight

Here are some performance figures I get with rdrand: http://smackerelofopinion.blogspot.co.uk/2012/10/intel-rdrand-instruction-revisited.html

On an i5-3210M (2.5 GHz) Ivy Bridge (2 cores, 4 threads) I get a peak of ~99.6 million 64-bit rdrands per second with 4 threads, which equates to ~6.374 billion bits per second.

On an i7-3770 (3.4 GHz) Ivy Bridge (4 cores, 8 threads) I hit a peak throughput of 99.6 million 64-bit rdrands per second with 3 threads.

Hyponitrite answered 14/4, 2013 at 17:1 Comment(1)
How do you invoke stress-ng to get the throughput numbers? The best I have been able to do is stress-ng --rdrand 1 --metrics -t 60, but the metrics (like BogoMIPS) are not very useful to me.Changeup
