Exactly how "fast" are modern CPUs?

When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086) I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute. Depending on family, one (or four) cycles equated to one "memory fetch", and without caches to worry about, you could guess timings based on the number of memory accesses involved.

But with modern CPUs, I'm confused. I know they're a lot faster, but I also know that the headline gigahertz speed isn't helpful without knowing how many cycles of that clock are needed for each instruction.

So, can anyone provide some timings for two sample instructions on (let's say) a 2 GHz Core 2 Duo? Best and worst cases (assuming nothing in cache / everything in cache) would be useful.

Instruction #1: Add one 32-bit register to a second.

Instruction #2: Move a 32-bit value from register to memory.

Edit: The reason I ask this is to try to develop a "rule of thumb" that would allow me to look at simple code and roughly gauge the time taken, to the nearest order of magnitude.

Edit #2: Lots of answers with interesting points, but nobody (yet) has put down a figure measured in time. I appreciate there are "complications" to the question, but c'mon: If we can estimate the number of piano-tuners in NYC, we should be able to estimate code runtimes...

Take the following (dumb) code:

int32 sum = frigged_value();

// start timing
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += (i * j);
    }
    sum = sum / 1000;
}

// end timing

How can we estimate how long it will take to run... 1 femtosecond? 1 gigayear?

Fullback answered 11/1, 2009 at 15:53 Comment(7)
What do you need this knowledge for?Hoseahoseia
Hopefully the compiler will notice that your loop is pure and optimize the computation away.Amersfoort
@jrockway: sum = frigged_value() should make that almost impossible.Fullback
What I'm reading is: if you're asking a theoretical question without context, then maybe somebody would be able to give you a meaningless answer (or you can calculate your own). And if you were to provide context, then it would still be easier and more accurate to test.Vocalise
@le dorfier : If you feel there's context missing, then make some assumptions (listing them, if you like), and have an educated guess. As I said, I'm not after an accurate figure.Fullback
Related: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? (But that's about the asm. In your case, there's another factor of compile-time optimization, whether or not the compiler finds optimizations like strength-reduction of the multiply into simple addition, and auto-vectorizing with SIMD, or with the closed-form sum of 0..10000 and then multiply by i. And yes, clang does recognize sum-of-linear-sequence loops and optimize with a closed form that avoids overflow.)Avarice
Agner Fog has put together some good resources about this topic: agner.org/optimize agner.org/optimize/#testpOriana

Modern processors such as the Core 2 Duo that you mention are both superscalar and pipelined. They have multiple execution units per core and are actually working on more than one instruction at a time per core; this is the superscalar part. The pipelined part means that there is a latency from when an instruction is read in and "issued" to when it completes execution, and this time varies depending on the dependencies between that instruction and the others moving through the other execution units at the same time. So, in effect, the timing of any given instruction varies depending on what is around it and what it depends on. This means that a given instruction has sort of a best-case and worst-case execution time based on a number of factors. Because of the multiple execution units you can actually have more than one instruction completing execution per core clock, but sometimes there are several clocks between completions if the pipeline has to stall waiting for memory or for dependencies in the pipelines.

All of the above is just from the view of the CPU core itself. Then you have interactions with the caches and contention for bandwidth with the other cores. The Bus Interface Unit of the CPU deals with getting instructions and data fed into the core and putting results back out of the core through the caches to memory.

Rough order-of-magnitude rules of thumb, to be taken with a grain of salt (a back-of-envelope sketch follows the list):

  • Register-to-register operations take 1 core clock to execute. This should generally be conservative, especially as more of these appear in sequence.
  • Memory-related load and store operations take 1 memory bus clock to execute. This should be very conservative. With a high cache hit rate it will be more like 2 CPU bus clocks, which is the clock rate of the bus between the CPU core and the cache, but not necessarily the core's clock.
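
To make these rules concrete, here is a tiny back-of-envelope estimator in C++. Every input below (clock rate, penalty factor, op counts) is an illustrative assumption, not a measured value:

#include <cstdio>

int main() {
    // Assumed machine parameters (illustrative only).
    const double core_hz     = 2.0e9;   // 2 GHz core clock
    const double mem_penalty = 2.0;     // ~2 clocks per load/store that hits cache

    // Guessed instruction mix for some piece of code (also illustrative).
    const double reg_ops = 3.0e9;       // register-to-register operations
    const double mem_ops = 1.0e9;       // loads and stores, mostly cache hits

    // Rule of thumb: 1 clock per register op, mem_penalty clocks per memory op.
    const double est_cycles  = reg_ops * 1.0 + mem_ops * mem_penalty;
    const double est_seconds = est_cycles / core_hz;

    std::printf("rough estimate: %.2f seconds\n", est_seconds);
    return 0;
}

Treat the output as an order-of-magnitude guess at best; as discussed above, pipelining and cache behavior can push the real number either way.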
Acrosstheboard answered 11/1, 2009 at 16:16 Comment(5)
A current-generation AMD or Intel multicore processor can deliver two words per CPU clock from the level 1 cache.Housen
@Norman - I agree and there are lots of possibilities for variation here, but remember instructions have to be fetched too, not just the data in and out. So, I'll restate all the caveats of rough order of magnitude, grain of salt, etc. towards my rule of thumb average numbers.Acrosstheboard
@NormanRamsey: Core2 / Nehalem have an L1d throughput of 1 load and 1 store per clock, with either being any width up to 128-bit (SSE vector). AMD was ahead of Intel for a while, with 2 memory ops per clock which could both be loads or 1 load 1 store. Haswell and later can do 2 loads + 1 store per clock, all of which can be 256 bits wide. But yes, this answer's estimate of 2 core clocks per memory access is way too expensive for L1d hits. Maybe a good rule of thumb for L2 hits. (L1 and L2 are per-core private and run at core frequency; L3 runs at max(any core) on Intel)Avarice
For more details, agner.org/optimize; also realworldtech.com/sandy-bridge; and other links in stackoverflow.com/tags/x86/info. Also more links to x86 "cost model" stuff in #58802823 (the actual question is nonsense, but my answer mentions latency vs. front-end throughput vs. back-end port pressure as the actual dimensions of performance for a single instruction or small block.)Avarice
Instructions / clock can easily vary from (much) less than 1 up to close to 4 or 5, depending on the workload. Skylake running SPECint as compiled by modern compilers gets about 1.7 IPC. (researchgate.net/publication/…)Avarice

It is nearly impossible to provide the accurate timing information you are expecting in a way that will be USEFUL to you.

The following concepts affect instruction timing; some can vary from moment to moment:

  • Micro-op decomposition
  • Operation pipelining
  • Super-scalar execution
  • Out of order execution
  • SMT / SMP execution
  • Floating point mode
  • Branch prediction / pre-fetch
  • Cache latency
  • Memory latency
  • Clock speed throttling
  • etc

Consult a book on modern computer architecture if you need any further explanation on the above concepts.

The best way to measure the speed of your code is (surprise!) to measure the speed of your code running the same workload and under the same conditions as you expect it to encounter "in the real world".
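
For instance, a minimal C++ timing harness around the loop from the question might look like the sketch below. The volatile seed is a stand-in for frigged_value() (which the question leaves undefined), sum is widened to long long to avoid the signed overflow the original int32 would hit, and the result is printed so the compiler cannot discard the loop as dead code:

#include <chrono>
#include <cstdio>

volatile int seed = 42;   // stand-in for frigged_value(); volatile keeps it opaque to the optimizer

int main() {
    long long sum = seed;  // widened from the question's int32 to avoid signed overflow

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < 10000; j++) {
            sum += i * j;
        }
        sum = sum / 1000;
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::printf("sum = %lld, elapsed = %.3f s\n", sum, elapsed.count());
    return 0;
}

Even then, as the comments on the question point out, an optimizing compiler may still strength-reduce or vectorize the inner loop, so the measured time may not correspond to a naive count of the instructions as written.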

Linear answered 11/1, 2009 at 16:37 Comment(0)

Using a description largely based on Intel Pentium architecture, to cut a very very long story short:

  • the processor has a number of "execution units" that can perform different types of 'micro-ops'; instructions may be split into several micro-ops
  • the different execution units essentially run in parallel
  • each micro-op ties up the corresponding execution unit for a certain number of clock cycles so meanwhile no other instruction can use that execution unit: e.g. "floating point add" may tie up the "FP execute" unit for 2 clock cycles
  • execution units are grouped by "port", and each clock cycle, a new micro-op can be sent to each port (assuming the relevant execution unit is free at that moment); some units can also be sent an "extra op" halfway through the cycle; so each clock cycle, a certain number of ops can start executing;
  • the processor can re-order micro-ops where this doesn't break dependencies (or where the result can still be reconstructed) to take advantage of which execution units are free at a given moment
  • so instructions can be executing in parallel, but which parts of which instructions are executing at any one time is quite a complex situation
  • the overall time for a given instruction thus depends on how long it had to "wait" for the necessary execution units to become available, the actual time that those ops spent running on the given units, plus any extra time required to "tie up the result"

Since the timing of an instruction depends on the surrounding instructions, in practice it's usually best to time a representative piece of code rather than worry about individual instructions. However:

  • Intel (and presumably other manufacturers) publish a list of instruction throughput and latency timings
  • the throughput is the number of clock cycles actually needed on the relevant execution unit(s)
  • the latency is a "worst case" number of clock cycles required, once an instruction starts executing, before the result of that execution is available as input to another instruction

So for example, if, say, floating point add and multiply instructions each have a throughput of 2 and a latency of 5 (actually, for multiply it's a bit greater I think), that means that adding a register to itself or multiplying it by itself will likely take two clock cycles (since there are no other dependent values), whereas adding to it the result of a previous multiplication will take something like, or a bit less than, 2+5 clock cycles, depending on where you start/finish timing, and on all sorts of other things. (During some of those clock cycles, another add/multiply operation could be taking place, so it's arguable how many cycles you actually attribute to the individual add/multiply instructions anyway...)
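
The dependency effect is easy to demonstrate from C++. The sketch below (the iteration count and the use of double additions are arbitrary choices) performs the same number of additions twice: once as a single serial dependency chain, and once split across four independent accumulators that the out-of-order hardware can keep in flight simultaneously. On typical hardware the second version runs noticeably faster, because it is limited by add throughput rather than add latency:

#include <chrono>
#include <cstdio>

int main() {
    const long n = 100000000;  // 100 million additions per test (arbitrary)

    // Test 1: one serial dependency chain; each add must wait for the previous result (latency-bound).
    double a = 1.0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < n; i++) {
        a += 1e-9;
    }
    auto t1 = std::chrono::steady_clock::now();

    // Test 2: four independent chains; the adds can overlap in the pipeline (throughput-bound).
    double b0 = 1.0, b1 = 1.0, b2 = 1.0, b3 = 1.0;
    auto t2 = std::chrono::steady_clock::now();
    for (long i = 0; i < n; i += 4) {
        b0 += 1e-9;
        b1 += 1e-9;
        b2 += 1e-9;
        b3 += 1e-9;
    }
    auto t3 = std::chrono::steady_clock::now();

    std::chrono::duration<double> serial = t1 - t0;
    std::chrono::duration<double> split  = t3 - t2;
    std::printf("serial chain: %.3f s, four chains: %.3f s (sums: %f %f)\n",
                serial.count(), split.count(), a, b0 + b1 + b2 + b3);
    return 0;
}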

Oh, and just as a concrete example: for the following Java code

public void runTest(double[] data, double randomVal) {
  for (int i = data.length-1; i >= 0; i--) {
    data[i] = data[i] + randomVal;
  }
}

Hotspot 1.6.12 JIT-compiles the inner loop sequence to the following Intel code, consisting of a load-add-store for each position in the array (with 'randomVal' being held in XMM0a in this case):

  0b3     MOVSD  XMM1a,[EBP + #16]
  0b8     ADDSD  XMM1a,XMM0a
  0bc     MOVSD  [EBP + #16],XMM1a
  0c1     MOVSD  XMM1a,[EBP + #8]
  0c6     ADDSD  XMM1a,XMM0a
  0ca     MOVSD  [EBP + #8],XMM1a
  ...

Each group of load-add-store appears to take 5 clock cycles.

Fia answered 11/1, 2009 at 16:48 Comment(1)
Re: "Intel (and presumably other manufacturers) publish a list of instruction throughput and latency timings", Yes, AMD publishes these numbers as well. You just have to go to their web site and download the "optimization guide" for a processor.Terrel

It's not that simple. The timing for your two instructions won't help you gauge performance of a larger set of instructions much. That's because modern processors can execute many operations in parallel, and have large caches so "moving a value to memory" happens at a time quite removed from the instruction's execution.

So, best case is zero (when executed in parallel with other instructions). But how does that help you?

This web page shows some benchmarks, including some %MIPS/MHz results. As you can see, on many benchmarks there are multiple instructions executed per clock cycle. The charts also show the effects of cache size and memory speed.

Linsk answered 11/1, 2009 at 16:1 Comment(3)
I think he's just asking for some kind of average latency for some simple instructions.Ellamaeellan
Understood. That's why I asked for best/worst case times. And it's just to get a rough handle on things.Fullback
To be more precise, no instructions ever execute in zero clocks. There can be zero clocks between instruction completions as viewed in the linear sequence, but there is always a latency from start to finish for any given instruction, and it is actually several clocks.Acrosstheboard

Modern processors do even more tricky things.

Out-of-order execution. If it is possible to do so without affecting correct behavior, the processors may execute instructions in a different order than they are listed in your program. This can hide the latency of long-running instructions.

Register renaming. Processors often have more physical registers than addressable registers in their instruction set (so-called "architectural" registers). This can be either for backward compatibility, or simply to enable efficient instruction encodings. As a program runs, the processor will "rename" the architectural registers it uses to whatever physical registers are free. This allows the processor to realize more parallelism than existed in the original program.

For instance, if you have a long sequence of operations on EAX and ECX, followed by instructions that re-initialize EAX and ECX to new values and perform another long sequence of operations, the processor can use different physical registers for both tasks, and execute them in parallel.

The Intel P6 microarchitecture does both out-of-order execution and register renaming. The Core 2 architecture is the latest derivative of the P6.

To actually answer your question - it is basically impossible for you to determine performance by hand in the face of all these architectural optimizations.

Hudgins answered 11/1, 2009 at 16:38 Comment(0)

The kind of prediction you're asking for is hopeless.

If you want a rule of thumb, here are some rules of thumb:

  • In the time it takes to get a word from level 2 cache, a processor can execute at least 10 instructions. So worry about memory access, not instruction counts---computation in registers is almost free.

  • In the time it takes to get a word from RAM, a processor can execute thousands of instructions (this number varies by a couple of orders of magnitude depending on the details of your hardware). Make sure this happens only on a cold cache; otherwise nothing else matters (see the sketch after this list).

  • If you're running on x86 CPUs, there aren't enough registers. Try not to have more than 5 live variables in your code at any moment. Or better yet, move to AMD64 (x86_64) and double the number of registers. With 16 registers, and parameters passed in registers, you can quit worrying about registers.
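
To illustrate the first two points, here is a sketch that performs exactly the same number of additions over the same array twice: once walking it sequentially (cache- and prefetcher-friendly) and once jumping around with a large stride (mostly cache misses). The array size and stride below are arbitrary; on most machines the second pass is several times slower even though the count of arithmetic instructions is essentially identical:

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 26;        // 64M ints, ~256 MB: far larger than any cache
    const size_t stride = 4099;      // prime stride, coprime with n, so every element is still visited exactly once
    std::vector<int> data(n, 1);

    // Pass 1: sequential walk; the hardware prefetcher keeps the data streaming in.
    long long sum1 = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; i++) {
        sum1 += data[i];
    }
    auto t1 = std::chrono::steady_clock::now();

    // Pass 2: strided walk; same loads and adds (plus a little index arithmetic), but mostly cache misses.
    long long sum2 = 0;
    auto t2 = std::chrono::steady_clock::now();
    for (size_t i = 0, idx = 0; i < n; i++, idx = (idx + stride) % n) {
        sum2 += data[idx];
    }
    auto t3 = std::chrono::steady_clock::now();

    std::chrono::duration<double> seq = t1 - t0, strided = t3 - t2;
    std::printf("sequential: %.3f s, strided: %.3f s (sums %lld %lld)\n",
                seq.count(), strided.count(), sum1, sum2);
    return 0;
}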

There was a time when every year I would ask an architect what rules of thumb I should use to predict the cost of the code my compilers generate. I've stopped, because the last time I received a useful answer was in 1999. (The answer was "make sure your loops fit in the reorder buffer". All those who know what a reorder buffer is may now raise your hands. Bonus points if you can discover the size of the reorder buffer on any computer you are currently using.)

Housen answered 11/1, 2009 at 20:45 Comment(3)
Thanks. It makes sense that memory access speed is basically the key, as modern CPU architectures effectively decouple memory and CPU usage much better.Fullback
Good points here. Definitely agree about memory access. Mispredicted branch is another speed killer. Nicely enough, modern CPUs offer performance count features just for looking at this kind of thing.Proteus
Almost free ... until you're running a loop like this where the loop body doesn't touch memory, then it's pure latency (dependencies) or throughput of ALU instructions. And we're of course at the mercy of compiler optimizations to spot stuff like strength-reduction or do auto-vectorization, or apply the closed-form formula for sum of j=1..n (even if scaled by a loop-invariant like i)Avarice

You can download the Intel 64 and IA-32 manuals here.

But what you really need is the stuff from Agner Fog.

He has a lot of additional info, for example his manual "Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPUs".

Or test programs for counting clock cycles (he uses the time stamp counter).

Tribal answered 26/1, 2009 at 16:55 Comment(0)

This only answers part of your question, but I found this table from Wikipedia on locality of reference helpful. It describes the speed of access to and amount of memory in different levels of the memory hierarchy, using approximate 2006 times:

  • CPU registers (8-32 registers) – immediate access (0-1 clock cycles)
  • L1 CPU caches (32 KiB to 128 KiB) – fast access (3 clock cycles)
  • L2 CPU caches (128 KiB to 12 MiB) – slightly slower access (10 clock cycles)
  • Main physical memory (RAM) (256 MiB to 4 GiB) – slow access (100 clock cycles)
  • Disk (file system) (1 GiB to 1 TiB) – very slow (10,000,000 clock cycles)
  • Remote Memory (such as other computers or the Internet) (Practically unlimited) – speed varies
Herringbone answered 21/2, 2009 at 8:55 Comment(2)
I wonder where these numbers come from..? I guess you cannot measure how long it takes to access something from cache (from main mem?) How do clock cycles translate into nanoseconds?Mcnew
@Nils: sure you can. The standard way to measure cache load->use latency is pointer-chasing, usually by traversing a linked list. Make the linked list small and circular (or a pointer that points to itself) and you're measuring L1d. Make it big enough to not fit in L1 and you're measuring L2. You can check with CPU performance counters that you're getting mostly L1 misses and L2 hits. Same for measuring L3 or main memory. You can also have a loop that traverses 2 or 3 linked lists in parallel to test memory-level parallelism.Avarice
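
A minimal version of the pointer-chasing idea described in the comment above might look like the sketch below. The working-set size and the 2 GHz figure used to convert nanoseconds into cycles are assumptions; shrink list_bytes to land in L1 or L2 instead of main memory:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t list_bytes = 32 * 1024 * 1024;   // working set; shrink to ~16 KB to stay in L1
    const size_t count = list_bytes / sizeof(void*);
    const double clock_ghz = 2.0;                 // assumed core clock, used only for the cycle conversion

    // Visit the nodes in a shuffled order so the hardware prefetchers cannot guess the next address.
    std::vector<size_t> order(count);
    std::iota(order.begin(), order.end(), size_t{0});
    std::mt19937 rng(12345);
    std::shuffle(order.begin(), order.end(), rng);

    // Each element stores the address of the next element in the shuffled cycle.
    std::vector<void*> node(count);
    for (size_t i = 0; i < count; i++) {
        node[order[i]] = &node[order[(i + 1) % count]];
    }

    // Chase the pointers: every load depends on the previous one, so
    // elapsed time divided by the number of hops approximates the load-to-use latency.
    const size_t hops = 20000000;
    void* p = node[0];
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < hops; i++) {
        p = *static_cast<void**>(p);
    }
    auto t1 = std::chrono::steady_clock::now();

    double ns_per_hop = std::chrono::duration<double>(t1 - t0).count() * 1e9 / hops;
    std::printf("%.2f ns per load (~%.1f cycles at %.1f GHz), end %p\n",
                ns_per_hop, ns_per_hop * clock_ghz, clock_ghz, p);
    return 0;
}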

A lot of good answers on this thread already, but one topic is so far unmentioned: branch misprediction.

Because all modern processors are pipelined, when the instruction decoder runs into an instruction like "jump if equal", it has no idea which way the instruction will jump, so it just guesses. It then continues feeding instructions into the pipeline based on that guess. If it makes the correct prediction, the throughput and latency of the jump instruction are essentially zero. If it makes the wrong guess, the throughput and latency of the same jump instruction could be 50 or 100 cycles.

Note that the same instruction can have "zero cost" the first time it's executed in a loop and a really huge cost the next time the same instruction is executed!
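
A classic way to see this from C++ is to run the same data-dependent branch over random data and then over sorted data, as in the sketch below (the array size and threshold are arbitrary). The instruction stream is identical in both passes, but the unpredictable branch typically makes the first pass severalfold slower:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sum only the elements above a threshold; the `if` is the branch under test.
static long long sum_above(const std::vector<int>& v, int threshold) {
    long long sum = 0;
    for (int x : v) {
        if (x > threshold) {
            sum += x;
        }
    }
    return sum;
}

int main() {
    std::vector<int> data(20000000);
    std::mt19937 rng(1);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data) x = dist(rng);

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sum_above(data, 127);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.3f s (sum %lld)\n", label,
                    std::chrono::duration<double>(t1 - t0).count(), s);
    };

    time_it("random order (branch mispredicts often)");
    std::sort(data.begin(), data.end());
    time_it("sorted order (branch predicts well)");
    return 0;
}

Note that an aggressive compiler may turn the if into a branchless conditional move or vectorize the loop, which hides the effect; this is a sketch of the idea, not a guaranteed reproduction.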

Terrel answered 27/1, 2009 at 5:32 Comment(3)
Mis-predicted branches are a consideration, but I would not consider the cost hit to be "really huge". For example, a data miss in both the L1 & L2 caches is a much larger hit. Usually, the prediction miss costs about the same as the instruction pipeline's depth, i.e. a pipeline restart is needed.Acrosstheboard
Right, well "really huge" is relative, and it depends on which processor you're talking about. Some have much longer pipelines than others.Terrel
It was really bad on the Pentium 4s. It is pretty bad on hyperthreading Nehalem too, although it gets more work done overall by switching threads.Proteus

All you need is in the appropriate CPU manuals. Both AMD and Intel have PDFs available on their websites describing the latencies of every instruction.

Just keep in mind the complexity of modern CPUs. They don't execute one instruction at a time; they can load 3-4 instructions per cycle, and almost all instructions are pipelined, so when the next instructions are loaded, the current ones are nowhere near finished. They also reorder instructions to allow for more efficient scheduling. A modern CPU can easily have 50 instructions in progress at a time.

So you're asking the wrong question. The time taken for a single instruction varies wildly depending on how and when you measure. It depends on how busy the instruction decoder is, on the branch predictor, on scheduling, and on which other instructions are being scheduled, in addition to the simple issues like caching.

Anergy answered 11/1, 2009 at 17:59 Comment(0)

I recommend downloading the AMD software optimization guide.

Terrel answered 27/1, 2009 at 5:38 Comment(0)

It took almost 11 years, but I have an estimate. Your loop is about 10 ops * 100 million iterations, so approximately 1 billion ops. On a 2.3 GHz machine, I would estimate on the order of 0.4 seconds. When I tested it, I actually got 1.2 seconds. So it's within one order of magnitude.

Just take your core frequency, estimate the ops, and divide. This gives a very rough estimate and I've never been more than an order of magnitude off whenever I test empirically. Just make sure your op estimates are reasonable.

Pagination answered 14/11, 2019 at 1:55 Comment(1)
This should be ticked as the correct answer.Vivianne

As Doug already noted, the best case is zero (superscalar processor, multiple execution units, data already in L1 cache).

The worst case is up to several milliseconds (when the OS handles a page fault and has to fetch the data/instruction from the disk). Excluding disk/swapping, it still depends on whether you have a NUMA machine, which kind of topology it has, in which memory node the data lies, whether there is concurrent access from another CPU (bus locking and cache synchronization protocols), etc.

Ortrude answered 11/1, 2009 at 16:15 Comment(1)
Actually to be more precise, no instructions ever execute in zero clocks. There can be zero clocks between instruction completions as viewed in the linear sequence, but there is always a latency from start to finish for any given instruction and it is actually several clocks.Acrosstheboard

An interesting quote from Alan Kay in 2004:

Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures.

The implication seems to be that CPU performance enhancements focus on areas where they have relatively little impact on the software we really write.

Vocalise answered 11/1, 2009 at 18:14 Comment(3)
I think the point is that CPUs themselves have gotten that much faster, but the systems around them have not. The real bottleneck today is I/O, whether memory, disk, or network. CPU MIPS is only the bottleneck for a relatively narrow class of apps compared to what most people use computers for.Acrosstheboard
I've seen supporting quotes asserting that the basis for measurement was "bytecodes-executed-per-second"; so it had nothing to do with other system resources. Do you have any supporting references, or is that a guess? I doubt AK would have made the statement if it was so easily discredited.Vocalise
I wonder what benchmark that was, but could not find it by quickly googling.Mcnew

I don't think the worst case is bounded on some platforms. When you have multiple cores and processors vying for the same locations or adjacent memory locations you can see all kinds of degradation in performance. Cache lines have to get moved around from processor to processor. I haven't seen a good worst-case number for memory operations on modern platforms.

Evangelize answered 26/1, 2009 at 17:32 Comment(2)
Perhaps a better example of the worst case being somewhat unbounded is a data access to a page that needs to be swapped in, i.e. a virtual memory page miss. Aside from that, any instruction's completion might take quite a long time due to the factors mentioned, but I think those have well-bounded upper limits.Acrosstheboard
Cache line contention, oy! You can get negative scaling if you hammer on the same variable from multiple threads!Proteus
