Performance of modern processor

Asked 29/12, 2011 at 22:36 Answered 30/12, 2011 at 6:39

Solved performance assembly x86-64 amd-processor

Being executed on modern processor (AMD Phenom II 1090T), how many clock ticks does the following code consume more likely : 3 or 11?

label:  mov (%rsi), %rax
        adc %rax, (%rdx)
        lea 8(%rdx), %rdx
        lea 8(%rsi), %rsi
        dec %ecx
        jnz label

The problem is, when I execute many iterations of such code, results vary near 3 OR 11 ticks per iteration from time to time. And I can't decide "who is who".

UPD According to Table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on AMD K10 microarchitecture. Therefore, impossible 3 ticks per iteration are caused by bugs in measurement.

SOLVED @Atom noticed, that cycle frequency isn't constant in modern processors. When I disabled in BIOS three options - Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control, consumption of my "six instructions" stabilized on 3 clock ticks :-)

Peeples answered 29/12, 2011 at 22:36 Comment(11)

Are you looking for throughput or latency? Can we assume everything is in the L1 cache? L2? RAM? How fast is RAM? – Weaver 29/12, 2011 at 22:37

As Dietrich hints at, the difference almost certainly has to do with where the data is located. When the loop runs the fastest the data is already in L1 cache, when it runs slowest the data must be brought in from slower memory. – Lanilaniard 29/12, 2011 at 22:43

How do you measure these ticks per iteration? You just run the code under a modern windowed multitasking operating system and measure the time it takes to execute a large number of iterations? (And I am hinting at the pointlessness of trying to run a benchmark and expect any accuracy out of it nowadays.) – Ragland 29/12, 2011 at 22:45

See a related question: #8629233 – Ragland 29/12, 2011 at 22:47

I'd say much more unless %ecx starts of as 1. But assuming you mean per iteration: Such a question isn't all that easy to answer: Are cachemisses involved (likely for any measurable amount of iterations)? If so, how far do these misses go (so L2, L3 or main memory (or disk))? How many iterations does your loop typically take? Is that number constant (branch prediction)? How is that code block aligned?... (some points are unlikely to make an impact, but the point remains). But the main question is: why do you care? Does the exact tickcount (for that specific architecture) matter? – Charming 29/12, 2011 at 22:48

Dietrich Epp, everything is in the L1 certainly. Overall data size is about 1KB. – Peeples 29/12, 2011 at 22:57

MikeNakis, like this

// needed values in proper registers     clock_gettime(CLOCK_MONOTONIC, &t1);     [mov 100 to cx]     [code from the question]     clock_gettime(CLOCK_MONOTONIC, &t2);     // 32/10 - my cpu is 3.2 GHz     printf("%ld\n", (t2.tv_nsec - t1.tv_nsec)*32/10);

the output typically is 316, 310, 301 or 1120... – Peeples 29/12, 2011 at 23:4

@Peeples Note that latency is just one of many factors. Modern processors are able to run multiple instructions in parallel. So I can definitely see how it's possible to get 3 cycles/iteration. – Farreaching 30/12, 2011 at 6:11

@Mysticial, throughput parameter in mentioned table is used to count ticks with consideration of parallel execution, isn't it? – Peeples 30/12, 2011 at 6:18

Yes. I'll make an answer with more details in a few minutes. – Farreaching 30/12, 2011 at 6:20

Are you sure the CPU is running at 3.2 GHz all the time? – Laurustinus 30/12, 2011 at 13:11

I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.

(Note that this is for processors in general and I make no references specific to AMD processors.)

Key Concepts:

Most modern (non-embedded) processors today are both super-scalar and out-of-order. Not only can execute multiple (independent) instructions in parallel, but they can re-order instructions to break dependencies and such.

Let's break down your example:

label:
    mov (%rsi), %rax
    adc %rax, (%rdx)
    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx
    jnz label

The first thing to notice is that the last 3 instructions before the branch are all independent:

    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx

So it's possible for a processor to execute all 3 of these in parallel.

Another thing is this:

adc %rax, (%rdx)
lea 8(%rdx), %rdx

There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is false dependency because the second instruction doesn't actually depend on the output of the first instruction. Modern processors are able to rename the rdx register to allow these two instructions to be re-ordered or done in parallel.

Same applies to the rsi register between:

mov (%rsi), %rax
lea 8(%rsi), %rsi

So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):

1:   mov (%rsi), %rax        lea 8(%rdx), %rdx        lea 8(%rsi), %rsi
2:   adc %rax, (%rdx)        dec %ecx
3:   jnz label

*Of course, I'm over-simplifying things for simplicity. In reality the latencies are probably longer and there's overlap between different iterations of the loop.

In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...

Farreaching answered 30/12, 2011 at 6:39 Comment(3)

There is one interesting fact i forgot to mention, that "number of cycles" doesn't vary within OS process execution, as the processor virtually chooses "speed" on the start of the task. See two real consequent executions (measurement code from my comment to the question is repeated 20 times in loop): pastie.org/3094509 – Peeples 30/12, 2011 at 7:22

That's definitely an interesting result - hard to explain without knowing the hard-core uncensored details of your processor. (which I don't) But as I mentioned, there's a ton of factors that could make it less than optimal. (On some Intels, mixing adc/sub with inc/dec would lead to some very bad stalls... Not sure on AMDs though.) – Farreaching 30/12, 2011 at 7:29

Ah... so it took an extra run to get the CPU out of power-saving mode... :) – Farreaching 30/12, 2011 at 18:20

At Intel, Dr. David Levinthal's "Performance Analysis Guide" investigates the answers to such questions in great detail.

Noctambulous answered 29/12, 2011 at 22:41 Comment(0)

Recommended topics

Hot tags