Performance of modern processor
Asked Answered
P

2

6

Being executed on modern processor (AMD Phenom II 1090T), how many clock ticks does the following code consume more likely : 3 or 11?

label:  mov (%rsi), %rax
        adc %rax, (%rdx)
        lea 8(%rdx), %rdx
        lea 8(%rsi), %rsi
        dec %ecx
        jnz label

The problem is, when I execute many iterations of such code, results vary near 3 OR 11 ticks per iteration from time to time. And I can't decide "who is who".

UPD According to Table of instruction latencies (PDF), my piece of code takes at least 10 clock cycles on AMD K10 microarchitecture. Therefore, impossible 3 ticks per iteration are caused by bugs in measurement.

SOLVED @Atom noticed, that cycle frequency isn't constant in modern processors. When I disabled in BIOS three options - Core Performance Boost, AMD C1E Support and AMD K8 Cool&Quiet Control, consumption of my "six instructions" stabilized on 3 clock ticks :-)

Peeples answered 29/12, 2011 at 22:36 Comment(11)
Are you looking for throughput or latency? Can we assume everything is in the L1 cache? L2? RAM? How fast is RAM?Weaver
As Dietrich hints at, the difference almost certainly has to do with where the data is located. When the loop runs the fastest the data is already in L1 cache, when it runs slowest the data must be brought in from slower memory.Lanilaniard
How do you measure these ticks per iteration? You just run the code under a modern windowed multitasking operating system and measure the time it takes to execute a large number of iterations? (And I am hinting at the pointlessness of trying to run a benchmark and expect any accuracy out of it nowadays.)Ragland
See a related question: #8629233Ragland
I'd say much more unless %ecx starts of as 1. But assuming you mean per iteration: Such a question isn't all that easy to answer: Are cachemisses involved (likely for any measurable amount of iterations)? If so, how far do these misses go (so L2, L3 or main memory (or disk))? How many iterations does your loop typically take? Is that number constant (branch prediction)? How is that code block aligned?... (some points are unlikely to make an impact, but the point remains). But the main question is: why do you care? Does the exact tickcount (for that specific architecture) matter?Charming
Dietrich Epp, everything is in the L1 certainly. Overall data size is about 1KB.Peeples
MikeNakis, like this // needed values in proper registers clock_gettime(CLOCK_MONOTONIC, &t1); [mov 100 to cx] [code from the question] clock_gettime(CLOCK_MONOTONIC, &t2); // 32/10 - my cpu is 3.2 GHz printf("%ld\n", (t2.tv_nsec - t1.tv_nsec)*32/10); the output typically is 316, 310, 301 or 1120...Peeples
@Peeples Note that latency is just one of many factors. Modern processors are able to run multiple instructions in parallel. So I can definitely see how it's possible to get 3 cycles/iteration.Farreaching
@Mysticial, throughput parameter in mentioned table is used to count ticks with consideration of parallel execution, isn't it?Peeples
Yes. I'll make an answer with more details in a few minutes.Farreaching
Are you sure the CPU is running at 3.2 GHz all the time?Laurustinus
F
8

I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.

(Note that this is for processors in general and I make no references specific to AMD processors.)

Key Concepts:

Most modern (non-embedded) processors today are both super-scalar and out-of-order. Not only can execute multiple (independent) instructions in parallel, but they can re-order instructions to break dependencies and such.

Let's break down your example:

label:
    mov (%rsi), %rax
    adc %rax, (%rdx)
    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx
    jnz label

The first thing to notice is that the last 3 instructions before the branch are all independent:

    lea 8(%rdx), %rdx
    lea 8(%rsi), %rsi
    dec %ecx

So it's possible for a processor to execute all 3 of these in parallel.

Another thing is this:

adc %rax, (%rdx)
lea 8(%rdx), %rdx

There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is false dependency because the second instruction doesn't actually depend on the output of the first instruction. Modern processors are able to rename the rdx register to allow these two instructions to be re-ordered or done in parallel.

Same applies to the rsi register between:

mov (%rsi), %rax
lea 8(%rsi), %rsi

So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):

1:   mov (%rsi), %rax        lea 8(%rdx), %rdx        lea 8(%rsi), %rsi
2:   adc %rax, (%rdx)        dec %ecx
3:   jnz label

*Of course, I'm over-simplifying things for simplicity. In reality the latencies are probably longer and there's overlap between different iterations of the loop.

In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...

Farreaching answered 30/12, 2011 at 6:39 Comment(3)
There is one interesting fact i forgot to mention, that "number of cycles" doesn't vary within OS process execution, as the processor virtually chooses "speed" on the start of the task. See two real consequent executions (measurement code from my comment to the question is repeated 20 times in loop): pastie.org/3094509Peeples
That's definitely an interesting result - hard to explain without knowing the hard-core uncensored details of your processor. (which I don't) But as I mentioned, there's a ton of factors that could make it less than optimal. (On some Intels, mixing adc/sub with inc/dec would lead to some very bad stalls... Not sure on AMDs though.)Farreaching
Ah... so it took an extra run to get the CPU out of power-saving mode... :)Farreaching
N
2

At Intel, Dr. David Levinthal's "Performance Analysis Guide" investigates the answers to such questions in great detail.

Noctambulous answered 29/12, 2011 at 22:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.