I won't try to answer with certainty how many cycles (3 or 10) it will take to run each iteration, but I'll explain how it might be possible to get 3 cycles per iteration.
(Note that this is for processors in general and I make no references specific to AMD processors.)
Key Concepts:
Most modern (non-embedded) processors today are both super-scalar and out-of-order. Not only can they execute multiple (independent) instructions in parallel, but they can also re-order instructions to break dependencies and such.
Let's break down your example:
label:
mov (%rsi), %rax
adc %rax, (%rdx)
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
jnz label
The first thing to notice is that the last 3 instructions before the branch are all independent:
lea 8(%rdx), %rdx
lea 8(%rsi), %rsi
dec %ecx
So it's possible for a processor to execute all 3 of these in parallel.
Another thing is this:
adc %rax, (%rdx)
lea 8(%rdx), %rdx
There seems to be a dependency on rdx that prevents the two from running in parallel. But in reality, this is a false (write-after-read) dependency, because the second instruction doesn't actually depend on the output of the first instruction. Modern processors are able to rename the rdx register to allow these two instructions to be re-ordered or done in parallel.
The same applies to the rsi register between:
mov (%rsi), %rax
lea 8(%rsi), %rsi
So in the end, 3 cycles is (potentially) achievable as follows (this is just one of several possible orderings):
1:    mov (%rsi), %rax    lea 8(%rdx), %rdx    lea 8(%rsi), %rsi
2:    adc %rax, (%rdx)    dec %ecx
3:    jnz label
*Of course, I'm simplifying things here. In reality, the latencies are probably longer and there's overlap between different iterations of the loop.
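As a rough illustration of the schedule above, here is a minimal sketch in C of a greedy list scheduler. It assumes every instruction has a 1-cycle latency, that at most `width` instructions issue per cycle, and that register renaming has already removed the false dependencies, so only true (read-after-write) dependencies remain. The function name `schedule_cycles` and the `dep` encoding are made up for this example; real hardware is far more complicated.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model only (assumptions: 1-cycle latency, `width` issue slots per
 * cycle, false dependencies already removed by renaming).
 *
 * dep[i] is the index of the one earlier instruction whose result
 * instruction i needs, or -1 if it has no true dependency.
 * cycle_out[i] receives the (1-based) cycle in which instruction i issues.
 * Returns the total number of cycles used.
 */
int schedule_cycles(const int *dep, size_t n, int width, int *cycle_out)
{
    assert(width > 0);
    for (size_t i = 0; i < n; i++) {
        assert(dep[i] < (int)i);   /* producers must come earlier */
        cycle_out[i] = 0;          /* 0 means "not issued yet" */
    }

    size_t done = 0;
    int cycle = 0;
    while (done < n) {
        cycle++;
        int issued = 0;
        /* Scan in program order and issue whatever is ready. */
        for (size_t i = 0; i < n && issued < width; i++) {
            if (cycle_out[i] != 0)
                continue;          /* already issued */
            int d = dep[i];
            /* Ready if independent, or producer issued in an earlier cycle. */
            if (d < 0 || (cycle_out[d] != 0 && cycle_out[d] < cycle)) {
                cycle_out[i] = cycle;
                issued++;
                done++;
            }
        }
    }
    return cycle;
}
```

With the six instructions numbered 0 to 5 in program order, `dep = {-1, 0, -1, -1, -1, 4}` (adc needs the value loaded by mov, jnz needs the flags from dec) and `width = 3`, this model produces the same 3-cycle ordering as the listing above. With `width = 1` it takes 6 cycles, one per instruction.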
In any case, this could explain how it's possible to get 3 cycles. As for why you sometimes get 10 cycles, there could be a ton of reasons for that: branch misprediction, some random pipeline bubble...
%ecx starts off as 1. But assuming you mean per iteration: such a question isn't all that easy to answer. Are cache misses involved (likely for any measurable number of iterations)? If so, how far do these misses go (L2, L3 or main memory (or disk))? How many iterations does your loop typically take? Is that number constant (branch prediction)? How is that code block aligned?... (some points are unlikely to make an impact, but the point remains). But the main question is: why do you care? Does the exact tick count (for that specific architecture) matter? – Charming
// needed values in proper registers
clock_gettime(CLOCK_MONOTONIC, &t1);
[mov 100 to cx]
[code from the question]
clock_gettime(CLOCK_MONOTONIC, &t2);
// 32/10 - my cpu is 3.2 GHz
printf("%ld\n", (t2.tv_nsec - t1.tv_nsec)*32/10);
the output typically is 316, 310, 301 or 1120... – Peeples
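For reference, the nanoseconds-to-cycles arithmetic in the timing snippet above could be sketched as follows. This is only an illustration: the helper names `elapsed_ns` and `ns_to_cycles` are made up here, and the 3.2 GHz factor is the commenter's stated clock speed, not something the code can verify. Unlike the one-line version, the sketch also accounts for `tv_sec`, so the result stays correct when the two timestamps straddle a second boundary.

```c
#include <stdint.h>
#include <time.h>

/* Elapsed time between two timestamps in nanoseconds.
 * Includes tv_sec, so it does not go negative when tv_nsec wraps. */
int64_t elapsed_ns(struct timespec t1, struct timespec t2)
{
    return (int64_t)(t2.tv_sec - t1.tv_sec) * 1000000000LL
         + (int64_t)(t2.tv_nsec - t1.tv_nsec);
}

/* Convert nanoseconds to cycles, ASSUMING a fixed 3.2 GHz clock
 * (3.2 cycles per nanosecond, written as 32/10 to stay in integers). */
int64_t ns_to_cycles(int64_t ns)
{
    return ns * 32 / 10;
}
```

Under these assumptions, a reading of about 310 for 100 iterations corresponds to roughly 3 cycles per iteration, and 1120 to roughly 11. Note that the measured interval also includes the overhead of the `clock_gettime` calls themselves, and the actual core frequency may differ from the nominal one.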