Why is ONE basic arithmetic operation in a for loop body executed SLOWER THAN TWO arithmetic operations?

While experimenting with measuring the execution time of arithmetic operations, I came across some very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was consistently executed slower than an otherwise identical code block with two arithmetic operations in the loop body. Here is the code I ended up testing:

#include <iostream>
#include <chrono>

#define NUM_ITERATIONS 100000000

int main()
{
    // Block 1: one operation in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();

        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    // Block 2: two operations in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();

        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    return 0;
}

I tested this with different levels of code optimization (-O0,-O1,-O2,-O3), with different online compilers (for example onlinegdb.com), on my work machine, on my home PC and laptop, on a RaspberryPi, and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), and changed integer types. But I always got similar results: the block with one line in the loop is SLOWER than the block with two lines:

1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000

I checked the assembly output on https://godbolt.org/ and everything looked as I expected: the second block just had one more instruction in the assembly output.

Three operations always behaved as expected: they are slower than one and faster than four. So why do two operations produce such an anomaly?

Edit:

Let me repeat: I see this behaviour on all of my Windows and Unix machines when the code is not optimized. I looked at the assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. Anyway, if the loop is optimized away, there is nothing left in the code that I am asking about. I added the note about optimizations to the question to avoid "do not measure non-optimized code" answers, because optimization is not what I am asking about. The question is really why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in execution time is 5-25% in my tests (quite noticeable).

Fabian answered 29/5, 2020 at 15:17 Comment(43)
Can't reproduce on Quickbench.Schoolfellow
@geza Why would it take 0.3s to read two timepoints? Godbolt shouldn't be this slow.Gardy
There's some optimization in @FrançoisAndrieux link. I don't see any loops in the main section. Usually, in unoptimized code, you would see the instructions of the for loop overhead.Willettewilley
@Oliort Could you switch the order of loops and repeat your measurements again? Ie. first execute the loop with two variables, then with one.Millsap
@KamilCuk, yes, as I wrote "I rearranged these two code blocks", but the results are similar.Fabian
Write two programs and test them. Testing one loop against another in the same program/thread is not likely to produce the expected results. Your code has no warm-ups, so you are failing to account for instruction/data read-ahead/cache effects. You also have two loops referencing the same variable, which is bound to engage some processor pipe-lining shenanigans.Empty
Why the change from 31 to 17? With all the causes you have carefully excluded, I am wondering if this is somehow data-dependent. I would be interested in whether the odd behavior remains when both tests change x by the same amount.Metic
I've confirmed on both quickbench and godbolt that at -O0, with tests alternating, the one with two variables consistently executes faster, despite no surprises in the generated code. There's definitely pipeline, caching, or data shenanigans somewhere.Trundle
@Metic , as I wrote, I "changed constants, changed operations", but the results were similar.Fabian
I would also suggest a token use of the now() method before the first test. Maybe there is undocumented overhead in the first use.Metic
@jwdonahue, done. Similar results.Fabian
@Oliort, yes, I did carefully read that you changed constants. What remained unclear is that your tests matched constants between tests 1 & 2.Metic
@donjuedo, about now(): but if I rearrange the first block with the second, the measurement results still say the same thing: the "two operations loop" is faster. So I don't think now() has that much influence, if any.Fabian
On which CPUs did you reproduce the issue? Would you be able to reproduce this with a C program? Something like this.Millsap
"I tested this with different levels of code optimization ... I always got similar result" vs. "Let me repeat: I have such behaviour on all of my Windows and Unix machines with code not optimized.". Which is true? And again, measuring performance of non-optimized builds doesn't make too much sense for current CPUs.Antiquary
@Millsap just tested this on an Intel Core i7 5500, and it doesn't reproduce with your code (your second loop is slower than the first and third). My code was tested on this same CPU too. Hmmm, what could that mean?Fabian
It's possible the two op loop triggers a pipe-lining feature that the single op loop does not. It's also possible the differences in increment sizes are a factor, though I have no theory as to why it would make a difference.Empty
@Empty I tried this with ++ and -- operators instead, it behaved the same way. About pipe-lining: how could we test that to know for sure?Fabian
Here's a theory: The two op loop has twice as many hits on the same cache line as the one op loop. It just might be that the loops are being interrupted by OS kernels and it happens that the cache line is more likely to be intact on the next execution of the thread.Empty
You are undoubtedly measuring effects that are difficult to quantify. I know the Windows kernel has some perf-counters that would probably be relevant here, and I would be surprised if Linux doesn't also, since they are grounded in actual hardware counters. You'd need to run thousands of tests and collect cache hit/miss and context switches, in order to find the correlation.Empty
It's been about 10 years since I did any seriously detailed perf testing in Windows kernel. I seem to recall that you can accumulate CPU stall "times" in counter ticks. You also need to either lock-down the processor clock speed, or account for its variability as the CPU frequency is adjusted to prevent overheating. And then there are the cases where the load is simply shifted to a colder core able to run at full-speed for a while.Empty
Higher load threads often share a core with fewer other threads, while low load threads are often scheduled on cores just to even-out the temperature on the die. Hardware and kernel interact in ways that might seem unpredictable, without deep knowledge of the kernel, hardware architecture and BIOS/kernel configuration settings. Because of the role that temperature and humidity might be playing in your scenario, I would not be surprised if you failed to reproduce the effect you are seeing, next winter.Empty
@Empty I actually saw it more than a year ago (maybe other season =) ). But this came to my mind today, I tested again and this time I was too interested not to ask.Fabian
My friend tested my code on his MacBook. The same behaviour.Fabian
The concerns about why we're benchmarking unoptimized code might be reduced by skipping the source and compilation, and asking why adding one assembly instruction in a simple-implementation loop on various hardware gives the results.Affair
This can be reproduced with optimized builds, with volatile variables: quick-bench.com/9vB3Rm2q5xvrmFqg6P7Y2141RQ0 . I wonder whether this is an Intel only thing, or it happens on AMD as well?Antiquary
@KorelK, casting to chrono::nanoseconds didn't change anything, checked 10 times on 3 different Intel CPU computers. The behaviour for <double> is well documented: If Rep is floating point, then the duration can represent fractions of ticks. No need to check nanoseconds; the difference I see is about 10% of execution time, so for a ~10 second run the difference is about 1 second.Fabian
@Taekahn, also please increase number of iterations 10-100 times for the results to be more accurate.Fabian
@Taekahn, yeah I also reproduced different results with online compilers, but if I increase number of iterations, the results are completely in favour of what is reported in the question. Did you try with #define NUM_ITERATIONS 10000000000? Check that please. Here is the link: onlinegdb.com/SJhCTqznI. Also results are consistent for my hardware even without that change.Fabian
I tried it locally as you asked; I had to chop off that extra zero because it was taking more than half an hour to complete and I had no idea how long to expect it to run. Anyway, with 1 billion iterations (instead of 10 billion), the single increment loop is performing faster on my machine every time. This is using MSVC as I don't have anything else installed on this computer.Golfer
@Golfer what is your CPU?Fabian
Intel Core i7 6700K @ 4 GHz. Windows 10. MSVC 2017 v141.Golfer
@MooingDuck: This result is expected for gcc -O0 on Intel Sandybridge-family. Adding a redundant assignment speeds up code when compiled without optimization. But the OP claims to see it with optimization, which surprises me. With good uop scheduling, loop counter + 1 or 2 other ALU operations per iteration should run at 1 cycle / iteration on modern x86. The question doesn't say which build options and CPU model the times in the question are from. :(Lw
@Oliort: If you're going to include a Godbolt link, link to your code on Godbolt. (Use the "share" button at the top right). And please make it clear if the speed ratio between the loops changes depending on CPU, and what CPU + GCC options the times are from.Lw
@PeterCordes, I tested this on different Intel CPUs (Core i5, Core i7). Also I can't recheck that now, but I remember I reproduced it on a RaspberryPi. About Godbolt: I used -O0. About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure. I just ran it several times with the code in the question (and yes, it reproduced) and included a note in the post. About the difference depending on CPU: I didn't measure it too precisely, but the difference looked very similar everywhere.Fabian
@PeterCordes, about "Adding a redundant assignment speeds up code when compiled without optimization": Will it work the same for volatiles and optimized builds?Fabian
i5 or i7 only slightly narrows down what microarchitecture you used, only that it's Intel (not AMD), and Nehalem or newer. So something since 2009 or so. For example, i7-6700k is Skylake, which is different from i7-2xxx Sandybridge.Lw
@Oliort: Yes, volatile to force the variable into memory will make similar asm with optimization enabled, resulting in the same store-forwarding effect. Same for Google Benchmark's benchmark::DoNotOptimize(x1 += 31) ; unfortunately that de-optimizes it all the way to memory, not a register.Lw
I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure. So you didn't actually reproduce a measurable effect with optimization enabled. You just got some noise from startup overhead around an empty timed region and assumed it would be the same as with -O0. Asm from -O0 usually has very different bottlenecks than -O2 or -O3 in code that doesn't optimize away.Lw
Sounds like your question is fully a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, and Idiomatic way of performance evaluation? for the benchmarking methodology error with optimization enabled: if time doesn't scale with the repeat count, you're not measuring anything.Lw
@PeterCordes last time I ran this it was an i7 5500. I'll recheck the rest of the computers as soon as I reach them. About duplicate: it seems so. Anyway I'm very glad I asked the question, as it was very educational and interesting and I was not able to find the answer myself. Thanks.Fabian
@PeterCordes i7 5500 is Broadwell microarchitecture as far as I understand. Does it have the same bottleneck?Fabian
Yes, all pipelined CPUs with a store buffer have latency bottlenecks when you store / reload, that's part of why you shouldn't benchmark -O0 code. But yes, Broadwell is part of Sandybridge-family (i7-2xxx and later, including modern Skylake-derived uarches) and has that funky variable-latency store forwarding where it can be faster if you don't try to load too soon.Lw

This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into the loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.

On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.
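
As noted in the comments under the question, the effect can be reproduced with optimization enabled by using volatile to force the variables into memory, which produces the same store/reload pattern as -O0. Here is a minimal standalone sketch (my illustration; whether and by how much the two-operation version wins depends on the exact CPU):

#include <chrono>
#include <cstdio>

int main() {
    const long N = 100000000;

    // One operation: a single store/reload chain through x.
    volatile long x = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; i++) { x = x + 31; }
    auto t1 = std::chrono::steady_clock::now();

    // Two operations: the work on y2 delays the reload of x2 relative to its
    // store, which is the faster store-forwarding case on Sandybridge-family.
    volatile long x2 = 0, y2 = 0;
    auto t2 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; i++) { x2 = x2 + 17; y2 = y2 - 37; }
    auto t3 = std::chrono::steady_clock::now();

    std::printf("one op:  %.4f s\n", std::chrono::duration<double>(t1 - t0).count());
    std::printf("two ops: %.4f s\n", std::chrono::duration<double>(t3 - t2).count());
}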

This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.

This is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.

Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
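
To make the latency vs. throughput distinction concrete, here is a rough GNU-C sketch (my illustration, not part of the original answer; iteration counts are arbitrary). Integer multiply has about 3-cycle latency but 1-per-cycle throughput on typical recent Intel/AMD cores, so a serial chain of multiplies is latency-bound while independent multiplies are throughput-bound; the second loop below retires three times as many multiplies in roughly the same wall time:

#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t N = 300000000;
    uint64_t k = 3;
    asm volatile("" : "+r"(k));            // hide the multiplier so the compiler emits a real imul

    uint64_t x = 1;
    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < N; i++) {     // latency-bound: each multiply waits for the previous one
        x *= k;
        asm volatile("" : "+r"(x));
    }
    auto t1 = std::chrono::steady_clock::now();

    uint64_t a = 1, b = 1, c = 1;
    for (uint64_t i = 0; i < N; i++) {     // throughput-bound: three independent multiplies per iteration
        a *= k; b *= k; c *= k;
        asm volatile("" : "+r"(a), "+r"(b), "+r"(c));
    }
    auto t2 = std::chrono::steady_clock::now();

    std::printf("dependent:   %.3f s for %llu multiplies\n",
                std::chrono::duration<double>(t1 - t0).count(), (unsigned long long)N);
    std::printf("independent: %.3f s for %llu multiplies\n",
                std::chrono::duration<double>(t2 - t1).count(), (unsigned long long)(3 * N));
}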

See @rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.

With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it does just stay in a register. Unfortunately @SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short-NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower-latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as the desktop versions of the same generation.)


Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.


I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result

I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.

(later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.

So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.

See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
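
A minimal version of that sanity check (a sketch, using the same GNU-C asm barrier as later in this answer; the counts are arbitrary): time the same loop at several repeat counts and confirm that the time scales with the count.

#include <chrono>
#include <cstdint>
#include <cstdio>

// Time the loop under test for a given repeat count.
static double time_loop(uint64_t iters) {
    uint64_t x = 0;
    auto start = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iters; i++) {
        x += 31;
        asm volatile("" : "+r"(x));        // keep the work from being optimized away
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

int main() {
    // If you are really measuring the loop, doubling the count should roughly double the time.
    for (uint64_t n = 100000000; n <= 800000000; n *= 2)
        std::printf("%10llu iterations: %.4f s\n", (unsigned long long)n, time_loop(n));
}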

I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.


The loop with optimization enabled:

As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:

orig_main:
   ...
        call    std::chrono::_V2::system_clock::now()       # demangled C++ symbol name
        mov     rbp, rax                                    # save the return value = start
        call    std::chrono::_V2::system_clock::now()
        # end in RAX

So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.

With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:

#include <benchmark/benchmark.h>

static void TargetFunc(benchmark::State& state) {
   uint64_t x2 = 0, y2 = 0;
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    benchmark::DoNotOptimize(x2 += 31);
    benchmark::DoNotOptimize(y2 += 31);
  }
}
// Register the function as a benchmark
BENCHMARK(TargetFunc);
# just the main loop, from gcc10.1 -O3 
.L7:                         # do{
        add     rax, 31        # x2 += 31
        add     rdx, 31        # y2 += 31
        sub     rbx, 1
        jne     .L7          # }while(--count != 0)

I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.
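
For reference, a hand-rolled stand-in for GNU-compatible compilers might look like the sketch below. This is not Google Benchmark's actual implementation, just the idea, and it only works for values that fit in a register:

// Sketch of a DoNotOptimize-style helper (not Google Benchmark's real code).
// The empty asm statement forces the compiler to materialize `value` and to
// assume the asm may have modified it, blocking constant-propagation and CSE.
template <typename T>
inline void do_not_optimize(T& value) {
    asm volatile("" : "+r"(value));   // "+r" only: keep it in a register; "+r,m" would let clang pick memory
}

With only the "+r" alternative the variable stays in a register, avoiding the store/reload bottleneck discussed above.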

For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.

static void TargetFunc(benchmark::State& state) {
   uint64_t x2 = 0, y2 = 0;
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
      x2 += 16;
      y2 += 17;
    asm volatile("" : "+r"(x2), "+r"(y2));
  }
}

All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see @rcgldr's answer.

Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.

You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g. x + (y<<2) + 4 can compile to a single LEA instruction for x86.
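
Two tiny illustrations of that context dependence (what current GCC/clang typically emit at -O2; exact output depends on compiler and options):

#include <cstdint>

int64_t f(int64_t x, int64_t y) {
    return x + (y << 2) + 4;   // typically a single instruction: lea rax, [rdi + rsi*4 + 4]
}

int64_t g(int64_t n) {
    int64_t x = 0;
    for (int64_t i = 0; i < n; i++)
        x += 31;               // typically no loop at all: final-value replacement to x = 31 * n (guarded for n <= 0)
    return x;
}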


The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away

TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.

Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).

Lw answered 4/6, 2020 at 0:51 Comment(0)

ETA: This was a guess, and Peter Cordes has made a very good argument about why it's incorrect. Go upvote Peter's answer.

I'm leaving my answer here because some found the information useful. Though this doesn't correctly explain the behavior seen in the OP, it highlights some of the issues that make it infeasible (and meaningless) to try to measure the speed of a particular instruction on a modern processor.


Educated guess:

It's the combined effect of pipelining, powering down portions of a core, and dynamic frequency scaling.

Modern processors pipeline so that multiple instructions can be executing at the same time. This is possible because the processor actually works on micro-ops rather than the assembly-level instructions we usually think of as machine language. Processors "schedule" micro-ops by dispatching them to different portions of the chip while keeping track of the dependencies between the instructions.

Suppose the core running your code has two arithmetic/logic units (ALUs). A single arithmetic instruction repeated over and over requires only one ALU. Using two ALUs doesn't help because the next operation depends on completion of the current one, so the second ALU would just be waiting around.

But in your two-expression test, the expressions are independent. To compute the next value of y, you do not have to wait for the current operation on x to complete. Now, because of power-saving features, that second ALU may be powered down at first. The core might run a few iterations before realizing that it could make use of the second ALU. At that point, it can power up the second ALU and most of the two-expression loop will run as fast as the one-expression loop. So you might expect the two examples to take approximately the same amount of time.

Finally, many modern processors use dynamic frequency scaling. When the processor detects that it's not running hard, it actually slows its clock a little bit to save power. But when it's used heavily (and the current temperature of the chip permits), it might increase the actual clock speed as high as its rated speed.

I assume this is done with heuristics. In the case where the second ALU stays powered down, the heuristic may decide it's not worth boosting the clock. In the case where two ALUs are powered up and running at top speed, it may decide to boost the clock. Thus the two-expression case, which should already be just about as fast as the one-expression case, actually runs at a higher average clock frequency, enabling it to complete twice as much work in slightly less time.

Given your numbers, the difference is about 14%. My Windows machine idles at about 3.75 GHz, and if I push it a little by building a solution in Visual Studio, the clock climbs to about 4.25GHz (eyeballing the Performance tab in Task Manager). That's a 13% difference in clock speed, so we're in the right ballpark.
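
If frequency ramp-up is suspected in a test like the OP's, one low-tech mitigation (my sketch, independent of whether this hypothesis is correct) is to spin a warm-up loop before the first timed region, and to interleave and repeat the two blocks several times:

#include <chrono>

// Busy-spin for ~200 ms before timing, so the first timed region doesn't also
// pay for frequency ramp-up, cold caches, lazy dynamic linking, etc.
static void warm_up() {
    volatile long sink = 0;                            // volatile so the spin isn't optimized away
    auto start = std::chrono::steady_clock::now();
    while (std::chrono::steady_clock::now() - start < std::chrono::milliseconds(200))
        sink = sink + 1;
}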

Deadbeat answered 1/6, 2020 at 17:14 Comment(20)
Very well written answer. I have a general understanding of processor pipelines but I had never heard of dynamic frequency scaling. Thank you.Handout
So.. it could be proven when an OS (or BIOS) disables frequency scaling. So, would something along the lines of echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor make a difference in measurements?Millsap
The case can be reproduced with fixed frequency, so it is not caused by frequency scaling. "So you might expect the two examples to take approximately the same amount of time.". It doesn't take the same amount of time, but the two operations version is faster.Antiquary
@geza: As I said, it's just an educated guess that seems to fit with the facts. If you have a repro showing that the two-expression loop is faster on a fixed frequency processor or another hypothesis that can explain the observations, please share.Deadbeat
I can reproduce it on my machine with fixed frequency. But even, without fixed frequency, if your theory was correct, then changing the order of the test should change which version is faster. But it doesn't change. quick-bench can reproduce it as well: quick-bench.com/Qu1l1gOrIlfyd_z9BQcxrw97YSUAntiquary
At some optimization level, I'd expect the compiler to replace the loops with simple assignments, in the first loop just x = 3100000000, making the whole explanation irrelevant.Burned
@AdrianMcCarthy: Good guess, but no. Low ILP is not a reason for clocking down on current x86 CPUs. If anything, it would be a reason for clocking even higher, because some pipeline stages are making less heat than they otherwise might.Lw
Memory bottlenecks will sometimes get Skylake CPUs to downclock themselves, though. When a Skylake core spends most of its time waiting for data from off-core (L3 or DRAM), it will decide it can maybe downclock without hurting performance much. (This isn't always true; unlike DRAM, uncore / L3 runs on the same clock as the cores so its latency/bandwidth scales.) But the OP's code is just modifying registers. Or if unoptimized, store/reload that hits in the store buffer. (Variable-latency store-forwarding is probably the answer).Lw
And BTW, the loop has a loop counter, so it's 2 vs. 3 ALU ports busy every clock cycle if this compiles naively without unrolling or auto-vectorization like gcc -O1 or -O2. The loop runs long enough that ramp-up to max turbo should happen pretty quickly, and only be a small part of the extra time for the first loop. So the loop tested first can be slower because of insufficient warm-up.Lw
I didn't explicitly mention before, but full hardware control of CPU frequency was new in Skylake. Before that, OS chose frequency other than turbo. Skylake may have introduced more fine-grained monitoring of what the code is doing. The OS only knows that the CPU is busy. Still, downclocking while working on loops with store-forwarding or just ALU latency bottlenecks would not make any sense; IPC won't improve at lower clocks so it's just pure slowdown.Lw
@Peter Cordes: Fair enough. It is just a guess, but I still don't see another hypothesis that matches well with the observations listed in the OP. Frequency scaling is tuned to save power, which is strongly correlated with speed (run fast then power down). But power consumption goes superlinearly with frequency, so it seems plausible there could be cases where increasing the clock speed isn't deemed worthwhile unless the core is already well utilized. The OP asks how to measure the speed of a particular instruction (thus /O0), and the issues here show why that's hard and not useful.Deadbeat
Turns out the OP only sees this effect at -O0, so as I suspected it's a duplicate of Adding a redundant assignment speeds up code when compiled without optimization for Sandybridge-family CPUs. The claims of seeing it at -O1 and higher were just confirmation bias from measuring an empty timed region when the loop optimized away.Lw
I wrote up the details in an answer to this question, in case you're interested.Lw
Yes, CPUs do dynamic voltage as well, running at the minimum safe level, so higher frequency means higher voltage, with dynamic power scaling with f^3. (Rather than just f at constant voltage). But it also scales with the number of transistors actually switching every clock cycle, which is lower in low-throughput code. Code that's not waiting for memory will take a more or less constant number of clock cycles to finish.Lw
It makes no sense to only turbo to max when running power-intensive or high-IPC workloads; latency-bound workloads benefit just as much and you can sustain higher boost clocks longer. The damage was done at compile time, and it's the CPUs job to run clunky code as fast as it can, too. Having the CPU slow down as well would add insult to injury and be the opposite of sensible, unless the code was actually spending time waiting for cache misses from off core (in a different clock domain that would work just as fast if the core was slower).Lw
BTW, you can measure performance of a single instruction on modern CPUs. But it's not a single "speed" number so you could argue that's a correct statement. Performance is fully characterized for most instructions by fused-domain uops (front-end cost), back-end ports the uop(s) can run on, and latency of those uops. uops.info (Normal ALU uops other than div/idiv are fully pipelined; a new one can start every cycle. Some exceptions like multiply on AMD before Zen, and SIMD on low-power Silvermont.)Lw
Also, didn't notice this before: no x86 CPUs I'm aware of actually power down execution ports for low-ILP code. Powering up a block of logic is quite slow and produces wild voltage swings on outputs, so you'd have to halt the clock while powering up again. It's only worth doing if the core is going into a sleep state because it takes tens of thousands of clock cycles. What does happen is clock-gating idle logic; that can be ungated without any delay. You still get leakage current but no dynamic (switching) current.Lw
Powering down some ports wouldn't work well: OoO exec is usually bursty after a cache miss. And not all ALUs are equal. For example, on Sandybridge-family, only port 1 can run integer uops with latency = 3 (like imul, slow-LEA, popcnt, bsf, etc.) Only port 6 can run taken branch uops. Only port 0 can run movd r32, xmm. Only port 5 can run SIMD shuffles, movd xmm, r32. Integer shifts only run on p0 / p6. LEA only runs on p1 / p5. Also, in your analysis, you're counting the OP's loops as 1 or 2 operations. But actually it's 2 or 3; don't forget the loop counter.Lw
On Skylake-AVX512 (SKX and Cascade Lake), having any 512-bit uops in flight means the vector ALUs on port 1 are shut down (clock gates, probably not literally voltage cut). It can still run scalar integer stuff, but no X/Y/ZMM instructions. And when the current turbo frequency is above the "license" level for some kinds of instructions, the scheduler limits their throughput. We used to think this was some kind of "power up the high half of execution units", but I think intentional throttling to avoid a turbo frequency shift for a short loop using AVX2 or AVX512 is more likely.Lw
BeeOnRope explains in SIMD instructions lowering CPU frequency. The earlier "warm up upper 128" / power-down theory was proposed by Agner Fog to explain his test results for that warm-up effect on Skylake: agner.org/optimize/blog/read.php?i=415Lw

I split up the code into C++ and assembly. I just wanted to test the loops, so I didn't return the sum(s). I'm running on Windows, the calling convention is rcx, rdx, r8, r9, the loop count is in rcx. The code is adding immediate values to 64 bit integers on the stack.

I'm getting similar times for both loops: less than 1% variation, with either one up to 1% faster than the other.

There is an apparent dependency factor here: each add to memory has to wait for the prior add to the same memory location to complete, so two adds to different memory locations can proceed essentially in parallel.

Changing test2 to do 3 adds to memory ends up about 6% slower; 4 adds to memory, 7.5% slower.

My system is Intel 3770K 3.5 GHz CPU, Intel DP67BG motherboard, DDR3 1600 9-9-9-27 memory, Win 7 Pro 64 bit, Visual Studio 2015.

        .code
        public  test1
        align   16
test1   proc
        sub     rsp,16
        mov     qword ptr[rsp+0],0
        mov     qword ptr[rsp+8],0
tst10:  add     qword ptr[rsp+8],17
        dec     rcx
        jnz     tst10
        add     rsp,16
        ret     
test1   endp

        public  test2
        align 16
test2   proc
        sub     rsp,16
        mov     qword ptr[rsp+0],0
        mov     qword ptr[rsp+8],0
tst20:  add     qword ptr[rsp+0],17
        add     qword ptr[rsp+8],-37
        dec     rcx
        jnz     tst20
        add     rsp,16
        ret     
test2   endp

        end

I also tested with add immediate to register, 1 or 2 registers within 1% (either could be faster, but we'd expect them both to execute at 1 iteration / clock on Ivy Bridge, given its 3 integer ALU ports; What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?).

3 registers, 1.5 times as long, somewhat worse than the ideal 1.333 cycles / iteration from 4 uops (including the loop counter macro-fused dec/jnz) for 3 back-end ALU ports with perfect scheduling.

4 registers, 2.0 times as long, bottlenecked on the front-end: Is performance reduced when executing loops whose uop count is not a multiple of processor width?. Haswell and later microarchitectures would handle this better.

        .code
        public  test1
        align   16
test1   proc
        xor     rdx,rdx
        xor     r8,r8
        xor     r9,r9
        xor     r10,r10
        xor     r11,r11
tst10:  add     rdx,17
        dec     rcx
        jnz     tst10
        ret     
test1   endp

        public  test2
        align 16
test2   proc
        xor     rdx,rdx
        xor     r8,r8
        xor     r9,r9
        xor     r10,r10
        xor     r11,r11
tst20:  add     rdx,17
        add     r8,-37
        dec     rcx
        jnz     tst20
        ret     
test2   endp

        public  test3
        align 16
test3   proc
        xor     rdx,rdx
        xor     r8,r8
        xor     r9,r9
        xor     r10,r10
        xor     r11,r11
tst30:  add     rdx,17
        add     r8,-37
        add     r9,47
        dec     rcx
        jnz     tst30
        ret     
test3   endp

        public  test4
        align 16
test4   proc
        xor     rdx,rdx
        xor     r8,r8
        xor     r9,r9
        xor     r10,r10
        xor     r11,r11
tst40:  add     rdx,17
        add     r8,-37
        add     r9,47
        add     r10,-17
        dec     rcx
        jnz     tst40
        ret     
test4   endp

        end
Provisional answered 1/6, 2020 at 19:22 Comment(4)
This is simulating the un-optimized code, with memory-destination adds. Optimizing the vars into registers like gcc -O1 or higher would remove store-forwarding bottlenecks. The -O0 case would probably be a duplicate of Adding a redundant assignment speeds up code when compiled without optimizationLw
@PeterCordes - I tested that as well (add immediate to register instead of to memory), similar result. I updated my answer to show those examples.Provisional
Your Ivy Bridge CPU has 3 ports that can run integer ALU uops. It should run 2x add and a macro-fused dec/jnz at 1/clock. So that explains identical performance for both loops. IDK why you didn't see a difference with the memory version. But with registers, adding a 3rd add should bottleneck on the back-end, averaging 1.33c per iteration. Adding a 4th add (total of 5 uops) should bottleneck on the front-end, slowing down to 2c per iteration, unlike HSW: Is performance reduced for loops whose uop count is not a multiple of processor width?Lw
@PeterCordes - the 3 register case ended up effectively 1.5 c per iteration, 4 was 2.0c per iteration. For the add to memory case, I'm thinking the bottleneck would be cache / memory write times. I have an Ivy Bridge CPU, but a Sandy Bridge motherboard (DP67BG).Provisional

@PeterCordes proved this answer to be wrong in many assumptions, but it could still be useful as a blind research attempt at the problem.

I set up some quick benchmarks, thinking it may somehow be connected to code memory alignment, truly a crazy thought.

But it seems that @Adrian McCarthy got it right with the dynamic frequency scaling.

Anyway, the benchmarks show that inserting some NOPs could help with the issue, with 15 NOPs after the x+=31 in Block 1 leading to nearly the same performance as Block 2. Truly mind-blowing how 15 NOPs in a single-instruction loop body increase performance.

http://quick-bench.com/Q_7HY838oK5LEPFt-tfie0wy4uA
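
(The NOPs can be injected with GNU inline asm along these lines; this is a reconstruction of the idea, not the exact quick-bench source.)

#include <cstdint>

// One-operation loop body followed by 15 one-byte NOPs. Each nop does no work
// but still has to be fetched and decoded.
uint64_t one_op_with_nops(long iterations) {
    uint64_t x = 0;
    for (long i = 0; i < iterations; i++) {
        x += 31;
        asm volatile("nop; nop; nop; nop; nop;"
                     "nop; nop; nop; nop; nop;"
                     "nop; nop; nop; nop; nop");
    }
    return x;
}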

I also tried -Ofast, thinking compilers might be smart enough to throw away the inserted NOPs, but that seems not to be the case. http://quick-bench.com/so2CnM_kZj2QEWJmNO2mtDP9ZX0

Edit: Thanks to @PeterCordes it was made clear that optimizations were never working quite as expected in the benchmarks above (the global variables forced the add instructions to access memory). The new benchmark http://quick-bench.com/HmmwsLmotRiW9xkNWDjlOxOTShE clearly shows that Block 1 and Block 2 performance is equal for stack variables. NOPs could still help in a single-threaded application with a loop accessing a global variable, but you probably should not do that in the first place; just work on a local variable and assign it back to the global after the loop.

Edit 2: Actually, the optimizations never worked because the quick-bench macros make the variable access effectively volatile, preventing important optimizations. It is only logical to load the variable once, as we only modify it in the loop, so it is the volatile access (or disabled optimizations) that is the bottleneck. So this answer is basically wrong, but at least it shows how NOPs could speed up unoptimized code execution, if that makes any sense in the real world (there are better ways, like bucketing counters).
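
A sketch of the "bucketing counters" idea mentioned above (my example, not from the benchmarks in this answer): split one hot counter into several accumulators so consecutive increments don't all serialize through the same variable, then combine them once outside the hot loop.

#include <cstddef>
#include <cstdint>

// Count even values using four independent accumulators instead of one.
uint64_t count_even(const uint32_t* data, size_t n) {
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c0 += (data[i + 0] % 2 == 0);
        c1 += (data[i + 1] % 2 == 0);
        c2 += (data[i + 2] % 2 == 0);
        c3 += (data[i + 3] % 2 == 0);
    }
    for (; i < n; i++)         // handle the leftover elements
        c0 += (data[i] % 2 == 0);
    return c0 + c1 + c2 + c3;  // combine the buckets once, after the loop
}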

Buff answered 1/6, 2020 at 18:4 Comment(10)
Usually you insert NOPs before a loop, not inside it, to align the start. And you'd use 1 or 2 long NOPs, up to 15 bytes each, not multiple short NOPs that each have to decode separately; that's testing the front-end and uop cache. (Or to align the end of the loop, on CPUs with the microcode workaround for Intel's JCC erratum, leading to slowdowns if a macro-fused JCC touches a 32-byte boundary: 32-byte aligned routine does not fit the uops cache). And BTW, -Ofast for GCC/clang is just -O3 -ffast-math.Lw
Using benchmark::DoNotOptimize(x1 += 31) forces x to be stored / reloaded from memory even with optimization. (godbolt.org/z/ajs_7M is simplified from your QuickBench link). That explains why so many NOPs don't make much difference: they can execute out-of-order, hidden by the latency of store-forwarding. Your version is a duplicate of Adding a redundant assignment speeds up code when compiled without optimization - Intel Sandybridge-family CPUs have variable-latency store forwarding that's faster if you don't try to reload too soon.Lw
@PeterCordes While I agree with you, quick-bench showed the assembly with addq, not loading from memory, so I hope it is a Godbolt problem, and not quick-bench lying about its asm :) Sadly I know little about the front-end and uop cache; I used multiple short NOPs to go from 1 NOP to as high as it makes sense performance-wise, so in my low-educated tooling it was the go-to measurement unit :) Yep, I suppose my answer is a duplicate, but specifics differ: optimizations and opcodeBuff
I get an "Error or timeout" from "Record disassembly" on QuickBench for your link; Godbolt is the only option. Where did you see something other than add qword ptr [rip + x2], 31 for the inner loop?Lw
I don't mean you answer is a duplicate, I mean the asm created by your benchmark has the same bottleneck as being discussed there. (And same as this question's -O0 version. But it turns out there wasn't a real effect at -O1 or higher, the OP was making that up. See my answer on this question for a more coherent write-up of what's going on, my comments were a bit scattered.)Lw
I was able to get QB asm by removing some of the functions (quick-bench.com/PyBaTT7vfcdKZRFHT8kEzzeh1oE). It's identical to Godbolt, but in AT&T syntax. Notice the addq $0x1f,0x396b8(%rip) # 249850 <x1> instruction before the nop. That's a memory destination (to a global variable because you made them global for some crazy reason). The add $0xffffffffffffffff,%rbx / jne at the bottom of the loop is the loop counter. Is that what you were looking at before?Lw
Yes, @PeterCordes, it is indeed memory access being the reason for optimizations not working as expected. I have updated the answer and added new benchmark showing equal performance no matter the NOPs! It is still funny how memory access case could be sped up by NOPs.Buff
NOPs leading to a speedup only happens with volatile or un-optimized code. In normal optimized code it's fairly rare to have repeated store/reload to the same address. And usually the effect only happens on Sandybridge-family Intel CPUs, not in general. It's definitely not something that's useful for tuning normal code. If it's worth messing around with hand-tuning asm on this scale, there are almost always other tricks you can use instead. (e.g. for a histogram, having multiple arrays of counters to distribute repeated runs of the same value across separate buckets)Lw
It's not global vs. local variable that's causing benchmark::DoNotOptimize on clang to use actual memory instead of just a register. We see the same thing with local vars: godbolt.org/z/iSF-gw. Note the add qword ptr [rsp + 8], 31. Only the addressing mode is different: relative to RSP (the stack pointer), instead of global. Your updated QB link with locals has a mov load, register add, mov store, for Func15. IDK why clang thought that was a good idea, but it doesn't change anything. Memory-destination add decodes to those load+add+store uops anyway.Lw
@PeterCordes I have updated the answer to prevent misinformation, very good comments sir!Buff

Processors are so complex these days that we can only guess.

The assembly emitted by your compiler is not what is really executed. The microcode/firmware/whatever of your CPU will interpret it and turn it into instructions for its execution engine, much like JIT languages such as C# or Java do.

One thing to consider here is that for each loop there are not 1 or 2 instructions, but n + 2, as you also increment i and compare it to your number of iterations. In the vast majority of cases it wouldn't matter, but here it does, as the loop body is so simple.

Let's see the assembly:

Some defines:

#define NUM_ITERATIONS 1000000000ll
#define X_INC 17
#define Y_INC -31

C/C++:

for (long i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }

ASM:

    mov     QWORD PTR [rbp-32], 0
.L13:
    cmp     QWORD PTR [rbp-32], 999999999
    jg      .L12
    add     QWORD PTR [rbp-24], 17
    add     QWORD PTR [rbp-32], 1
    jmp     .L13
.L12:

C/C++:

for (long i = 0; i < NUM_ITERATIONS; i++) {x+=X_INC; y+=Y_INC;}

ASM:

    mov     QWORD PTR [rbp-80], 0
.L21:
    cmp     QWORD PTR [rbp-80], 999999999
    jg      .L20
    add     QWORD PTR [rbp-64], 17
    sub     QWORD PTR [rbp-72], 31
    add     QWORD PTR [rbp-80], 1
    jmp     .L21
.L20:

So both assemblies look pretty similar. But then let's think twice: modern CPUs have ALUs which operate on values that are wider than their register size. So there is a chance that in the first case the operations on x and i are done on the same computing unit. But then you have to read i again, as you put a condition on the result of this operation. And reading means waiting.

So, in the first case, to iterate on x, the CPU might have to be in sync with the iteration on i.

In the second case, maybe x and y are treated on a different unit than the one dealing with i. So in fact, your loop body runs in parallel with the condition driving it. And there goes your CPU, computing and computing until someone tells it to stop. It doesn't matter if it goes too far; going back a few loops is still fine compared to the amount of time it just gained.

So, to compare what we want to compare (one operation vs two operations), we should try to get i out of the way.

One solution is to completely get rid of it by using a while loop: C/C++:

while (x < (X_INC * NUM_ITERATIONS)) { x+=X_INC; }

ASM:

.L15:
    movabs  rax, 16999999999
    cmp     QWORD PTR [rbp-40], rax
    jg      .L14
    add     QWORD PTR [rbp-40], 17
    jmp     .L15
.L14:

Another one is to use the antiquated "register" C keyword: C/C++:

register long i;
for (i = 0; i < NUM_ITERATIONS; i++) { x+=X_INC; }

ASM:

    mov     ebx, 0
.L17:
    cmp     rbx, 999999999
    jg      .L16
    add     QWORD PTR [rbp-48], 17
    add     rbx, 1
    jmp     .L17
.L16:

Here are my results:

x1 for: 10.2985 seconds. x,y = 17000000000,0
x1 while: 8.00049 seconds. x,y = 17000000000,0
x1 register-for: 7.31426 seconds. x,y = 17000000000,0
x2 for: 9.30073 seconds. x,y = 17000000000,-31000000000
x2 while: 8.88801 seconds. x,y = 17000000000,-31000000000
x2 register-for: 8.70302 seconds. x,y = 17000000000,-31000000000

Code is here: https://onlinegdb.com/S1lAANEhI

Apatite answered 2/6, 2020 at 21:10 Comment(11)
modern CPUs have APUs (you mean ALUs) which operate on values which are wider than their register size. Yes, but you have to use SIMD manually, by running an instruction like PADDQ xmm0, xmm1. The CPU hardware won't fuse and auto-vectorize scalar add instructions for you. stackoverflow.com/tags/sse/infoLw
All of your loops bottleneck on a memory-destination add, which includes store-forwarding latency (~5 cycles, creating a 6 cycle loop-carried dep chain instead of 1 for a register). Related: Adding a redundant assignment speeds up code when compiled without optimization. Yes, register long i does have an effect in un-optimized code, but you forgot to use it for x as well.Lw
Also, all your loops are inefficient with two jumps, not conditional at the bottom: Why are loops always compiled into "do...while" style (tail jump)? Try gcc instead of g++, which makes loops differently at -O0 IIRC.Lw
For more background on guessing performance from looking at asm: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?Lw
You mention the CPU internally JITing the machine code. Transmeta Crusoe truly did dynamic recompilation. But modern x86 CPUs decode each instruction separately into 1 (or sometimes more) uops, the only exception being fusing a cmp/jcc or dec/jcc into a single uop. agner.org/optimize. Those uops can execute out-of-order, but for example, like I said earlier multiple add instructions won't fuse into a SIMD addition.Lw
Well, the OP asks for why this situation is in its current form. If you change both the code and the compiler, I guess that wouldn't answer the question.Chelate
And yes, I mean ALU, not APU. Edited.Chelate
My point about changing the compiler or more efficient loops was that the OP is claiming they still saw this effect with -O2 and -O3 which don't make stupid loops. You could maybe come closer to simulating that with register int with gcc instead of g++. (Although in this case they optimize the loop away entirely; godbolt.org/z/CMFnbx. So I have to call bullshit on the OP's claims for optimized builds. It might just be measuring extra overhead on the first call to now() in back-to-back calls with no loop in between.)Lw
@PeterCordes You make a technical point in somewhat strong language. In order to avoid undeserved wrong kind of attention, would you like to rephrase?Curious
@PeterCordes, about bullshit and now(): yes it might be. See my answer to your comment on my question. Feel free to edit.Fabian
@Yunnosch: Mistakenly making a wrong claim doesn't make someone a bad person. The claim is bullshit, as confirmed by the OP. Or to put it in more neutral language, with -O1 or higher GCC removes the loop entirely, leading to an empty timed region. Any conclusion based on startup overhead / noise is not meaningful and totally separate from the real effect visible on Sandybridge-family CPUs at -O0, with a store/reload bottleneck.Lw
