Adding a redundant assignment speeds up code when compiled without optimization

I've found an interesting phenomenon:

#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for(int i = 1; i < 1000; i++){
        for(int j = 1; j < 1000; j++){
            for(int k = 1; k < 1000; k++){
                p = i + j * k;
                q = p;  //Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}

I used GCC 7.3.0 on an i5-5257U running macOS to compile the code without any optimization. Here is the average run time over 10 runs: [screenshot of timing results]. Other people have also tested the case on other Intel platforms and got the same result.
I've posted the assembly generated by GCC here. The only difference between the two assembly listings is that, before addl $1, -12(%rbp), the faster one has two more instructions:

movl    -44(%rbp), %eax    # load p from its stack slot
movl    %eax, -48(%rbp)    # store it to q's stack slot

So why does the program run faster with such an assignment?


Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the idea that the store-forwarding speedup is specific to some Intel CPUs.
BeeOnRope's comments and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it is worth discussing.

Hoedown answered 9/3, 2018 at 8:41 Comment(15)
Do you build with or without optimizations enabled? Any kind of benchmarking without optimizations is borderline worthless. – Orenorenburg
You could instruct gcc to only generate assembly, which is typically more readable than the disassembly (the term "decompile" is IMHO wrong) you have provided. – Singlet
runs faster -- how much faster? Can you show the run times over say 10 trials? – Disfigure
What is the error of your benchmark, are the differences significant? – Bignoniaceous
You're benchmarking a debug build, which is basically useless. But if you want to know exactly why, the bottleneck will be all the store/reloads, probably a loop-carried dependency on k. If you're on Skylake, store/reload latency can actually be lower (better) when there's more in between the dependent pair (including other stores/loads). – Sniffle
If you're on Skylake, I think this is a duplicate of stackoverflow.com/questions/45442458/… – Sniffle
Also, use p (and q) after the loops, otherwise a compiler might not generate code for those loops at all. – Orenorenburg
@Someprogrammerdude I build with optimization zero -O0. – Hoedown
@AjayBrahmakshatriya About 0.5 sec faster. – Hoedown
So no optimization at all. Which as stated is not enough for benchmarking. Use at least -O2. – Orenorenburg
@Hoedown 0.5 seconds faster over how much? Talk in percentages. What is variance in the base line if you run it 10 times? – Disfigure
@PeterCordes Thanks. I will check this out. – Hoedown
I'm voting to close this question as off-topic because questions about the performance of code compiled without any optimisation are not helpful. – Warr
@TobySpeight - I disagree. Compiling without optimization isn't useful for performance analysis, but at the end of the day, regardless of the compiler settings, one might ask why one snippet of assembly emitted by the compiler is slower than another, despite the first one having strictly fewer statements. That alone can be interesting as Peter's answer shows. – Jessen
Nice edit, now it's a well-written and interesting question. Now it really deserves the upvote I gave it earlier, and I wish I could upvote again. :) – Sniffle

TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this effect in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.

None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.


You're benchmarking a debug build, which is basically useless. Debug builds have different bottlenecks than optimized code, not a uniform slowdown.


But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)

If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.

You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
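If you do want a benchmark that still measures something with optimization enabled, make the results observable. Here is a minimal sketch of one way to do that (this exact variant is my suggestion, not code from the question): it accumulates into q and prints it, so the loops can't simply be deleted, although a smart compiler may still simplify them.

#include <stdio.h>
#include <time.h>

int main(void) {
    int p = 0;
    unsigned q = 0;                /* unsigned so the deliberate wrap-around is well-defined */
    clock_t s = clock();
    for (int i = 1; i < 1000; i++)
        for (int j = 1; j < 1000; j++)
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q += (unsigned)p;  /* accumulate instead of overwriting */
            }
    clock_t e = clock();
    /* printing q makes the result observable, so an optimizing build can't drop the loops */
    printf("%lf (q=%u)\n", (double)(e - s) / CLOCKS_PER_SEC, q);
    return 0;
}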


The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.

If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.

So with more work in the loop, that addl $1, -12(%rbp) which can sustain one per 6 cycle throughput when run back-to-back might instead only create a bottleneck of one iteration per 4 or 5 cycles.

This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
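If you want to check the effect on your own machine, here is a minimal self-contained sketch (my own construction, not from the question) that forces a loop-carried store/reload chain with volatile and then adds unrelated stores in between the dependent pair. Whether the second loop actually runs faster per iteration will depend on your CPU:

#include <stdio.h>
#include <time.h>

#define N 100000000

int main(void) {
    /* volatile forces every access to k, a and b to really go through memory */
    volatile int k, a, b;
    clock_t t0, t1, t2;

    t0 = clock();
    for (k = 0; k < N; )
        k = k + 1;                 /* reload k, add 1, store k: back-to-back store-forwarding */
    t1 = clock();

    for (k = 0; k < N; ) {
        k = k + 1;
        a = 1;                     /* independent stores between k's store ... */
        b = 2;                     /* ... and the next iteration's reload of k */
    }
    t2 = clock();

    printf("plain loop:        %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("with extra stores: %f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}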


Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.


The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)

Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)


Does this mean adding useless statements will speed up real programs? (with optimization enabled)

In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.

Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.

Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.


Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, possibly making things worse on some other CPUs. See the discussion in the comments.
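As a hypothetical illustration of that pass-by-reference case (GCC's noinline attribute stands in here for a function the optimizer can't see into, e.g. one in another translation unit built without LTO), something like this keeps a loop-carried store/reload chain even at -O2:

#include <stdio.h>

__attribute__((noinline))          /* stand-in for "defined in another file, no LTO" */
static void bump(int *x) {
    *x += 1;                       /* load through the pointer, add, store back to memory */
}

int main(void) {
    int x = 0;
    for (int i = 0; i < 100000000; i++)
        bump(&x);                  /* x must live in memory so its address can be passed */
    printf("%d\n", x);             /* keep the result observable */
    return 0;
}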

Sniffle answered 9/3, 2018 at 9:22 Comment(16)
This question was barely worth answering, but I'm really close to a gold badge in the [performance] tag, so I was tempted to write some filler instead of waiting to close it as a dup. :P Hopefully it's not totally useless filler. – Sniffle
Indeed, I was reading your answer and about to drop a comment saying basically what you just did, that questions about performance with -O0 are meaningless. Good answer anyways, best of luck with gold performance :) – Reece
I tried compiling and running it on a Raspberry Pi and the result seems to support your answer. Although this question is useless, thank you for your answer. – Hoedown
@PeterCordes By the way, I actually did everything on a Broadwell i5-5257U instead of Skylake. Does that mean Broadwell maybe has the same mechanism? – Hoedown
Well the No, it doesn't statement is a bit strong :). Real-world cases of this store-forwarding issue, like the "i7 loop anomaly" described here, crop up with some frequency. Sure, the compiler tries to keep things in registers, which would eliminate the issue (if it's the issue) in this case, but store-forwarding isn't always avoided by optimization: even with an infinitely smart compiler that can see across all function boundaries you'd still be left with data-dependent forwarding. – Jessen
Function calls are another source: the push rsp, pop rsp inherent in call and ret often lead directly to store forwarding for short functions (that seemed to be the cause of the issue I linked above, which occurred at both -O2 and -O3). Beyond that there are plenty of weird cases where adding apparently redundant instructions speeds things up even with optimization, by adjusting alignment, affecting predictors, changing scheduling, etc. There is, AFAIK, no trick to do that on average: it's all case-by-case. – Jessen
@BeeOnRope: My point was that adding crap like int dummy=q; to the C source isn't going to fix that in optimized code. (And BTW, I think you mean push RIP / pop RIP inherent in call/ret.) But yeah, it can happen in compiler output sometimes, especially without link-time optimization to inline tiny functions. – Sniffle
@helloqiu: Yes, and thanks for that data point. BeeOnRope's comment linked to a blog post from 2013 reporting basically the same thing on Haswell and Sandybridge. So it appears that store-forwarding latency depends on surrounding code on the whole range of Sandybridge-family CPUs, from SnB to SKL and presumably onwards. – Sniffle
Yes, I meant rip. Of course the line int dummy = q will be compiled away if optimization is on, but similar dummy statements certainly can help in optimized code. I think maybe what we are getting at is that redundant or apparently useless/slower statements can speed things up, both in unoptimized and optimized code, but it is not necessarily the case that the same redundant statement speeds up the same code snippet in both modes. I.e., in the example above, the q = p line won't help in optimized mode, but a similar q = p line may help in optimized mode in some other snippet... – Jessen
... but not in debug mode. Of course in optimized mode fewer redundant instructions in the source will remain in the first place, so sometimes you need a volatile there, an __attribute__ there, etc. – Jessen
@Hoedown - I don't think this question is useless. You started at a big disadvantage by compiling without optimization, which is already a giant red flag for "why does performance of Y behave like Z" - but since the compiler only emitted extra instructions for your slower case, it turns out that it's an interesting question at the assembly level. I.e., you could almost remove the C origin of the question, and the fact that you compiled without optimizations, and ask about the behavior of the assembly, and probably avoid the downvote avalanche. – Jessen
@BeeOnRope: note that call/ret doesn't create a loop-carried dependency, because the address pushed by call comes from speculative execution + branch prediction. Multiple store/reloads to the same address can sustain one per clock when the store is not data-dependent on the load. Executing the ret instructions can go one per clock, 5 cycles behind the call instructions. (Well, of course call/ret are both branches, so they compete with each other for execution resources, and thus don't even bottleneck on memory.) What could be a problem is a push/pop rbp, or x=foo(x) by ref. – Sniffle
@PeterCordes - that question is interesting enough that we shouldn't hash it out in the comments. I asked about it over here. Hint: I don't know the answer. – Jessen
@PeterCordes Someone just told me that it's helpful to use perf to profile the program, so I tried it. It seems that with the extra mov operations generated from q = p, the percentage of cycles attributed to p = i + j * k instead of addl $1, -12(%rbp) decreases. So is it possible that the assignment makes the imul faster instead of the addl for k++? – Hoedown
@helloqiu: that's not how performance works. Out-of-order pipelined CPUs mean that total run time isn't just the sum of how long each instruction takes on its own. See stackoverflow.com/questions/45113527/… for more about throughput vs. latency vs. execution-port bottlenecks. Also, the HW counters perf uses have limited accuracy, see stackoverflow.com/questions/48369347/… – Sniffle
On most new hardware, cycles:ppp should have high accuracy. – Jessen
