Branch-mispredictions versus cache misses [closed]

Consider the following two alternative pieces of code:

Alternative 1:

if (variable != new_val) // (1)
    variable = new_val;

f(); // This function reads `variable`.

Alternative 2:

variable = new_val; // (2)
f(); // This function reads `variable`.

Which alternative is "statistically" faster? Assume variable is in the L1 cache before (1) or (2) runs.

I'd guess that alternative (1) is faster even if the branch-misprediction rate is high, but I don't really know the cost of an "if". My guess is based on the assumption that cache misses are far more expensive than branch mispredictions, but I'm not sure of that either.

What if variable wasn't in cache before (1) or (2)? Does that change the situation much?

NOTE: Since the situation could vary a lot among different CPUs, you can base your answer on an architecture you are familiar with, though widely used CPUs like modern Intel architectures are preferred. The goal of my question is really to learn a bit more about how CPUs work.

Busterbustle answered 25/10, 2021 at 16:36 Comment(20)
No way to tell without benchmarking.Montagna
It almost certainly depends on the specific CPU.Montagna
Alt 1 can effectively include alternative 2 under out-of-order execution, in which case the result is just discarded when the predicate doesn't hold. Based on this, I'd say that alternative 2 is almost always more efficient. Efficiency is hard to pinpoint at this fine a grain even with micro-benchmarks, since you'd also have to consider the side effects on the rest of the program, e.g., the mere act of prefetching assigns more workload to the prefetcher. Another point is that by doing the comparison you've already placed your variables in registers, which would be a big part of the assignment alternativeKofu
Why do you think that version 1 could have fewer cache misses than version 2?Granulate
(1) is dependent on the previous value of variable, which will require fetching it from cache if needed, whereas the compiler is allowed to completely disregard the previous value in (2). I'd be surprised if (1) were faster unless the type of variable has a large sizeof() or a side-effect-producing assignment operator. But as always: don't assume, benchmark.Outrageous
In a multi-threading environment where variable represents an atomic variable of shared data between several threads with high thread contention, I would guess that alternative #1 is faster, because writes to shared data are very cache-unfriendly. However, in most situations, I would say that alternative #2 is faster. Either way, I recommend that you use benchmarking to determine which method is best in your particular situation.Austere
I added a note based on @Montagna suggestion.Busterbustle
I don't understand. They are functionally not the same. The first example involves a branch, versus the second, which doesn't. The second version should be faster because branch prediction isn't initiated; it's a data instruction, and processors love data instructions. Invoking branch prediction takes time, whereas not invoking it adds no extra time.Rossiya
@Peregring-lk the cost of a misprediction can be very high. Take the pipeline flush into consideration.Cicatrize
Print out the assembly language. Note the difference in instructions.Rossiya
Remember that variable can be placed in a register, which affects whether it is cached at all. In my understanding, registers don't involve the cache, except to load and store values. Thus there is a possibility that f() doesn't use the cache because the value is still in a register. It depends on when the variable is used in f() and how the compiler generated the instructions.Rossiya
There are some processors that can conditionally execute instructions (like the ARM). In example 1, the assignment could be a conditional instruction, either executed or ignored based on the status flags set by the comparison.Rossiya
@ThomasMatthews How would the assembly code help? Branch prediction happens in hardware, not instructions.Montagna
@Barmar: The assembly language shows whether branch prediction comes into play. Branch prediction is triggered by conditional branch instructions (usually preceded by a compare). The assembly of the first example should show a branch instruction, whereas the second example should be free of branches (before the function call). This is what I mean by examining the assembly language.Rossiya
@ThomasMatthews Obviously there will be a conditional branch in the first version. The question is whether the speed-up from correct branch prediction makes up for the time it takes to do the comparison.Montagna
Note well that a C compiler would be free in many cases to compile both examples to the same machine code, and that common machine code might or might not include a conditional branch (see the sketch after these comments).Buckish
Does this answer your question? Strange optimization? in `libuv`. Please explainCoiffeur
Could also be considered a duplicate of thisCoiffeur
What types are the variables?Mediant
@JeremyFriesner Assume any primitive type. For example int or uint64_t, whatever it typedefs to.Busterbustle
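
To make the assembly-language discussion in the comments concrete, here is a minimal sketch of the two alternatives as standalone functions. The asm in the comments is only representative of what a typical x86-64 compiler (e.g. GCC or Clang at -O2) might emit — an assumption for illustration, not guaranteed output; a predicated ISA like ARM could use a conditional instruction instead:

int variable; // global, so the compiler can't keep it in a register

void alt1(int new_val) {      // x86-64, new_val arrives in edi:
    if (variable != new_val)  //   mov eax, [variable]
        variable = new_val;   //   cmp eax, edi
}                             //   je  skip
                              //   mov [variable], edi
                              // skip: ret

void alt2(int new_val) {
    variable = new_val;       //   mov [variable], edi
}                             //   ret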

Normally, alternative 2 is faster because it executes less machine code, and the store buffer decouples unconditional stores from other parts of the core, even if they miss in cache.

If alternative 1 were consistently faster, compilers would emit asm that did that; it isn't, so they don't. It introduces a possible branch miss and a load that can cache-miss. There are plausible circumstances under which it could be better (e.g. false sharing with other threads, or breaking a data dependency), but those are special cases that you'd have to confirm with performance experiments and perf counters.
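
For example, a test harness along these lines (a hypothetical sketch — names are invented, variable is volatile only to keep the compiler from deleting the otherwise-dead stores, and the numbers will vary a lot by CPU and by how predictable the values are):

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

volatile int variable; // volatile so the stores aren't optimized away

int main() {
    std::mt19937 rng(42);
    std::vector<int> vals(1 << 22);
    for (int &v : vals) v = rng() & 1; // random 0/1: the branch in
                                       // alternative 1 is unpredictable,
                                       // roughly its worst case

    auto time_ns = [&](bool branchy) {
        auto t0 = std::chrono::steady_clock::now();
        for (int v : vals) {
            if (branchy) {
                if (variable != v) variable = v; // alternative 1
            } else {
                variable = v;                    // alternative 2
            }
        }
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(t1 - t0).count()
               / vals.size();
    };

    std::printf("alt 1 (branchy): %.2f ns/iter\n", time_ns(true));
    std::printf("alt 2 (store):   %.2f ns/iter\n", time_ns(false));
    // Refill vals with a constant value to measure the well-predicted case.
}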


Reading variable in the first place already touches memory for both variables (if neither is in a register). If you expect variable to almost always equal new_val already (so the branch predicts well), and for that load to miss in cache, branch prediction + speculative execution can be helpful to decouple later reads of variable from that cache-miss load. But it's still a cache-miss load that has to be waited for before the branch condition can be verified, so the total miss penalty could end up being quite large if the branch predicts wrong. Otherwise you're hiding a lot of the cache-miss load penalty by making more later work independent of it, allowing out-of-order exec up to the limit of the ROB size.

Other than breaking the data dependency, if f() inlines and variable optimizes into a register, it would be pointless to branch. Otherwise, a store that misses in L1d but hits in L2 cache is still pretty cheap, and decoupled from execution by the store buffer. (Can a speculatively executed CPU branch contain opcodes that access RAM?) Even hitting in L3 is not too bad for a store, unless other threads have the line in shared state and dirtying it would interfere with them reading values of other global vars. (False sharing)
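
For reference, the false-sharing case looks like this (a minimal sketch with invented field names; 64 bytes is the line size on current x86 CPUs, an assumption rather than a portable constant):

struct Shared {           // both fields share one cache line:
    int written_by_a;     // thread A's stores keep dirtying the line...
    int read_by_b;        // ...so thread B's reads of this keep missing
};

struct Padded {           // one line per field avoids the interference
    alignas(64) int written_by_a;
    alignas(64) int read_by_b;
};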

Note that later reloads of variable can use the newly-stored value even while the store is waiting to commit from the store buffer to L1d cache (store forwarding), so even if f() didn't inline and use the new_val load result directly, its use of variable still doesn't have to wait for a possible store miss on variable.
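
Concretely, the pattern is just (a sketch; the forwarding itself is microarchitectural and invisible in the source):

variable = new_val; // sits in the store buffer until it commits to L1d
int x = variable;   // served by store forwarding from that buffer, so it
                    // doesn't wait for a possible cache miss on the line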


Avoiding false-sharing is one of the few reasons it could be worth branching to avoid a single store of a value that fits in a register.

Two questions linked in comments by @Coiffeur discuss a case of this possible optimization (or possible pessimization) to avoid writes. It's sometimes done with std::atomic variables because false sharing is an even bigger deal there. (And stores with the default mo_seq_cst memory order are slow on most ISAs other than AArch64, draining the store buffer.)
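
A minimal sketch of that write-avoidance pattern on an atomic flag (illustrative code, not taken from those questions; the relaxed load is one common choice, and the linked discussions cover the exact orderings):

#include <atomic>

std::atomic<int> flag{0};

void set_flag() {
    // Skip the store when the value is already right: readers polling
    // `flag` keep their shared copy of the line, and we avoid a
    // seq_cst store, which drains the store buffer on many ISAs.
    if (flag.load(std::memory_order_relaxed) != 1)
        flag.store(1); // default memory_order_seq_cst
}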

Thumbscrew answered 26/10, 2021 at 3:20 Comment(0)
