Branch-mispredictions versus cache misses [closed]

Consider the following two alternative pieces of code:

Alternative 1:

if (variable != new_val) // (1)
    variable = new_val;

f(); // This function reads `variable`.

Alternative 2:

variable = new_val; // (2)
f(); // This function reads `variable`.

Which alternative is "statistically" faster? Assume variable is in the L1 cache before (1) or (2) runs.

I'd guess that alternative (1) is faster even if the branch-misprediction rate is high, but I don't really know the cost of an "if". My guess is based on the assumption that cache misses are far more expensive than branch mispredictions, but I'm not sure of that either.

What if variable wasn't in cache before (1) or (2)? Does that change the situation much?

NOTE: Since the situation could vary a lot among different CPUs, you can base your answer on an architecture you are familiar with, though widely used CPUs like modern Intel architectures are preferred. The goal of my question is really to learn a bit more about how CPUs work.

Busterbustle answered 25/10, 2021 at 16:36 Comment(20)
No way to tell without benchmarking.Montagna
It almost certainly depends on the specific CPU.Montagna
Alt 1 can effectively include alternative 2 under out-of-order execution, in which case the result is just discarded when the predicate doesn't hold. Based on this, I'd say that alternative 2 is almost always more efficient. Efficiency is hard to pinpoint at this fine a grain even with micro-benchmarks, since you'd also have to consider the side effects on the rest of the program, e.g., the mere act of prefetching assigns more workload to the prefetcher. Another point is that by doing the comparison you've already placed your variables in registers, which would be a big part of the assignment alternativeKofu
Why do you think that version 1 could have fewer cache misses than version 2?Granulate
(1) is dependent on the previous value of variable, which will require fetching it from cache if needed, whereas the compiler is allowed to completely disregard the previous value in (2). I'd be surprised if (1) were faster unless the type of variable has a large sizeof() or a side-effect-producing assignment operator. But as always: don't assume, benchmark.Outrageous
In a multi-threading environment where variable represents an atomic variable of shared data between several threads with high thread contention, I would guess that alternative #1 is faster, because writes to shared data are very cache-unfriendly. However, in most situations, I would say that alternative #2 is faster. Either way, I recommend that you use benchmarking to determine which method is best in your particular situation.Austere
I added a note based on @Montagna suggestion.Busterbustle
I don't understand. They are functionally not the same. The first example involves a branch, versus the second, which doesn't. The second version should be faster because branch prediction isn't initiated; it's a data instruction, and processors love data instructions. Invoking branch prediction takes time, whereas not invoking it adds no extra time.Rossiya
@Peregring-lk the cost of a misprediction can be very high. Take the pipeline flush into consideration.Cicatrize
Print out the assembly language. Note the difference in instructions.Rossiya
Remember that variable can be placed in a register, which affects whether it is cached at all. In my understanding, registers don't involve the cache, except to load and store values. Thus there is a possibility that f() doesn't use the cache because the value is still in a register. It depends on when the variable is used in f() and how the compiler generated the instructions.Rossiya
There are some processors that can conditionally execute instructions (like the ARM). In example 1, the assignment could be a conditional instruction, either executed or ignored based on the status flags set by the comparison.Rossiya
@ThomasMatthews How would the assembly code help? Branch prediction happens in hardware, not instructions.Montagna
@Barmar: The assembly language shows whether branch prediction comes into play. Branch prediction is triggered by conditional branch instructions (usually preceded by a compare). The assembly of the first example should show a branch instruction, whereas the second example should be free of branches (before the function call). This is what I mean by examining the assembly language.Rossiya
@ThomasMatthews Obviously there will be a conditional branch in the first version. The question is whether the speed-up from correct branch prediction makes up for the time it takes to do the comparison.Montagna
Note well that a C compiler would be free in many cases to compile both examples to the same machine code, and that common machine code might or might not include a conditional branch (see the sketch after these comments).Buckish
Does this answer your question? Strange optimization? in `libuv`. Please explainCoiffeur
Could also be considered a duplicate of thisCoiffeur
What types are the variables?Mediant
@JeremyFriesner Assume any primitive type. For example int or uint64_t, whatever it typedefs to.Busterbustle
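
To make the assembly-language discussion in the comments concrete, here is a minimal sketch of the two alternatives as standalone functions. The asm in the comments is only representative of what a typical x86-64 compiler (e.g. GCC or Clang at -O2) might emit — an assumption for illustration, not guaranteed output; a predicated ISA like ARM could use a conditional instruction instead:

int variable; // global, so the compiler can't keep it in a register

void alt1(int new_val) {      // x86-64, new_val arrives in edi:
    if (variable != new_val)  //   mov eax, [variable]
        variable = new_val;   //   cmp eax, edi
}                             //   je  skip
                              //   mov [variable], edi
                              // skip: ret

void alt2(int new_val) {
    variable = new_val;       //   mov [variable], edi
}                             //   ret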

Normally, alternative 2 is faster because it executes less machine code, and the store buffer decouples unconditional stores from other parts of the core, even if they miss in cache.

If alternative 1 were consistently faster, compilers would emit asm that did that; it isn't, so they don't. It introduces a possible branch miss and a load that can cache-miss. There are plausible circumstances under which it could be better (e.g. false sharing with other threads, or breaking a data dependency), but those are special cases that you'd have to confirm with performance experiments and perf counters.
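
For example, a test harness along these lines (a hypothetical sketch — names are invented, variable is volatile only to keep the compiler from deleting the otherwise-dead stores, and the numbers will vary a lot by CPU and by how predictable the values are):

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

volatile int variable; // volatile so the stores aren't optimized away

int main() {
    std::mt19937 rng(42);
    std::vector<int> vals(1 << 22);
    for (int &v : vals) v = rng() & 1; // random 0/1: the branch in
                                       // alternative 1 is unpredictable,
                                       // roughly its worst case

    auto time_ns = [&](bool branchy) {
        auto t0 = std::chrono::steady_clock::now();
        for (int v : vals) {
            if (branchy) {
                if (variable != v) variable = v; // alternative 1
            } else {
                variable = v;                    // alternative 2
            }
        }
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(t1 - t0).count()
               / vals.size();
    };

    std::printf("alt 1 (branchy): %.2f ns/iter\n", time_ns(true));
    std::printf("alt 2 (store):   %.2f ns/iter\n", time_ns(false));
    // Refill vals with a constant value to measure the well-predicted case.
}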


Reading variable in the first place already touches memory for both variables (if neither is in a register). If you expect variable to almost always equal new_val already (so the branch predicts well), and for that load to miss in cache, branch prediction + speculative execution can be helpful to decouple later reads of variable from that cache-miss load. But it's still a cache-miss load that has to be waited for before the branch condition can be verified, so the total miss penalty could end up being quite large if the branch predicts wrong. Otherwise you're hiding a lot of the cache-miss load penalty by making more later work independent of it, allowing out-of-order exec up to the limit of the ROB size.

Other than breaking the data dependency, if f() inlines and variable optimizes into a register, it would be pointless to branch. Otherwise, a store that misses in L1d but hits in L2 cache is still pretty cheap, and decoupled from execution by the store buffer. (Can a speculatively executed CPU branch contain opcodes that access RAM?) Even hitting in L3 is not too bad for a store, unless other threads have the line in shared state and dirtying it would interfere with them reading values of other global vars. (False sharing)
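
For reference, the false-sharing case looks like this (a minimal sketch with invented field names; 64 bytes is the line size on current x86 CPUs, an assumption rather than a portable constant):

struct Shared {           // both fields share one cache line:
    int written_by_a;     // thread A's stores keep dirtying the line...
    int read_by_b;        // ...so thread B's reads of this keep missing
};

struct Padded {           // one line per field avoids the interference
    alignas(64) int written_by_a;
    alignas(64) int read_by_b;
};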

Note that later reloads of variable can use the newly-stored value even while the store is waiting to commit from the store buffer to L1d cache (store forwarding), so even if f() didn't inline and use the new_val load result directly, its use of variable still doesn't have to wait for a possible store miss on variable.
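
Concretely, the pattern is just (a sketch; the forwarding itself is microarchitectural and invisible in the source):

variable = new_val; // sits in the store buffer until it commits to L1d
int x = variable;   // served by store forwarding from that buffer, so it
                    // doesn't wait for a possible cache miss on the line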


Avoiding false-sharing is one of the few reasons it could be worth branching to avoid a single store of a value that fits in a register.

Two questions linked in comments by @Coiffeur discuss a case of this possible optimization (or possible pessimization) to avoid writes. It's sometimes done with std::atomic variables because false sharing is an even bigger deal there. (And stores with the default mo_seq_cst memory order are slow on most ISAs other than AArch64, draining the store buffer.)
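
A minimal sketch of that write-avoidance pattern on an atomic flag (illustrative code, not taken from those questions; the relaxed load is one common choice, and the linked discussions cover the exact orderings):

#include <atomic>

std::atomic<int> flag{0};

void set_flag() {
    // Skip the store when the value is already right: readers polling
    // `flag` keep their shared copy of the line, and we avoid a
    // seq_cst store, which drains the store buffer on many ISAs.
    if (flag.load(std::memory_order_relaxed) != 1)
        flag.store(1); // default memory_order_seq_cst
}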

Thumbscrew answered 26/10, 2021 at 3:20 Comment(0)
