Normally, alternative 2 is faster because less machine code executes, and the store buffer will decouple unconditional stores from other parts of the core, even if they miss in cache.
If alternative 1 were consistently faster, compilers would emit asm that did that; they don't, because it isn't. It introduces a possible branch miss and a load that can cache-miss. There are plausible circumstances under which it could be better (e.g. false sharing with other threads, or breaking a data dependency), but those are special cases that you'd have to confirm with performance experiments and perf counters.
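For reference, the two alternatives being compared presumably have this shape (a reconstruction from context; the function names `alt1`/`alt2` are mine):

```c++
int variable;

void alt1(int new_val) {    // alternative 1: check before storing
    if (variable != new_val)
        variable = new_val;
}

void alt2(int new_val) {    // alternative 2: unconditional store
    variable = new_val;
}
```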
Reading `variable` in the first place already touches memory for both variables (if neither is in registers). If you expect `new_val` to almost always be the same (so the branch predicts well), and for that load to miss in cache, branch prediction + speculative execution can be helpful to decouple later reads of `variable` from that cache-miss load. But the load still has to be waited for eventually so the branch condition can be verified, so the total miss penalty could end up being quite large if the branch predicts wrong. Otherwise, you're hiding a lot of the cache-miss load penalty by making more of the later work independent of it, allowing out-of-order exec up to the limit of the ROB size.
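To make that dependency-breaking effect concrete, here's a hedged sketch, with `use()` standing in for later code that reads `variable` (both names are assumptions, not from the question):

```c++
extern int variable;             // some global (assumed)
void use(int);                   // stand-in for later work (assumed)

void alt2_path(const int* p) {
    int new_val = *p;            // this load may miss in cache
    variable = new_val;
    use(variable);               // store-forwarded, so data-dependent on *p
}

void alt1_path(const int* p) {
    int new_val = *p;            // same possibly-missing load
    if (variable != new_val)     // branch predicted (usually not-taken)
        variable = new_val;
    use(variable);               // on the predicted path, independent of *p;
                                 // only a mispredict pays the full penalty
}
```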
Other than breaking the data dependency, if `f()` inlines and `variable` optimizes into a register, it would be pointless to branch. Otherwise, a store that misses in L1d but hits in L2 cache is still pretty cheap, and is decoupled from execution by the store buffer. (Can a speculatively executed CPU branch contain opcodes that access RAM?) Even hitting in L3 is not too bad for a store, unless other threads have the line in Shared state and dirtying it would interfere with them reading the values of other global vars. (False sharing)
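For instance, with everything in registers, mainstream compilers can typically see through the branch entirely; a minimal sketch (hypothetical function, not from the question):

```c++
int branchy(int variable, int new_val) {
    if (variable != new_val)
        variable = new_val;
    return variable;    // always equals new_val, so compilers can reduce
                        // the whole function to a single register move,
                        // with no branch at all
}
```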
Note that later reloads of `variable` can use the newly-stored value even while the store is waiting to commit from the store buffer to L1d cache (store forwarding), so even if `f()` didn't inline and use the `new_val` load result directly, its use of `variable` still wouldn't have to wait for a possible store miss on `variable`.
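A hedged sketch of that store-forwarding scenario, matching the non-inlined `f()` case described above (the declarations are assumptions for illustration):

```c++
extern int variable;
void f();                   // not inlined; assumed to read `variable`

void update_and_call(int new_val) {
    variable = new_val;     // store may miss in L1d; sits in the store buffer
    f();                    // f()'s load of `variable` can be satisfied by
                            // store-to-load forwarding while the store is
                            // still waiting to commit
}
```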
Avoiding false sharing is one of the few reasons it could be worth branching to avoid a single store of a value that fits in a register.
Two questions linked in comments by @EOF discuss a case of this possible optimization (or possible pessimization) to avoid writes. It's sometimes done with `std::atomic` variables, because false sharing is an even bigger deal there. (And stores with the default `mo_seq_cst` memory order are slow on most ISAs other than AArch64, draining the store buffer.)
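The typical shape of that check-before-write optimization with atomics looks something like this (an illustrative sketch, not code from the linked questions):

```c++
#include <atomic>

std::atomic<bool> done{false};

void mark_done() {
    // Read-only fast path: if the flag is already set, skip the store so the
    // cache line can stay in Shared state across cores, instead of being
    // invalidated in other cores' caches on every call. A relaxed load is
    // enough for the check itself.
    if (!done.load(std::memory_order_relaxed))
        done.store(true, std::memory_order_release);
}
```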
(1) reads `variable`, which will require fetching it from cache if needed, whereas the compiler is allowed to completely disregard previous values in (2). I'd be surprised if (1) is faster unless the type of `variable` has a large `sizeof()` or some side-effect-producing assignment operator. But as always: don't assume, benchmark. – Outrageous

If `variable` represents an atomic variable of data shared between several threads with high contention, I would guess that alternative #1 is faster, because writes to shared data are very cache-unfriendly. However, in most situations, I would say that alternative #2 is faster. Either way, I recommend benchmarking to determine which method is best in your particular situation. – Austere

Whether `variable` can be placed into a register affects whether the variable is cached or not. In my understanding, registers don't involve the cache, except to load and store values. Thus there is a possibility that `f()` doesn't use the cache because the value is still in a register. It depends on when `variable` is used in `f()` and how the compiler generated the instructions. – Rossiya

`int` or `uint64_t`, whatever it typedefs to. – Busterbustle