TL;DR: It gives the compiler and hardware more room to take advantage of the as-if rule by not requiring it to preserve all behaviour of the original source, only the result of the single thread itself.
Taking the externally-observable (from other threads) ordering of loads/stores out of the picture as something that optimizations must preserve gives the compiler a lot of room to merge things into fewer operations. For the hardware, delaying stores is the big one, but for compilers all kinds of reordering can help.
(Letting cache-hit loads finish before earlier cache-miss loads is also useful for hardware, and same for stores, without having to do load-order speculation with possible roll-back. Jeff Preshing's Memory Barriers Are Like Source Control Operations has useful analogies and descriptions of plausible ways hardware can do it. But modern x86 CPUs throw enough transistors and power at these ordering requirements to manage them without huge performance downsides in most code.)
(See part-way down for a section on why it helps the compiler)
Why it helps hardware
Hardware reordering earlier stores with later loads (StoreLoad reordering) inside the CPU is essential for out-of-order execution. (See below).
Other kinds of reordering (e.g. StoreStore reordering, which is the subject of your question) aren't essential, and high performance CPUs can be built with only StoreLoad reordering, not the other three kinds. (The prime example being tag:x86, where every store is a release-store, every load is an acquire-load. See the x86 tag wiki for more details.)
Some people, like Linus Torvalds, argue that reordering stores with other stores doesn't help the hardware much, because hardware already has to track store-ordering to support out-of-order execution of a single thread. (A single thread always runs as if all of its own stores/loads happen in program order.) See other posts in that thread on realworldtech if you're curious. And/or if you find Linus's mix of insults and sensible technical arguments entertaining :P
For Java, the issue is that, architectures exist where the hardware doesn't provide these ordering guarantees. Weak memory ordering is a common feature of RISC ISAs like ARM, PowerPC, and MIPS. (But not SPARC-TSO). The reasons behind that design decision are the same ones being argued over in the realworldtech thread I linked: make the hardware simpler, and let software request ordering when needed.
So Java's architect(s) didn't have much of a choice: Implementing a JVM for an architecture with a weaker memory model than the Java standard would require a store-barrier instruction after every single store, and a load-barrier before every load. (Except when the JVM's JIT-compiler can prove that no other thread can have a reference to that variable.) Running barrier instructions all the time is slow.
A strong memory model for Java would make efficient JVMs on ARM (and other ISAs) impossible. Proving that barriers aren't needed is near-impossible, requiring AI levels of global program-understanding. (This goes WAY beyond what normal optimizers do).
Why it helps compilers
(see also Jeff Preshing's excellent blog post on C++ compile-time reordering. This basically applies to Java when you include JIT-compiling to native code as part of the process.)
Another reason for keeping the Java and C/C++ memory models weak is to allow more optimizations. Since other threads are allowed (by the weak memory model) to observe our stores and loads in any order, aggressive transformations are allowed even when the code involves stores to memory.
e.g. in a case like Davide's example:
c.a = 1;
c.b = 1;
c.a++;
c.b++;
// same observable effects as the much simpler
c.a = 2;
c.b = 2;
There's no requirement that other threads be able to observe the intermediate states. So a compiler can just compile that to c.a = 2; c.b = 2;
, either at Java-compile time or when the bytecode is JIT-compiled to machine code.
It's common for a method that increments something to be called multiple times from another method. Without this rule, turning it into c.a += 4
could only happen if the compiler could prove that no other thread could observe the difference.
C++ programmers sometimes make the mistake of thinking that since they're compiling for x86, they don't need std::atomic<int>
to get some ordering guarantees for a shared variable. This is wrong, because optimizations happen based on the as-if rule for the language memory model, not the target hardware.
More technical hardware explanations:
Why StoreLoad reordering helps performance:
Once a store is committed into cache, it becomes globally visible to threads running on other cores (via the cache-coherency protocol). At that point, it's too late to roll it back (another core might already have gotten a copy of the value). So it can't happen until it's known for certain that the store won't fault, and neither will any instruction before it (i.e. at or after retirement from the out-of-order back-end). And that there wasn't a branch-mispredict at some point earlier, etc. etc. i.e. we need to rule out all cases of mis-speculation before we can commit a store instruction to L1d cache. (This is why we have a store buffer)
Without StoreLoad reordering, every load would have to wait for all preceding stores to retire (i.e. be totally finished executing, known non-speculative) and actually commit the data to cache, before they could read a value from cache for use by later instructions that depend on the value loaded. (The moment when a load copies a value from cache into a register is the critical time when it "happens" as part of the coherency order of loads and stores on that memory location.)
Normally CPUs commit stores from the store-buffer to L1d cache after the corresponding store instruction has retired from the out-of-order back-end (ReOrder Buffer = ROB). Some amount of "graduated" stores can be in the store buffer, so this decouples execution from cache-miss stores. But you could give up that benefit and make commit to L1d happen as part of retirement. (Execution of a store would still work by writing the address+data into the store buffer, so it can be speculative, happening when the data is ready. The store buffer keeps this speculation private to the core.)
Without StoreLoad reordering, loads couldn't execute until all earlier stores had committed to cache. This would be a huge roadblock for memory parallelism. i.e. every load would be like an x86 lfence
, draining the out-of-order back-end, and like mfence
waiting for the store buffer to empty since we're proposing that commit would happen at retirement, not after. Including waiting for any earlier cache miss loads or stores, and waiting for the CPU to chew through all dependency chains, so it would tend to serialize things instead of letting the CPU overlap independent iterations of a loop, or other long dep chains.
Modern x86 CPUs do speculative early loads ahead of other (cache-miss) loads, potentially taking a memory-order mis-speculation pipeline nuke if they detect that their copy of the cache line didn't stay valid from when the load actually happened to when it's architecturally allowed. In that case they discard the ROB contents to roll back to a consistent retirement state and start executing again. (This normally only happens when another core is modifying a cache line, but can also happen if it incorrectly predicted that a load wouldn't reload a store.) (Of course real x86 can freely reorder loads ahead of stores.)
If StoreLoad reordering wasn't allowed, a CPU could still do loads speculatively, but would probably still have to commit stores earlier than normal CPUs. Load speculation could track stores in the store buffer post-retirement.
So really the limitation would be that loads can't retire until earlier stores commit. Those stores can still stay in the store buffer after they retire (become non-speculative). Which doesn't sound that bad on modern CPUs with huge ROBs and large store buffers, but it would be a disaster for in-order CPUs, or for more modest out-of-order execution capabilities in CPUs that existed when memory models were being designed.
Even with huge out-of-order exec capabilities, it introduces quite a lot more speculation, or a larger time-window for speculation, where the CPU might need to nuke the pipeline (discard the ROB). With a large ROB / out-of-order state, that's a lot of work potentially lost. In parallel code that accesses shared memory, this could be bad. (Including false sharing, where two variables happen to be in the same cache line). Penalties for this are already quite substantial and happen even with just loads on current x86 CPUs. (And aren't great even on other ISAs where load ordering isn't required, but the cache-line ping-pong is still a problem).
And cache-miss stores couldn't be hidden as effectively. If your code isn't doing many stores, one that misses in cache can sit in the store buffer for the hundreds of cycles of latency it might take to get the line from DRAM. (Or to get exclusive ownership via a read-for-ownership if the line was dirty in another core, and other cores are contending to read or write it.) If there aren't a lot of stores in the code being executed, it could potentially get hundreds of instructions ahead of the store before the store commits, including multiple loads that come and go. But if all those later loads couldn't retire from the ROB until stores commit, it would stop new potentially-independent instructions from entering the out-of-order back-end and being scheduled to execution units during this delay.
(Most code does quite a bit of storing, though, and a store buffer would quickly fill up. Except on weakly-ordered ISAs that allow StoreStore reordering, so on cache-miss store doesn't bottleneck later stores that hit in cache.)
(I rewrote the above section after realizing that x86 CPUs do speculatively load early, and could apply that to a hypothetical StoreLoad rule as well as x86's actual LoadLoad ordering rule. (Program order + a store buffer with store-forwarding). Some of this section may now be redundant with that.)
How actual CPUs work (when StoreLoad reordering is allowed):
I included some links as part of a brief intro to computer architecture in the early part of my answer on Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. That might be helpful, or more confusing, if you're finding this hard to follow.
CPUs avoid WAR and WAW pipeline hazards for stores by buffering them in a store queue until store instructions are ready to retire. Loads from the same core have to check the store queue (to preserve the appearance of in-order execution for a single thread, otherwise you'd need memory-barrier instructions before loading anything that might have been stored recently!). The store queue is invisible to other threads; stores only become globally visible when the store instruction retires, but loads become globally visible as soon as they execute. (And can use values prefetched into the cache well ahead of that).
See also this answer I wrote explaining store buffers and how they decouple execution from cache-miss store commit, and allow speculative execution of stores. Also wikipedia's article on the classic RISC pipeline has some stuff for simpler CPUs. A store-buffer inherently creates StoreLoad reordering (and also store-forwarding so a core can see its own stores before they become globally visible, assuming the core can do store forwarding instead of stalling.)
So out-of-order execution is possible for stores, but they're only reordered inside the store queue. Since instructions have to retire in order to support precise exceptions, there doesn't appear to be much benefit at all to having the hardware enforce StoreStore ordering.
Since loads become globally visible when they execute, enforcing LoadLoad ordering may require delaying loads after a load that misses in cache. Of course, in reality the CPU would speculatively execute the following loads, and detect a memory-order mis-speculation if it occurs. This is nearly essential for good performance: A large part of the benefit of out-of-order execution is to keep doing useful work, hiding the latency of cache misses.
One of Linus' arguments is that weakly-ordered CPUs require multi-threaded code to use a lot of memory barrier instructions, so they'll need to be cheap for multi-threaded code to not suck. That's only possible if you have hardware tracking the dependency ordering of loads and stores.
But if you have that hardware tracking of dependencies, you can just have the hardware enforce ordering all the time, so software doesn't have to run as many barrier instructions. If you have hardware support to make barriers cheap, why not just make them implicit on every load / store, like x86 does.
His other major argument is that memory ordering is HARD, and a major source of bugs. Getting it right once in hardware is better than every software project having to get it right. (This argument only works because it's possible in hardware without huge performance overhead.)
c.a += 4
after inlining, even things happen in between and the compiler can't prove that no other thread could observe them (via a reference). See my update to my answer. – Blackstock