How does memory reordering help processors and compilers?

Asked 9/6, 2016 at 12:4 Answered 10/6, 2016 at 4:29

Solved java multithreading optimization compiler-optimization cpu-architecture

I studied the Java memory model and saw re-ordering problems. A simple example:

boolean first = false;
boolean second = false;

void setValues() {
    first = true;
    second = true;
}

void checkValues() {
    while(!second);
    assert first;
}

Reordering is very unpredictable and weird. Also, it ruins abstractions. I suppose that processor architectures must have a good reason to do something that's so inconvenient for programmers. What are those reasons?

There is a lot of information about how to handle reordering, but I can't find anything about why it is needed. Everywhere people just say something like "it is because of some performance benefit". What are the performance benefits in storing second before first, for example?

Can you recommend some article, paper or book about this, or explain it by yourself?

Maible answered 9/6, 2016 at 12:4 Comment(5)

Walk into a cafe and ask for a drink and a sandwich. The person behind the counter hands you the sandwich (which is right next to him), then walks to the fridge to get your drink. Do you care that he gave them to you in the "wrong" order? Would you rather he did the slow one first, simply because that's how you gave the order? – Morelos 9/6, 2016 at 12:9

Occasionally it does matter though. You wouldn't want a warm drink on a hot day would you? So you'd like the drink to be fetched last. – Brummell 9/6, 2016 at 12:15

is your code supposed to do anything else than immediately throwing an Exception? Im guessing you dont really understand the term "reordering", stored values never change but their FETCHING STRATEGY will. – Walachia 9/6, 2016 at 12:23

Modern CPUs are complex devices, that can execute multiple instructions at the same time, if there's no data dependency between the instructions. Depending on the CPU, putting the instructions in a certain order other than what you did in the source code will make it run faster. See Out-of-order execution. – Rangefinder 9/6, 2016 at 12:51

@Jesper: Compile-time reordering more importantly allows multiple operations on the same shared variable to be folded together. e.g. multiple increments from multiple calls to a function that increments something can turn into a single c.a += 4 after inlining, even things happen in between and the compiler can't prove that no other thread could observe them (via a reference). See my update to my answer. – Blackstock 10/6, 2016 at 19:11

TL;DR: It gives the compiler and hardware more room to take advantage of the as-if rule by not requiring it to preserve all behaviour of the original source, only the result of the single thread itself.

Taking the externally-observable (from other threads) ordering of loads/stores out of the picture as something that optimizations must preserve gives the compiler a lot of room to merge things into fewer operations. For the hardware, delaying stores is the big one, but for compilers all kinds of reordering can help.

(Letting cache-hit loads finish before earlier cache-miss loads is also useful for hardware, and same for stores, without having to do load-order speculation with possible roll-back. Jeff Preshing's Memory Barriers Are Like Source Control Operations has useful analogies and descriptions of plausible ways hardware can do it. But modern x86 CPUs throw enough transistors and power at these ordering requirements to manage them without huge performance downsides in most code.)

(See part-way down for a section on why it helps the compiler)

Why it helps hardware

Hardware reordering earlier stores with later loads (StoreLoad reordering) inside the CPU is essential for out-of-order execution. (See below).

Other kinds of reordering (e.g. StoreStore reordering, which is the subject of your question) aren't essential, and high performance CPUs can be built with only StoreLoad reordering, not the other three kinds. (The prime example being tag:x86, where every store is a release-store, every load is an acquire-load. See the x86 tag wiki for more details.)

Some people, like Linus Torvalds, argue that reordering stores with other stores doesn't help the hardware much, because hardware already has to track store-ordering to support out-of-order execution of a single thread. (A single thread always runs as if all of its own stores/loads happen in program order.) See other posts in that thread on realworldtech if you're curious. And/or if you find Linus's mix of insults and sensible technical arguments entertaining :P

For Java, the issue is that, architectures exist where the hardware doesn't provide these ordering guarantees. Weak memory ordering is a common feature of RISC ISAs like ARM, PowerPC, and MIPS. (But not SPARC-TSO). The reasons behind that design decision are the same ones being argued over in the realworldtech thread I linked: make the hardware simpler, and let software request ordering when needed.

So Java's architect(s) didn't have much of a choice: Implementing a JVM for an architecture with a weaker memory model than the Java standard would require a store-barrier instruction after every single store, and a load-barrier before every load. (Except when the JVM's JIT-compiler can prove that no other thread can have a reference to that variable.) Running barrier instructions all the time is slow.

A strong memory model for Java would make efficient JVMs on ARM (and other ISAs) impossible. Proving that barriers aren't needed is near-impossible, requiring AI levels of global program-understanding. (This goes WAY beyond what normal optimizers do).

Why it helps compilers

(see also Jeff Preshing's excellent blog post on C++ compile-time reordering. This basically applies to Java when you include JIT-compiling to native code as part of the process.)

Another reason for keeping the Java and C/C++ memory models weak is to allow more optimizations. Since other threads are allowed (by the weak memory model) to observe our stores and loads in any order, aggressive transformations are allowed even when the code involves stores to memory.

e.g. in a case like Davide's example:

c.a = 1;
c.b = 1;
c.a++;
c.b++;

// same observable effects as the much simpler
c.a = 2;
c.b = 2;

There's no requirement that other threads be able to observe the intermediate states. So a compiler can just compile that to c.a = 2; c.b = 2;, either at Java-compile time or when the bytecode is JIT-compiled to machine code.

It's common for a method that increments something to be called multiple times from another method. Without this rule, turning it into c.a += 4 could only happen if the compiler could prove that no other thread could observe the difference.

C++ programmers sometimes make the mistake of thinking that since they're compiling for x86, they don't need std::atomic<int> to get some ordering guarantees for a shared variable. This is wrong, because optimizations happen based on the as-if rule for the language memory model, not the target hardware.

More technical hardware explanations:

Why StoreLoad reordering helps performance:

Once a store is committed into cache, it becomes globally visible to threads running on other cores (via the cache-coherency protocol). At that point, it's too late to roll it back (another core might already have gotten a copy of the value). So it can't happen until it's known for certain that the store won't fault, and neither will any instruction before it (i.e. at or after retirement from the out-of-order back-end). And that there wasn't a branch-mispredict at some point earlier, etc. etc. i.e. we need to rule out all cases of mis-speculation before we can commit a store instruction to L1d cache. (This is why we have a store buffer)

Without StoreLoad reordering, every load would have to wait for all preceding stores to retire (i.e. be totally finished executing, known non-speculative) and actually commit the data to cache, before they could read a value from cache for use by later instructions that depend on the value loaded. (The moment when a load copies a value from cache into a register is the critical time when it "happens" as part of the coherency order of loads and stores on that memory location.)

Normally CPUs commit stores from the store-buffer to L1d cache after the corresponding store instruction has retired from the out-of-order back-end (ReOrder Buffer = ROB). Some amount of "graduated" stores can be in the store buffer, so this decouples execution from cache-miss stores. But you could give up that benefit and make commit to L1d happen as part of retirement. (Execution of a store would still work by writing the address+data into the store buffer, so it can be speculative, happening when the data is ready. The store buffer keeps this speculation private to the core.)

Without StoreLoad reordering, loads couldn't execute until all earlier stores had committed to cache. This would be a huge roadblock for memory parallelism. i.e. every load would be like an x86 lfence, draining the out-of-order back-end, and like mfence waiting for the store buffer to empty since we're proposing that commit would happen at retirement, not after. Including waiting for any earlier cache miss loads or stores, and waiting for the CPU to chew through all dependency chains, so it would tend to serialize things instead of letting the CPU overlap independent iterations of a loop, or other long dep chains.

Modern x86 CPUs do speculative early loads ahead of other (cache-miss) loads, potentially taking a memory-order mis-speculation pipeline nuke if they detect that their copy of the cache line didn't stay valid from when the load actually happened to when it's architecturally allowed. In that case they discard the ROB contents to roll back to a consistent retirement state and start executing again. (This normally only happens when another core is modifying a cache line, but can also happen if it incorrectly predicted that a load wouldn't reload a store.) (Of course real x86 can freely reorder loads ahead of stores.)

If StoreLoad reordering wasn't allowed, a CPU could still do loads speculatively, but would probably still have to commit stores earlier than normal CPUs. Load speculation could track stores in the store buffer post-retirement.

So really the limitation would be that loads can't retire until earlier stores commit. Those stores can still stay in the store buffer after they retire (become non-speculative). Which doesn't sound that bad on modern CPUs with huge ROBs and large store buffers, but it would be a disaster for in-order CPUs, or for more modest out-of-order execution capabilities in CPUs that existed when memory models were being designed.

Even with huge out-of-order exec capabilities, it introduces quite a lot more speculation, or a larger time-window for speculation, where the CPU might need to nuke the pipeline (discard the ROB). With a large ROB / out-of-order state, that's a lot of work potentially lost. In parallel code that accesses shared memory, this could be bad. (Including false sharing, where two variables happen to be in the same cache line). Penalties for this are already quite substantial and happen even with just loads on current x86 CPUs. (And aren't great even on other ISAs where load ordering isn't required, but the cache-line ping-pong is still a problem).

And cache-miss stores couldn't be hidden as effectively. If your code isn't doing many stores, one that misses in cache can sit in the store buffer for the hundreds of cycles of latency it might take to get the line from DRAM. (Or to get exclusive ownership via a read-for-ownership if the line was dirty in another core, and other cores are contending to read or write it.) If there aren't a lot of stores in the code being executed, it could potentially get hundreds of instructions ahead of the store before the store commits, including multiple loads that come and go. But if all those later loads couldn't retire from the ROB until stores commit, it would stop new potentially-independent instructions from entering the out-of-order back-end and being scheduled to execution units during this delay.

(Most code does quite a bit of storing, though, and a store buffer would quickly fill up. Except on weakly-ordered ISAs that allow StoreStore reordering, so on cache-miss store doesn't bottleneck later stores that hit in cache.)

(I rewrote the above section after realizing that x86 CPUs do speculatively load early, and could apply that to a hypothetical StoreLoad rule as well as x86's actual LoadLoad ordering rule. (Program order + a store buffer with store-forwarding). Some of this section may now be redundant with that.)

How actual CPUs work (when StoreLoad reordering is allowed):

I included some links as part of a brief intro to computer architecture in the early part of my answer on Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. That might be helpful, or more confusing, if you're finding this hard to follow.

CPUs avoid WAR and WAW pipeline hazards for stores by buffering them in a store queue until store instructions are ready to retire. Loads from the same core have to check the store queue (to preserve the appearance of in-order execution for a single thread, otherwise you'd need memory-barrier instructions before loading anything that might have been stored recently!). The store queue is invisible to other threads; stores only become globally visible when the store instruction retires, but loads become globally visible as soon as they execute. (And can use values prefetched into the cache well ahead of that).

See also this answer I wrote explaining store buffers and how they decouple execution from cache-miss store commit, and allow speculative execution of stores. Also wikipedia's article on the classic RISC pipeline has some stuff for simpler CPUs. A store-buffer inherently creates StoreLoad reordering (and also store-forwarding so a core can see its own stores before they become globally visible, assuming the core can do store forwarding instead of stalling.)

So out-of-order execution is possible for stores, but they're only reordered inside the store queue. Since instructions have to retire in order to support precise exceptions, there doesn't appear to be much benefit at all to having the hardware enforce StoreStore ordering.

Since loads become globally visible when they execute, enforcing LoadLoad ordering may require delaying loads after a load that misses in cache. Of course, in reality the CPU would speculatively execute the following loads, and detect a memory-order mis-speculation if it occurs. This is nearly essential for good performance: A large part of the benefit of out-of-order execution is to keep doing useful work, hiding the latency of cache misses.

One of Linus' arguments is that weakly-ordered CPUs require multi-threaded code to use a lot of memory barrier instructions, so they'll need to be cheap for multi-threaded code to not suck. That's only possible if you have hardware tracking the dependency ordering of loads and stores.

But if you have that hardware tracking of dependencies, you can just have the hardware enforce ordering all the time, so software doesn't have to run as many barrier instructions. If you have hardware support to make barriers cheap, why not just make them implicit on every load / store, like x86 does.

His other major argument is that memory ordering is HARD, and a major source of bugs. Getting it right once in hardware is better than every software project having to get it right. (This argument only works because it's possible in hardware without huge performance overhead.)

Blackstock answered 10/6, 2016 at 4:29 Comment(21)

thanks for your great answer. I don't understand what do you mean in paragraph: "For Java, the issue is that...". Especially, you said: "For Java, the issue is that, architectures exist where the hardware doesn't provide these ordering guarantees. " I don't know why it is a problem for Java. I know that Java is compiled to bytecode and executed by JVM. After all, these architectures guarantes different memory model also for other languages, like C++ as well. So, why is it a problem with Java ( JVM) ? – Arakawa 15/7, 2016 at 16:29

And second issue: You said: "So Java doesn't have much of a choice:.." Do you mean that different memory model forces programmers to implement different JVM for different architectures? – Arakawa 15/7, 2016 at 16:29

@Gilgamesz: I was talking about the design decisions that Java's architect(s) had to make about Java's memory model. Java would be easier to program if it provided a strong memory model instead of requiring explicit ordering semantics, but that would make it impossible to implement a high-performance JVM on weakly-ordered hardware. (As well as seriously limiting the compile-time optimizer). – Blackstock 15/7, 2016 at 19:15

1. "but that would make it impossible to implement a high-performance JVM on weakly-ordered hardware. ". I understand that it is casued by the fact, that on weakly-ordered hardware JVM would "emulate" strong memory model by putting memory barriers in "every place", perhaps unnecessary. Am I right? 2. "Java would be easier to program". You mean obviously lock-free programming because when we use locks ( synchronized) we don't care of about ordering in explicit way ( I mean memory barrier). Yeah? – Arakawa 15/7, 2016 at 19:33

@Gilgamesz: 1: yes, exactly like I said in my answer, right after the bit you quoted. The JVM would need AI levels of cleverness to figure out which operations actually need barriers, so it would have to use extra barriers everywhere. – Blackstock 15/7, 2016 at 19:39

@Gilgamesz: 2: yes, locking gives you acquire/release semantics. Taking a lock is an acquire barrier. But even in code that takes a lock, the JVM doesn't know that it doesn't also depend on implicit strong ordering. (That would be weird, but possible). – Blackstock 15/7, 2016 at 19:41

@Gilgamesz: I just made an edit. Does that help make the answer more clear for future readers? I have a hard time imagining what it's like to not know all the stuff I know, or to have a different way of thinking about things. – Blackstock 15/7, 2016 at 19:51

Ok, thanks for your edit. It is more clear, I think so. Generally, your posts are excellent and if I don't understand something it means that the problem is with me, not your great explanation :). – Arakawa 15/7, 2016 at 20:4

@Gilgamesz: hehe, I'd agree with that, but it's always nice to get feedback on things that people didn't find clear. If I can easily explain it more clearly, then I will. Other times, it's an issue of needing other knowledge to understand the explanation, and then I just link to wikipedia or something. – Blackstock 15/7, 2016 at 20:14

"CPUs avoid WAR and WAW pipeline hazards for stores by buffering them in a store queue until store instructions are ready to retire. Loads from the same core have to check the store queue (to preserve the appearance of in-order execution for a single thread, otherwise you'd need memory-barrier instructions before loading anything that might have been stored recently!)." So it means, that stores must be placed in store queue in program order, does it? – Arakawa 15/7, 2016 at 20:18

" If you have hardware support to make barriers cheap, why not just make them implicit on every load / store, like x86 does." It is not clear. x86 does implicit on every load / store memory barrier? If so, what about StoreLoad reordering allowed by x86? What does it mean that it is cheap? – Arakawa 15/7, 2016 at 20:31

@Gilgamesz: No. Stores can execute out of order. This is one of those cases where you should just read the links (e.g. en.wikipedia.org/wiki/Memory_disambiguation) for the necessary background info. The store queue keeps track of them, with ordering information, until they're ready to retire and become globally visible. On x86, they become globally visible in program order, but not on weakly-ordered architectures. – Blackstock 15/7, 2016 at 21:10

@Gilgamesz: cheap barriers: I'm talking about StoreStore, LoadLoad, and LoadStore barriers. The ones that are implicit on x86, and are cheaper than full barriers on high-performance implementations of weakly-ordered architectures. Those barriers can be made cheap because the necessary ordering info has to be tracked anyway to correctly execute a single thread out of order StoreLoad barriers can't be made that cheap. re: how cheap is cheap? IDK exactly. But much cheaper than a full barrier that flushes the queue. – Blackstock 15/7, 2016 at 21:12

@PeterCordes re I don't think hardware could hide this delay in starting loads by speculating that it isn't a problem, and then detecting mis-speculation after the fact. Interesting, I'm curious why e.g. Intel won't implement such a mechanism to preclude StoreLoad reordering, after all they have a similar memory order machine clear for LoadLoad reordering speculation? – Whisenant 11/11, 2021 at 22:49

@DanielNitzan: IDK what my thought process was 5 years ago when I wrote that; I was pretty new at that point to thinking about these things. Currently, I think you might be right that a core could speculatively load early, and make sure the cache line wasn't invalidated between the actual load time and the earliest time the load was architecturally allowed to happen (after the store committed). That could still lead to some stalls, though, if the store still hasn't committed when this load would otherwise be ready to retire. And of course lead to possible machine clears. – Blackstock 11/11, 2021 at 23:3

@DanielNitzan: Finally got back to this answer and rewrote that section. With speculative load tracking and large ROB + store buffer, a seq_cst memory model might not be a total disaster for a modern CPU, at least when accessing non-shared memory. – Blackstock 26/7, 2022 at 15:46

@PeterCordes re your example in "Why it helps compilers", I think atomic mo_relaxed store and fetch_add are enough to kill the optimization (it's not a reordering problem, rather non-atomic access?). godbolt.org/z/c84fY9GaM VS godbolt.org/z/sor41TrY5 – Whisenant 29/7, 2022 at 8:26

@DanielNitzan: Not sure what that proves; of course an atomic RMW is slow, and compilers don't optimize atomics. The load and store are tied together by being an RMW. Not a separate .load and .store. Also, you don't have any surrounding operations that can reorder across the fetch_add to do dead-store elimination or CSE. – Blackstock 29/7, 2022 at 17:21

@PeterCordes Right, my examples aren't practical because compilers don't optimize atomics anyway. ISO C++ allows store/rmw folding/collapsing optimizations even for atomics, under any memory ordering. The point I was trying to make is that these optimizations are not considered store/load reordering according to CPP's definition of memory ordering? – Whisenant 2/8, 2022 at 9:14

@DanielNitzan: Still not sure what point you're making. Nothing else can sync-with that specific load or RMW because atomic<int> a is a local variable and no reference to it escapes the function. So yes, it could be totally optimized away, or one or both of the operations removed, or merged together. With orders stronger than relaxed, on shared variables, you could maybe have a case where some atomic ops (especially seq_cst) keep other surrounding non-atomic operations from combining with each other if they operate on potentially-shared memory. – Blackstock 2/8, 2022 at 9:22

@DanielNitzan: I wouldn't call it reordering to remove something that can't possibly be observed by another thread in the first place, though. Just optimization. Removing stuff in asm can potentially allow more run-time reordering of other operations in the program. – Blackstock 2/8, 2022 at 9:24

Imagine to have the following code:

a = 1;
b = 1;
a = a + 1;   // Not present in the register
b = b + 1;   // Not present in the register
a = a + 1;   // Not present in the register
b = b + 1;   // Not present in the register
// Here both a and b has value 3

A possible optimization using memory reorder is

a = 1;
a = a + 1;   // Already in the register
a = a + 1;   // Already in the register
b = 1;
b = b + 1;   // Already in the register
b = b + 1;   // Already in the register
// Here both a and b has value 3

The performance is better because the data are presents in the register.

Note that there are many different levels of optimization, but this will give you an idea why reordering can improve performances.

Bellamy answered 9/6, 2016 at 12:11 Comment(3)

This is about memory ordering, not registers. Are a and b supposed to be locals? And you're saying that on a machine with a single accumulator register, loading b requires spilling a? – Blackstock 10/6, 2016 at 4:38

The real optimization is to do one store that sets a=3, because after reordering you can combine the separate a = a + 1. (same for b). If that reordering wasn't allowed, another thread could never observe |a-b| > 1. But since it could legally observe that in the Java memory model, the optimizer can rearrange the program to make it more efficient while still producing the same externally-observable effects. – Blackstock 10/6, 2016 at 4:42

@PeterCordes obviously. Infact I added the note at the end. But this can give an idea on how reorder can affect performances. A real optimization can render the problem difficult to be read. – Bellamy 10/6, 2016 at 10:4

On a modern processor chip, the processor can typically perform register to register operations an order of magnitude (or more) faster that fetching from main memory. Operations that hit the L1 or L2 caches are faster than main memory, slower than register to register. The other thing to note is that modern processors chips typically use a pipeline which allows different parts of different instructions to be executed at the same time.

With this in mind, reordering of operations is typically done to avoid situations where the pipeline (fast) has to wait for an operation on main memory (slow) to complete:

Davide's example illustrates reordering that avoids memory reads and writes entirely. (At least, that is his intention. In reality, the reordering is done at the native instruction level, not the source code or bytecode level.)
In other cases, you might find that the instructions to do a = a + 1 and b = b + 1 get interleaved; e.g.
```
1) load a -> r1
2) load b -> r2
3) r1 + 1 -> r3
4) r2 + 1 -> r4
5) save r3 -> a
6) save r4 -> b
```
In a pipeline architecture, this might allow 2) and 3) to happen at the same time, 4) and 5) to happen at the same time and so on.

The final thing to note is that a modern processor chip / instruction set avoids reading from main memory and writing to main memory as much as possible. Indeed, it is common for a write instruction to write into L1 or L2 cache, and delay the (slow) write to main memory until the cache-line is flushed. This leads to a different kind of "memory anomaly" ... where a separate thread running on a different core does not see memory updates because the respective writes have not (yet) been flushed.

The Java Memory Model is designed to allow the compiler / processor to optimize performance of a multi-threaded application, as above. It makes it clear when one thread is guaranteed to see memory changes made by another thread. The compiler / processor are allowed to reorder, etc in cases where are no visibility guarantees. This reordering can make a big difference in overall performance.

Hajji answered 9/6, 2016 at 12:50 Comment(2)

+1 Scheduling memory io to avoid conflicts can be very important. Theres more than just lowering register pressure. – Senega 9/6, 2016 at 16:31

SMP systems are cache-coherent. Once a store is committed to L1 cache, it's globally visible. StoreLoad reordering happens because stores get buffered in a private store queue before committing them to the cache, to enable out-of-order execution. And even a modern in-order CPU will still support some buffering of stores to hide latency. – Blackstock 10/6, 2016 at 4:34

Walk into a cafe and ask for a drink and a sandwich. The person behind the counter hands you the sandwich (which is right next to him), then walks to the fridge to get your drink.

Do you care that he gave them to you in the "wrong" order? Would you rather he did the slow one first, simply because that's how you gave the order?

Well, maybe you do care. Maybe you want to stuff the uneaten sandwich into your empty drink cup (you paid for them, so why not, if you want to). You are frustrated by the fact you have to hold the sandwich while your drink is fetched - you could have used that time to drink your drink, after all, and you wouldn't end up with hiccups, because you're in a hurry!

But that's what happens if you order a few things without specifying the order in which they must happen. The server isn't aware of your unusual sandwich-cup-stuffing habit, and so it seems to them like the ordering doesn't matter.

We have constructs in natural language to specify the ordering ("Please give me a drink, then give me a sandwich") or not ("Please give me a drink and a sandwich"). If you're not careful to use the former rather than the latter, it will be assumed that you just want the end result, and the various steps can be reordered for convenience's sake.

Similarly, in the JMM, if you're not specific about the ordering of operations, it is assumed that the operations can be reordered.

Morelos answered 9/6, 2016 at 12:18 Comment(1)

I like the idea of the analogy, but unfortunately this one isn't quite perfect. The golden rule of out-of-order execution is: never break a single-threaded program. i.e. a single-thread always appears to execute in program order. Same at the Java source-code level; You don't have to do anything to specify that a = 1 will never be reordered with b = a. Reordering only affects what other threads observe. – Blackstock 10/6, 2016 at 18:43

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++