Are loads and stores the only instructions that get reordered?

I have read many articles on memory ordering, and all of them only say that a CPU reorders loads and stores.

Does a CPU (I'm specifically interested in an x86 CPU) only reorder loads and stores, and not reorder the rest of the instructions that it has?

Alluvion answered 23/5, 2018 at 17:57 Comment(1)
It reorders other instructions too, but you can't observe that effect since the CPU guarantees the same visible result. See also Out of order execution on Wikipedia. – Duckboard

Out-of-order execution preserves the illusion of running in program order for a single thread/core. This is like the C/C++ as-if optimization rule: do whatever you want internally as long as visible effects are the same.

Separate threads can only communicate with each other via memory, so the global order of memory operations (loads/stores) is the only externally visible side-effect of execution¹.

Even in-order CPUs can have their memory operations become globally visible out of order. (e.g. even a simple RISC pipeline with a store buffer will have StoreLoad reordering, like x86). A CPU that starts loads / stores in-order but allows them to complete out of order (to hide cache-miss latency) could also reorder loads if it doesn't specifically avoid it (or like modern x86, execute aggressively out-of-order but pretend that it doesn't by tracking memory ordering carefully).
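
As a concrete sketch of what StoreLoad reordering looks like (an illustration only, assuming the same nasm + ld on x86-64 Linux setup as the example below): each thread in the classic litmus test stores to one flag and then loads the other. The skeleton shows one thread's body; actually observing both loads returning 0 requires running the mirror-image sequence (store y, then load x) on a second core at the same time, which this single-threaded skeleton doesn't do.

global _start
section .bss
x:  resd 1                 ; the two shared flags, both initially 0
y:  resd 1

section .text
_start:
    mov  dword [x], 1      ; store enters the store buffer
  ; mfence                 ; uncomment to forbid StoreLoad reordering on x86
    mov  eax, [y]          ; this load can complete before the store above is
                           ; globally visible, so both threads can read 0
                           ; (a real test would record eax before exiting)

    xor  edi, edi          ; exit(0); a real test would spawn a second thread
    mov  eax, 231          ; sys_exit_group
    syscall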


A simple example: two ALU dependency chains can overlap

(related: http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for more about how big the window is for finding instruction-level parallelism, e.g. if you increased this to times 200 you'd see only limited overlap. Also related: this beginner to intermediate-level answer I wrote about how an OoO CPU like Haswell or Skylake finds and exploits ILP.)

See also Modern Microprocessors: A 90-Minute Guide! for an excellent intro to superscalar and out-of-order exec CPUs.

For a much deeper analysis of the impact of lfence here, see Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.

global _start
_start:
    mov  ecx, 10000000
.loop:
    times 25 imul eax,eax   ; expands to imul eax,eax  / imul eax,eax / ...
 ;   lfence
    times 25 imul edx,edx
 ;   lfence
    dec  ecx
    jnz  .loop

    xor  edi,edi
    mov  eax,231
    syscall          ; sys_exit_group(0)

Built (with nasm + ld) into a static executable on x86-64 Linux, this runs (on Skylake) in the expected 750M clock cycles for each chain of 25 * 10M imul instructions times 3 cycle latency.

Commenting out one of the imul chains doesn't change the time it takes to run: still 750M cycles.

This is definite proof of out-of-order execution interleaving the two dependency chains; otherwise, running the chains back-to-back would take twice as long, like the single times 50 chain below. (imul throughput is 1 per clock, latency 3 clocks. http://agner.org/optimize/. So a third dependency chain could be mixed in without much slowdown.)

Actual numbers from taskset -c 3 ocperf.py stat --no-big-num -etask-clock,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,uops_retired.retire_slots:u -r3 ./imul:

  • with both imul chains: 750566384 +- 0.1%
  • with only the EAX chain: 750704275 +- 0.0%
  • with one times 50 imul eax,eax chain: 1501010762 +- 0.0% (almost exactly twice as slow, as expected).
  • with lfence preventing overlap between each block of 25 imul: 1688869394 +- 0.0%, worse than twice as slow. uops_issued.any and uops_retired.retire_slots are both 63M, up from 51M, while uops_executed.thread is still 51M (lfence doesn't use any execution ports, but apparently two lfence instructions cost 6 fused-domain uops each. Agner Fog only measured 2.)

(lfence serializes instruction execution, but not memory stores. If you aren't using NT loads from WC memory (which won't happen by accident), it's a no-op other than stopping later instructions from executing until previous instructions have "completed locally", i.e. until they've retired from the out-of-order core. This is probably why it more than doubles the total time: it has to wait for the last imul in a block to go through more pipeline stages.)

lfence on Intel is always like that, but on AMD it's only partially-serializing with Spectre mitigation enabled.


Footnote 1: There are also timing side-channels when two logical threads share one physical core (hyperthreading or other SMT). e.g. executing a sequence of independent imul instructions will run at 1 per clock on a recent Intel CPU, if the other hyperthread doesn't need port 1 for anything. So you can measure how much port 1 pressure there is by timing an ALU-bound loop on one logical core.

Other micro-architectural side-channels, such as cache accesses, are more reliable. For example, Spectre / Meltdown are easiest to exploit with a cache-read side-channel, rather than ALU.

But all of these side-channels are finicky and unreliable compared to architecturally-supported reads/writes to shared memory, so they're only relevant for security. They're not used intentionally within the same program for communicating between threads.


MFENCE on Skylake is an OoO exec barrier like LFENCE

mfence on Skylake unexpectedly blocks out-of-order execution of imul, like lfence, even though it's not documented to have that effect. (See the moved-to-chat discussion for more).

xchg [rdi], ebx (implicit lock prefix) doesn't block out-of-order execution of ALU instructions at all. The total time is still 750M cycles when replacing lfence with xchg or a locked instruction in the above test.

But with mfence, the cost goes up to 1500M cycles + the time for 2 mfence instructions. To do a controlled experiment, I kept the instruction-count the same but moved the mfence instructions next to each other, so the imul chains could reorder with each other, and the time went down to 750M + the time for 2 mfence instructions.
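
To make that controlled experiment concrete, here's a sketch of the two loop bodies being compared (they slot into the _start / .loop skeleton above; not necessarily the exact code from the test, and for an xchg variant rdi has to point at a writable dword, e.g. in .bss):

; Variant A: a barrier between the two chains. With mfence (or lfence) the
; chains can't overlap: ~1500M cycles plus the barrier cost. With
; xchg [rdi],ebx or another locked instruction instead, it stays ~750M.
.loop_between:
    times 25 imul eax,eax
    mfence
    times 25 imul edx,edx
    mfence
    dec  ecx
    jnz  .loop_between

; Variant B: same instruction count, but the two barriers are adjacent, so
; the imul chains can reorder with each other again: ~750M cycles plus the
; cost of the two mfence instructions.
.loop_adjacent:
    times 25 imul eax,eax
    times 25 imul edx,edx
    mfence
    mfence
    dec  ecx
    jnz  .loop_adjacent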

This Skylake behaviour is very likely the result of a microcode update to fix erratum SKL079, MOVNTDQA From WC Memory May Pass Earlier MFENCE Instructions. The existence of the erratum shows that it used to be possible to execute later instructions before mfence completed, so probably they did a brute-force fix of adding lfence uops to the microcode for mfence.

This is another factor in favour of using xchg for seq-cst stores, or even lock add to some stack memory as a stand-alone barrier. Linux already does both of those things, but compilers still use mfence for barriers. See Why does a std::atomic store with sequential consistency use XCHG?

(See also discussion about Linux's barrier choices on this Google Groups thread, with links to 3 separate recommendations for using lock addl $0, -4(%esp/rsp) instead of mfence as a stand-alone barrier.)
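
In NASM syntax, those two alternatives look like this (a sketch; [rsp-4] is a dummy location in the x86-64 red zone so the locked RMW doesn't create a dependency on data the surrounding code cares about, and `shared` stands in for whatever variable is actually being stored):

    ; stand-alone full barrier without mfence: any locked RMW orders all
    ; earlier loads/stores before all later ones (for ordinary WB memory)
    lock add dword [rsp-4], 0

    ; seq-cst store of eax to a shared variable with the barrier folded in:
    ; xchg with memory has an implicit lock prefix, so it's store + full barrier
    xchg [shared], eax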

Dietz answered 23/5, 2018 at 19:47 Comment(4)
@SamuelLiew: hrm, there were some useful / interesting microbenchmarking results in those comments which are now significantly harder to find. I guess I'll edit them into this answer for now. I don't really think it was necessary to clean up comments on this answer to a relatively obscure question. I know that a few of the regulars in asm / x86 tags, myself included, "abuse" comments for discussions, but IMO it seems to have been working OK, and it's often possible to find chat comments with Google if I can remember a few keywords and/or names of participants when I want to link it later. – Dietz
The previous comments can be found in this chatroom; if there is valuable information, simply edit it into the answer. – Selfinduced
@SamuelLiew: Thanks for restoring the link. Any idea why it disappeared? Maybe someone flagged it as no longer needed? (Which admittedly is probably true in this specific case; I think I do have the useful stuff in my answer at this point and the dust has pretty much settled on what we were figuring out at the time.) I can ask on meta if this isn't already a well-known thing and more people might want to read your answer. – Dietz
Yeah, it was flagged NLN. I've created a new link that links directly to the first day of the chat transcript. Don't sweat it, just flag your post again if the comment gets deleted. – Selfinduced

Out-of-order processors can generally reorder all instructions where doing so is possible, feasible, and beneficial for performance. Due to register renaming, this is transparent to the machine code except for the case of loads and stores. That's why people usually only talk about load and store reordering, as that's the only observable kind of reordering.
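
To see that renaming transparency in a microbenchmark of the same style as the other answer (a sketch, assuming a Skylake-like core with 3-cycle latency / 1-per-clock throughput imul): every instruction below writes the same architectural register, yet the loop runs near imul throughput instead of serializing, because each mov eax, 1 gets a fresh physical register and the write-after-write / write-after-read hazards on eax disappear.

global _start
_start:
    mov  ecx, 10000000
.loop:
    mov  eax, 1             ; dep-breaking: renaming gives this mov a fresh physical register
    times 12 imul eax,eax   ; 12-imul chain #1 (36 cycles of latency)
    mov  eax, 1             ; same architectural register, new physical register
    times 12 imul eax,eax   ; chain #2 overlaps with chain #1 and with later iterations,
                            ; so the loop runs at roughly 24-26 cycles/iteration (port-1
                            ; throughput), not 72+ as it would with a single physical eax
    dec  ecx
    jnz  .loop

    xor  edi,edi
    mov  eax,231            ; sys_exit_group(0)
    syscall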


Typically, FPU exceptions are also something where you can observe reordering. Some out-of-order processors have imprecise exceptions for this reason.

Overblouse answered 23/5, 2018 at 18:04 Comment(13)
Just to avoid confusion, let's mention that superscalar is a different feature. See also What is general difference between Superscalar and OoO execution? – Duckboard
@Duckboard I clarified my answer. Thank you for your note. – Overblouse
Most OoO CPUs have precise exceptions in general! Otherwise page faults wouldn't be able to resume at the right place. Perhaps you mean that most OoO architectures have imprecise FP exceptions? (Interesting, I didn't know that, but it makes sense because many micro-architectures schedule FP instructions separately from the integer core. e.g. PowerPC even has penalties for an integer load reloading a recent FP store.) – Dietz
@PeterCordes I'd say most (if not all) modern OoO CPUs have precise exceptions. @ fuz Can you give an example of an OoO processor where only FP exceptions are imprecise? "Most out of order processors have imprecise exceptions for this reason": I don't understand this part. Also, how does register renaming provide transparency? I don't think they are related. – Cattleya
@HadiBrais OoO CPUs having imprecise exceptions in general is what was taught to us in our class on high-performance computing. Register renaming is an implementation technique that hides the effects of out-of-order scheduling, because you can always pretend that a register already received the result of the previous instruction. With a traditional register file, you would have to program around the out-of-order execution. – Overblouse
So maybe you didn't mean that only FP exceptions can be imprecise; you meant exceptions in general, not specifically FP exceptions. Because I'm not aware of any architecture where only FP exceptions are imprecise. I've interpreted "transparent" as "automatic correctness" because that's what it usually means. It seems that you meant efficiency. Register renaming is really an efficiency technique. – Cattleya
@HadiBrais I mean, you could also have a traditional register file and just wait until the dependencies are ready, or you could make this waiting explicit. I was talking about FP exceptions, but yes, exceptions in general can be imprecise. I did not say that FP exceptions were the only ones. – Overblouse
But why specifically mention FPU exceptions? How is this typical? Why say "Typically, FPU exceptions are also something where you can observe reordering"? – Cattleya
IMO your class is wrong and most OoO CPUs in common use have precise exceptions, except perhaps for a few rather obscure cases. Of course, precise exceptions are tough to implement in OoO, so especially early on there was this idea that maybe you could get away without them, but it largely didn't pan out that way. – Firedrake
@Firedrake If the textbook they are using in class was written in the 90s or earlier, then OoO architectures with imprecise exceptions were not uncommon at that time. – Cattleya
@HadiBrais The lecture was based on Hennessy & Patterson's famous book. So, the answer is yes, it's a book from the 90s. – Overblouse
Googling showed that there were indeed some machines with imprecise FP exceptions (but precise non-FP exceptions) that were popular around that time, such as Alpha. Sometimes it was called imprecise "arithmetic" exceptions, but it isn't clear if that's just another word for FP or if it could also include integer stuff like div-by-zero. – Firedrake
That strategy seems to have mostly (from what I can tell) fallen by the wayside in the 2000s, as almost everyone is precise, and some of the performance of imprecise FP exceptions can be achieved by other means such as sticky status bits. I'm curious if anyone knows of any archs in common use that still have imprecise exceptions, though! – Firedrake
