Why flush the pipeline for Memory Order Violation caused by other logical processors?
The Memory Order Machine Clear performance event is described by the vTune documentation as:

The memory ordering (MO) machine clear happens when a snoop request from another processor matches a source for a data operation in the pipeline. In this situation the pipeline is cleared before the loads and stores in progress are retired.

However, I don't see why that should be the case: there is no synchronisation order between loads and stores on different logical processors.
The processor could just pretend the snoop happened after all the current in-flight data operations are committed.

The issue is also described here

A memory ordering machine clear gets triggered whenever the CPU core detects a “memory ordering conflict”. Basically, this means that some of the currently pending instructions tried to access memory that we just found out some other CPU core wrote to in the meantime. Since these instructions are still flagged as pending while the “this memory just got written” event means some other core successfully finished a write, the pending instructions – and everything that depends on their results – are, retroactively, incorrect: when we started executing these instructions, we were using a version of the memory contents that is now out of date. So we need to throw all that work out and do it over. That’s the machine clear.

But that makes no sense to me: the CPU doesn't need to re-execute the loads in the Load Queue, as there is no total order for non-locked loads/stores.

I could see a problem if loads were allowed to be reordered:

;foo is 0
mov eax, [foo]    ;inst 1
mov ebx, [foo]    ;inst 2
mov ecx, [foo]    ;inst 3

If the execution order were 1 3 2, then a store like mov [foo], 1 happening between 3 and 2 would cause

eax = 0
ebx = 1
ecx = 0

which would indeed violate the memory ordering rules.
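
Out of curiosity I sketched this scenario as a runnable test. This is a hypothetical C++ harness of my own, nothing official: one thread performs the single store of 1 while the main thread does the three back-to-back loads, counting how often the forbidden a=0, b=1, c=0 pattern appears. (The C++ memory model already forbids that pattern for a single atomic variable through read-read coherence, so this probes the hardware guarantee rather than the language.)

#include <atomic>
#include <cstdio>
#include <thread>

constexpr long kIters = 1000000;
std::atomic<int> foo{0};
std::atomic<long> turn{0};          // handshake: odd = writer may fire, even = round done

void writer() {
    for (long i = 0; i < kIters; ++i) {
        while (turn.load(std::memory_order_acquire) != 2*i + 1) {} // wait until armed
        foo.store(1, std::memory_order_relaxed);                   // the single racing store
        turn.store(2*i + 2, std::memory_order_release);
    }
}

int main() {
    std::thread w(writer);
    long bad = 0;
    for (long i = 0; i < kIters; ++i) {
        foo.store(0, std::memory_order_relaxed);        // foo is 0
        turn.store(2*i + 1, std::memory_order_release); // let the writer store 1
        int a = foo.load(std::memory_order_relaxed);    // inst 1
        int b = foo.load(std::memory_order_relaxed);    // inst 2
        int c = foo.load(std::memory_order_relaxed);    // inst 3
        if (a == 0 && b == 1 && c == 0) ++bad;          // would need inst 3 to pass inst 2
        while (turn.load(std::memory_order_acquire) != 2*i + 2) {} // writer done this round
    }
    w.join();
    std::printf("0,1,0 observed %ld times\n", bad);     // stays 0 on x86
}

As expected, the counter stays at zero no matter how long it runs.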

But loads cannot be reordered with other loads, so why do Intel's CPUs flush the pipeline when a snoop request from another core matches the source of any in-flight load?
What erroneous situations is this behaviour preventing?

Curse answered 7/4, 2019 at 19:52 Comment(1)
TL:DR: because x86 CPUs speculatively load out of order to achieve memory parallelism and avoid coupling dependency chains together if they both spill/reload. – Nataline

Although the x86 memory ordering model does not allow loads to any memory type other than WC to be globally observable out of program order, the implementation actually allows loads to complete out of order. It would be very costly to stall issuing a load request until all previous loads have completed. Consider the following example:

load X
load Y
load Z

Assume that line X is not present in the cache hierarchy and has to be fetched from memory, while both Y and Z are present in the L1 cache. One way to maintain the x86 load ordering requirement is to not issue loads Y and Z until load X gets its data. However, this would stall all instructions that depend on Y and Z, resulting in a potentially massive performance hit.
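
To get a sense of the cost, consider the following illustrative C++ microbenchmark (a sketch of my own with made-up names; absolute numbers vary by machine). The first loop is a pointer chase, where every load address depends on the previous load's data, so cache misses cannot overlap; this is essentially the serialization that fully in-order loads would impose. The second loop issues independent loads whose misses can all be in flight at once:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    constexpr size_t N = size_t(1) << 24;   // 16M uint32_t = 64 MiB, far bigger than L3
    std::vector<uint32_t> next(N);
    std::iota(next.begin(), next.end(), uint32_t{0});
    std::mt19937_64 rng(42);
    for (size_t i = N - 1; i > 0; --i) {    // Sattolo's algorithm: one big cycle
        size_t j = std::uniform_int_distribution<size_t>(0, i - 1)(rng);
        std::swap(next[i], next[j]);
    }

    auto seconds = [](auto f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        std::chrono::duration<double> d = std::chrono::steady_clock::now() - t0;
        return d.count();
    };

    uint64_t sink = 0;
    double dep = seconds([&] {              // each address depends on the previous load
        uint32_t p = 0;
        for (size_t i = 0; i < N; ++i) p = next[p];
        sink += p;
    });
    double ind = seconds([&] {              // addresses independent of prior loads
        uint64_t s = 0;
        for (size_t i = 0; i < N; ++i) s += next[next[i]];
        sink += s;
    });
    std::printf("dependent: %.2fs  independent: %.2fs  (%llu)\n",
                dep, ind, (unsigned long long)sink);
}

On a typical out-of-order core, the independent version should run several times faster; that gap is the memory-level parallelism that stalling loads in program order would sacrifice.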

Multiple solutions have been proposed and studied extensively in the literature. The one that Intel has implemented in all of its processors is to allow loads to be issued out of order and then to check whether a memory ordering violation has occurred, in which case the violating load is reissued and all of its dependent instructions are replayed. But this violation can only occur when the following conditions are satisfied:

  • A load has completed while a previous load in program order is still waiting for its data and the two loads are to a memory type that requires ordering.
  • Another physical or logical core has modified the line read by the later load and this change has been detected by the logical core that issued the loads before the earlier load gets its data.

When both of these conditions occur, the logical core detects a memory ordering violation. Consider the following example:

------           ------
core1            core2
------           ------
load rdx, [X]    store [Y], 1
load rbx, [Y]    store [X], 2
add  rdx, rbx
call printf

Assume that the initial state is:

  • [X] = [Y] = 0.
  • The cache line that contains Y is already present in the L1D of core1. But X is not present in the private caches of core1.
  • Line X is present in the L1D of core2 in a modifiable coherence state and line Y is present in the L1D of core2 in a shareable state.

According to the x86 strong ordering model, the only values core1 can legally print are 0, 1, and 3. In particular, printing 2 is not legal.
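
This is easy to probe from C++ with a litmus harness like the following (a hypothetical sketch of mine, not part of the original example). The relaxed atomics compile to plain mov on x86, and the signal fences are compiler-only barriers that keep the compiler itself from reordering the accesses while emitting no fence instructions:

#include <atomic>
#include <cstdio>
#include <thread>

constexpr long kIters = 2000000;
std::atomic<int> X{0}, Y{0};
std::atomic<long> go{0}, done{0};

void core2() {
    for (long i = 1; i <= kIters; ++i) {
        while (go.load(std::memory_order_acquire) < i) {}     // wait for round i
        Y.store(1, std::memory_order_relaxed);                // store [Y], 1
        std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier only
        X.store(2, std::memory_order_relaxed);                // store [X], 2
        done.store(i, std::memory_order_release);
    }
}

int main() {
    std::thread t(core2);
    long illegal = 0;
    for (long i = 1; i <= kIters; ++i) {
        X.store(0, std::memory_order_relaxed);                // reset: [X] = [Y] = 0
        Y.store(0, std::memory_order_relaxed);
        go.store(i, std::memory_order_release);               // start round i
        int rdx = X.load(std::memory_order_relaxed);          // load rdx, [X]
        std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier only
        int rbx = Y.load(std::memory_order_relaxed);          // load rbx, [Y]
        if (rdx + rbx == 2) ++illegal;                        // only rdx=2, rbx=0 sums to 2
        while (done.load(std::memory_order_acquire) < i) {}   // wait for core2
    }
    t.join();
    std::printf("illegal outcome 2 seen %ld times\n", illegal); // expect 0 on x86
}

If I'm not mistaken, running this under perf stat -e machine_clears.memory_ordering (the Linux perf name for the event vTune reports) should show a substantial count while the illegal outcome stays at zero: the machine clears are precisely what keeps it from becoming visible.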

The following sequence of events may occur:

  • Core2 issues RFOs for both lines. The RFO for line X will complete quickly but the RFO for line Y will have to go all the way to the L3 to invalidate the line in the private caches of core1. Note that core2 can only commit the stores in order, so the store to line X waits until the store to line Y commits.
  • Core1 issues the two loads to the L1D. The load from line Y completes quickly, but the load from X requires fetching the line from core2's private caches. Note that the value of Y at this point is zero.
  • Line Y is invalidated from core1's private caches and its state in core2 is changed to a modifiable coherence state.
  • Core2 now commits both stores in order.
  • Line X is forwarded from core2 to core1.
  • Core1 loads from cache line X the value stored by core2, which is 2.
  • Core1 prints the sum of X and Y, which is 0 + 2 = 2. This is an illegal outcome. Essentially, core1 has loaded a stale value of Y.

To maintain the ordering of loads, core1's load buffer has to snoop all invalidations to lines resident in its private caches. When it detects that line Y has been invalidated while there are pending loads that precede the completed load from the invalidated line in program order, a memory ordering violation occurs and the load has to be reissued after which it gets the most recent value. Note that if line Y has been evicted from core1's private caches before it gets invalidated and before the load from X completes, it may not be able to snoop the invalidation of line Y in the first place. So there needs to be a mechanism to handle this situation as well.
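
In C-like terms, the check might look like the following conceptual sketch (my paraphrase of the rule above, with invented names; real load buffers do this with content-addressable match logic, not a loop). It fires exactly when the two conditions listed earlier hold:

#include <cstdint>
#include <vector>

struct LoadBufferEntry {
    uint64_t line;        // cache line the load read (valid once completed)
    bool     completed;   // data already forwarded to dependent uops
};

// Entries are held oldest-first in program order. Returns true if an
// invalidating snoop for 'line' must trigger a memory-ordering machine clear.
bool snoop_hits_ordering_violation(const std::vector<LoadBufferEntry>& lb,
                                   uint64_t line) {
    bool older_load_pending = false;
    for (const auto& e : lb) {                    // walk oldest -> youngest
        if (!e.completed) { older_load_pending = true; continue; }
        if (older_load_pending && e.line == line)
            return true;   // a completed load passed a still-pending older load
                           // and its line just got invalidated: flush and replay
    }
    return false;
}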

If core1 never uses one or both of the values loaded, a load ordering violation may occur, but it can never be observed. Similarly, if the values stored by core2 to lines X and Y are the same, a load ordering violation may occur, but is impossible to observe. However, even in these cases, core1 would still unnecessarily reissue the violating load and replay all of its dependencies.

Joanne answered 7/4, 2019 at 20:28 Comment(11)
I'm probably wrong, but are all the letters right in the second and last paragraphs? Anyway, in your last example, why would load Z read stale data? There is not a single total ordering between the two cores. Putting another store in core2 (e.g. store X) could create a memory order violation, as load Z cannot see the old value if load X sees the new one. Anyway, the fact that loads can complete OoO explains why this flush is needed. Thank you very much! – Curse
@MargaretBloom Thank you, you are right. I've fixed the example. – Joanne
You wouldn't have to delay issue of loads into the ROB + RS. (Or were you using the other terminology where "issue" means what Intel calls "dispatch" to an execution unit?) Without speculative load ordering, you could still get memory parallelism for cache misses by prefetching everything that's going to be needed. But only let loads take data from L1d in program order. (But that would still be terrible for performance by coupling all loads into a dependency chain, destroying OoO exec for code with store/reload in 2 separate dep chains.) – Nataline
So IDK, that overly simplistic strategy is almost a straw-man argument because that's obviously way worse than needed without speculating. Unless you meant dispatch, then it still destroys memory parallelism by preventing miss under miss or hit under miss. – Nataline
Fun fact: a single core can trigger memory-order mis-speculation on its own. But I think this is for incorrectly-predicted aliasing between stores and loads, resulting in a bad state due to guessing that a load didn't need to forward from any previous stores (whose addresses weren't ready yet), and later discovering that store-forwarding was needed. (At least I think that's the mechanism.) It shows up under the same perf counter, but it's sort of a separate thing. – Nataline
@PeterCordes That condition is called memory disambiguation misprediction, where a later load is incorrectly predicted to not be dependent on a previous store. I think the event you're thinking of is MACHINE_CLEARS.MEMORY_ORDERING, which indeed counts both disambiguation mispredictions and memory ordering violations. Another type of disambiguation misprediction is 4K aliasing, but this has a dedicated counter. By "issue" I meant issuing the load from the load buffer, not from the RS or to the RS. – Joanne
Yes exactly, I was surprised the first time I saw MACHINE_CLEARS.MEMORY_ORDERING mis-speculation get non-zero counts for a single thread. re: issue from load buffers: ok, that makes sense. That's probably as far as you can go with speculative loads without needing a RS / ROB rollback, instead of "just" having the load buffer watch the cache line and signalling the RS to replay the load on mis-speculation. (There still needs to be a mechanism to redo the loads if they can read L1d and enter the load buffers out-of-order. I had been ruling out that speculation as well.) – Nataline
@PeterCordes There is an event called LOAD_DISPATCH.RS on Westmere, which counts "Loads dispatched that bypass the MOB." I remember one of the Intel patents I read describing a MOB scheduling logic that chooses between loads in the load buffer and loads from the RS. If the load buffer is empty, a load dispatched from the RS can bypass the load buffer and be directly issued to the L1D/DTLB, but a load buffer entry is still allocated for it. So loads need not go to the LB first and then be issued from the LB. This design probably saves one cycle of load latency. – Joanne
Loads dispatched from the MOB, rather than bypassing it, can be counted using LOAD_DISPATCH.MOB. I bet these events are still supported on later microarchitectures. I'm not sure how they can be useful for perf tuning, though. – Joanne
A well-written recent paper discussing machine clears (in light of speculative execution exploits): vusec.net/projects/fpvi-scsb – Florella
Thank you! Your answer explains the problem very well. I'm still unsure why the solution is to watch for modifications of [Y] instead of watching for modifications of [X]: TSO will not be violated if [X] isn't modified between the read of [Y] and the read of [X] (even if [Y] was modified meanwhile). I guess the answer is: in general, in a more complicated case, we don't even know the address X yet, because it wasn't yet possible to evaluate it. My second guess is: even if we know address X, it might be difficult to snoop changes to something which is not in the cache. Could you please help me understand this? – Contumacy
