Can CPU Out-of-Order-Execution cause memory reordering?

I know that store buffers and invalidate queues can cause memory reordering. What I don't know is whether Out-of-Order-Execution can cause memory reordering.

In my opinion, Out-of-Order-Execution can't cause reordering, because the results are always retired in order, as mentioned in this question.

To make my question clearer, let's say we have the following architecture with a relaxed memory consistency model:

  • It doesn't have store buffer and invalidate queues
  • It can do Out-of-Order-Execution

Can memory reordering still happen in this architecture?

Does a memory barrier have two functions: one is forbidding out-of-order execution, and the other is flushing the invalidation queue and draining the store buffer?

Bailee asked 6/4, 2022 at 14:32 Comment(10)
An effective OoO machine would still want to do speculative loads, for instance. In-order retirement doesn't help because memory accesses don't have to become visible to other cores at the same time as they retire. – Palsgrave
@NateEldredge Could speculative loads also cause memory reordering? Also, could you please elaborate on why in-order retirement doesn't help? The second sentence is hard to understand. Thanks. – Bailee
Out-of-order execution of loads already is LoadLoad memory reordering, like @Nate said. The critical time for a load is when it reads from shared cache, unlike for stores, where it's after retirement, when they commit from the store buffer to cache. (x86 CPUs can only speculatively reorder loads, and then confirm the cache lines are still available to make sure LoadLoad ordering is obeyed.) – Chiao
A weakly-ordered ISA wouldn't need to check later; it could just let loads take data whenever, and even leave loads in flight after retirement, as long as they've been verified as non-faulting. That's what makes LoadStore reordering possible. How is load->store reordering possible with in-order commit? – Chiao
Re: StoreLoad reordering from OoO exec of a load before an earlier store, see the discussion on how are barriers/fences and acquire, release semantics implemented microarchitecturally? (the first couple of comments also discuss LoadLoad) – Chiao
@PeterCordes I find CPU out-of-order execution very difficult to understand completely, so let's put it aside temporarily and discuss a more straightforward question: if I have the architecture described in my question above, do we still need memory barriers? – Bailee
Is your "architecture" with no store buffer supposed to be an implementation of x86, and thus have to enforce LoadLoad ordering? If not, then LoadLoad reordering can trivially happen. But if it's an x86, we'd call it a "microarchitecture". Without a store buffer, I guess store-data operations would have to execute as they retired (and became non-speculative), writing directly to L1d cache. If you had a way to still do memory disambiguation and let later loads execute before a store retired, you could still get StoreLoad reordering (the only kind x86's memory model allows). – Chiao
@PeterCordes If it's not an implementation of x86, why does LoadLoad reordering happen so easily, given that there is no invalidate queue? How does the barrier prevent it? – Bailee
If it's an ISA that allows LoadLoad reordering, loads can execute in any order, just like in real AArch64 CPUs, for example, taking a value from L1d cache when they execute. Invalidate queues seem irrelevant. A memory barrier instruction that included LoadLoad ordering could work by simply waiting for all previous loads to execute before any later loads can. (Or even by blocking all later instructions from issuing from the front-end into the back-end; that would be a full barrier on a CPU like this if it drains the ROB before any later instruction can run, since there's no store buffer.) – Chiao
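
The LoadLoad barrier described in the comments (wait for all previous loads to execute before any later load can) can be illustrated with a toy scheduler. This is a hypothetical model, not any real CPU: every load issues as early as allowed, each has an assumed latency (`latency`, e.g. a cache miss vs. an L1d hit), and a load binds its value when its data arrives. The function name `bind_order` and the `"FENCE"` marker are made up for illustration.

```python
def bind_order(program, latency):
    """Toy model (hypothetical): return the order in which loads bind their values.

    program: loads in program order, plus the marker "FENCE".
    latency: assumed cycles until each load's data arrives.
    Without a fence, loads bind whenever their data arrives, so a fast
    later load can bind before a slow earlier one (LoadLoad reordering).
    A "FENCE" makes every later load wait until all earlier loads are done.
    """
    order, segment = [], []
    start = 0  # cycle at which loads in the current segment may issue

    def drain():
        nonlocal start
        finish = {name: start + latency[name] for name in segment}
        # sorted() is stable, so loads finishing on the same cycle keep program order
        order.extend(sorted(segment, key=lambda name: finish[name]))
        if finish:
            start = max(finish.values())
        segment.clear()

    for op in program:
        if op == "FENCE":
            drain()
        else:
            segment.append(op)
    drain()
    return order


latency = {"Load1": 100, "Load2": 4}  # assumed: Load1 misses in cache, Load2 hits
print(bind_order(["Load1", "Load2"], latency))           # ['Load2', 'Load1']
print(bind_order(["Load1", "FENCE", "Load2"], latency))  # ['Load1', 'Load2']
```

Stores are deliberately absent: on the store-buffer-less CPU from the question, a fence like this only has to hold back later loads, not drain anything.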

Yes, out-of-order execution can definitely cause memory reordering, such as load/load reordering.

It is not so much a question of the loads being retired in order as of when the load value is bound to the load instruction. E.g., Load1 may precede Load2 in program order, but Load2 gets its value from memory before Load1 does; if another core's store to the location read by Load2 intervenes (after Load2 binds but before Load1 binds), then load/load reordering has occurred.
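
The scenario above can be checked with a small, hypothetical two-core model. Core 0 runs Load1 (`r1 = x`) then Load2 (`r2 = y`); core 1 stores `y = 1` then `x = 1`. Each schedule is one global order in which loads bind and stores become visible. Enumerating every interleaving that respects both cores' program orders (i.e. sequential consistency) never yields `r1 == 1, r2 == 0`, but letting Load2 bind first does. The names `CORE0`, `CORE1`, `execute` and `interleavings` are illustrative only.

```python
from itertools import combinations

CORE0 = [("load", "r1", "x"), ("load", "r2", "y")]  # Load1, Load2 in program order
CORE1 = [("store", "y", 1), ("store", "x", 1)]      # another core's stores


def execute(schedule):
    """Run one global order of events; a load's value is bound when it executes."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, a, b in schedule:
        if op == "load":
            regs[a] = mem[b]
        else:
            mem[a] = b
    return regs["r1"], regs["r2"]


def interleavings(p, q):
    """Yield every merge of p and q that preserves each list's internal order."""
    n = len(p) + len(q)
    for slots in combinations(range(n), len(p)):
        it_p, it_q = iter(p), iter(q)
        yield [next(it_p) if i in slots else next(it_q) for i in range(n)]


sc_outcomes = {execute(s) for s in interleavings(CORE0, CORE1)}
# Out-of-order: Load2 binds first, both stores intervene, then Load1 binds.
ooo = [CORE0[1], CORE1[0], CORE1[1], CORE0[0]]

print(sorted(sc_outcomes))  # [(0, 0), (0, 1), (1, 1)]
print(execute(ooo))         # (1, 0): not reachable under sequential consistency
```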

However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.

In these systems, all loads are buffered until retirement, and if a possible store to the location of such a buffered but not yet retired load is detected, then that load and the instructions after it in program order are “nuked”, and execution resumes at, e.g., Load2.
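
The detect-and-repair idea can be sketched with the same two-core scenario: Load2 binds early and reads `y == 0`, then core 1 stores `y = 1` and `x = 1`. The class below is a hypothetical, heavily simplified load buffer, not Intel's actual design: bound loads stay in the buffer until retirement, and a snooped remote store to the address of a bound load discards ("nukes") that load and every younger one, which then re-execute (here, as a simplification, at retirement). Setting `snoop=False` models a CPU that buffers loads until retirement without the detection mechanism, or a speculative load to non-coherent memory that is never snooped.

```python
class LoadBuffer:
    """Hypothetical sketch of load buffering with snoop-based detection ("Freye's Rule")."""

    def __init__(self, program, snoop=True):
        self.program = program  # list of (register, address) in program order
        self.bound = {}         # program index -> value read, for bound but unretired loads
        self.snoop = snoop

    def bind(self, i, mem):
        _, addr = self.program[i]
        self.bound[i] = mem[addr]

    def on_remote_store(self, addr):
        """Called when another core's store to `addr` is observed (e.g. an invalidation)."""
        if not self.snoop:
            return
        hits = [i for i in self.bound if self.program[i][1] == addr]
        if hits:
            oldest = min(hits)
            # Nuke the hit load and everything younger in program order.
            for j in [j for j in self.bound if j >= oldest]:
                del self.bound[j]

    def retire(self, mem):
        """Retire in program order, re-executing loads that were never bound or were nuked."""
        values = []
        for i in range(len(self.program)):
            if i not in self.bound:
                self.bind(i, mem)
            values.append(self.bound[i])
        return tuple(values)


def run(snoop):
    mem = {"x": 0, "y": 0}
    lb = LoadBuffer([("r1", "x"), ("r2", "y")], snoop=snoop)  # Load1, Load2
    lb.bind(1, mem)          # Load2 executes early and reads y == 0
    mem["y"] = 1             # core 1: y = 1
    lb.on_remote_store("y")
    mem["x"] = 1             # core 1: x = 1
    lb.on_remote_store("x")
    return lb.retire(mem)    # (r1, r2)


print(run(snoop=True))   # (1, 1): Load2 was nuked and re-read y
print(run(snoop=False))  # (1, 0): the load/load reordering is visible
```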

I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.

That is, what matters is not buffering loads until retirement as such, but providing this detection and correction mechanism on top of that buffering. Many CPUs buffer loads until retirement but do not provide the detection mechanism.

Note also that this requires something like snoop-based cache coherence. Many systems, including Intel systems that have such mechanisms, also support noncoherent memory, e.g., memory that may be cached but whose coherence is managed by software. If speculative loads are allowed to such cacheable but noncoherent memory regions, the Freye’s Rule mechanism will not work and memory will be weakly ordered.

Note: I said “buffer until retirement”, but if you think about it, you can easily come up with ways of buffering not quite until retirement. E.g., you can stop this snooping when all earlier loads have themselves been bound, and there is no longer any possibility of an intervening store being observed, even transitively.
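
The relaxation above can be sketched as a rule for which bound loads still need snooping. The helper `snoop_set` is hypothetical and implements only the "all earlier loads have been bound" condition; the transitive-visibility caveat in the text is not modeled.

```python
def snoop_set(bound):
    """Return program-order indices of loads that must still be snooped.

    bound[i] is True once load i has read its value. Hypothetical rule:
    a bound load stops needing snooping once every older load is bound too.
    Unbound loads have nothing to protect yet, so they are never in the set.
    """
    still_snooped = []
    all_older_bound = True
    for i, is_bound in enumerate(bound):
        if is_bound and not all_older_bound:
            still_snooped.append(i)
        all_older_bound = all_older_bound and is_bound
    return still_snooped


print(snoop_set([False, True, True, False, True]))  # [1, 2, 4]
print(snoop_set([True, True, False, True]))         # [3]
print(snoop_set([True, True, True]))                # []
```

In the second call, loads 0 and 1 can leave the snoop/repair machinery even though load 2 has not bound yet, which is the kind of saving early retirement aims for.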

This can be important, because there is quite a lot of performance to be gained by “early retirement”, removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out-of-order hardware mechanisms.

Grapheme answered 6/5, 2022 at 16:42 Comment(0)
