Although the x86 memory ordering model does not allow loads to any memory type other than WC to be globally observable out of program order, the implementation actually allows loads to complete out of order. It would be very costly to stall issuing a load request until all previous loads have completed. Consider the following example:
load X
load Y
load Z
Assume that line X is not present in the cache hierarchy and has to be fetched from memory. However, both Y and Z are present in the L1 cache. One way to maintain the x86 load ordering requirement is by not issuing loads Y and Z until load X gets its data. However, this would stall all instructions that depend on Y and Z, resulting in a potentially massive performance hit.
Multiple solutions have been proposed and studied extensively in the literature. The one that Intel has implemented in all of its processors is to allow loads to be issued out of order and then check whether a memory ordering violation has occurred, in which case the violating load is reissued and all of its dependent instructions are replayed. Such a violation can only occur when both of the following conditions are satisfied:
- A load has completed while a previous load in program order is still waiting for its data and the two loads are to a memory type that requires ordering.
- Another physical or logical core has modified the line read by the later load and this change has been detected by the logical core that issued the loads before the earlier load gets its data.
When both of these conditions occur, the logical core detects a memory ordering violation. Consider the following example:
------                 ------
core1                  core2
------                 ------
load rdx, [X]          store [Y], 1
load rbx, [Y]          store [X], 2
add  rdx, rbx
call printf
Assume that the initial state is:
- [X] = [Y] = 0.
- The cache line that contains Y is already present in the L1D of core1, but line X is not present in any of the private caches of core1.
- Line X is present in the L1D of core2 in a modifiable coherence state and line Y is present in the L1D of core2 in a shareable state.
According to the x86 strong ordering model, the only possible legal outcomes are 0, 1, and 3. In particular, the outcome 2 is not legal, because it would mean that core1 observed the store to X without observing the earlier store to Y.
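The set of legal outcomes can be checked by brute force. The following is a minimal Python sketch, not a model of any real hardware: it enumerates every interleaving of the two cores' operations that keeps each core's program order and collects the sums core1 would print. The names `interleavings` and `legal_sums` are made up for this illustration. It assumes that ordered interleavings are enough to capture the x86 rules for this particular example, which should hold here because neither core performs a store followed by a load.

```python
from itertools import combinations

# Each operation is (kind, address, value); value is ignored for loads.
CORE1 = [("load", "X", None), ("load", "Y", None)]   # program order on core1
CORE2 = [("store", "Y", 1), ("store", "X", 2)]       # program order on core2


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(total)]


def legal_sums():
    """Return the set of X + Y sums core1 can compute when loads stay ordered."""
    sums = set()
    for schedule in interleavings(CORE1, CORE2):
        memory = {"X": 0, "Y": 0}   # initial state: [X] = [Y] = 0
        loaded = []
        for kind, addr, value in schedule:
            if kind == "store":
                memory[addr] = value
            else:
                loaded.append(memory[addr])
        sums.add(sum(loaded))
    return sums


print(sorted(legal_sums()))   # prints [0, 1, 3]
```

The sum 2 never shows up because no ordered schedule lets the load from X see 2 while the later load from Y still sees 0.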
The following sequence of events may occur:
- Core2 issues RFOs for both lines. The RFO for line X will complete quickly but the RFO for line Y will have to go all the way to the L3 to invalidate the line in the private caches of core1. Note that core2 can only commit the stores in order, so the store to line X waits until the store to line Y commits.
- Core1 issues the two loads to the L1D. The load from line Y completes quickly, but the load from X requires fetching the line from core2's private caches. Note that the value of Y at this point is zero.
- Line Y is invalidated from core1's private caches and its state in core2 is changed to a modifiable coherence state.
- Core2 now commits both stores in order.
- Line X is forwarded from core2 to core1.
- Core1 loads from cache line X the value stored by core2, which is 2.
- Core1 prints the sum of X and Y, which is 0 + 2 = 2. This is an illegal outcome. Essentially, core1 has loaded a stale value of Y.
To maintain the ordering of loads, core1's load buffer has to snoop all invalidations to lines resident in its private caches. When it detects that line Y has been invalidated while loads that precede the completed load from Y in program order are still pending, it flags a memory ordering violation and reissues the load, which then gets the most recent value. Note that if line Y is evicted from core1's private caches before it gets invalidated and before the load from X completes, core1 may not be able to snoop the invalidation of line Y in the first place. So there needs to be a mechanism to handle this situation as well.
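To make the check concrete, here is a toy Python model of the event sequence above. It is only a sketch under simplifying assumptions: `LoadEntry`, `LoadBuffer`, `snoop_invalidation`, and `run` are hypothetical names invented for this illustration, memory is a plain dictionary, and the eviction corner case is not modeled. It replays the same events twice, once with the snoop check disabled and once with it enabled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadEntry:
    addr: str                     # cache line the load reads
    value: Optional[int] = None   # None while the load waits for its data
    replay: bool = False          # set when a snoop marks the load for reissue


class LoadBuffer:
    def __init__(self, addrs, snoop_enabled):
        self.entries = [LoadEntry(a) for a in addrs]   # kept in program order
        self.snoop_enabled = snoop_enabled

    def complete(self, addr, value):
        """Data for a load has arrived, possibly out of program order."""
        for entry in self.entries:
            if entry.addr == addr:
                entry.value = value
                entry.replay = False

    def snoop_invalidation(self, addr):
        """Another core has taken ownership of the line at addr."""
        if not self.snoop_enabled:
            return
        for index, entry in enumerate(self.entries):
            older_pending = any(e.value is None for e in self.entries[:index])
            if entry.addr == addr and entry.value is not None and older_pending:
                entry.replay = True   # memory ordering violation detected

    def loads_to_reissue(self):
        return [e.addr for e in self.entries if e.replay]


def run(snoop_enabled):
    """Replay the example's event sequence and return X + Y as seen by core1."""
    memory = {"X": 0, "Y": 0}
    buffer = LoadBuffer(["X", "Y"], snoop_enabled)

    buffer.complete("Y", memory["Y"])   # load from Y hits in the L1D and sees 0
    buffer.snoop_invalidation("Y")      # core2's RFO invalidates line Y
    memory["Y"] = 1                     # core2 commits store [Y], 1
    memory["X"] = 2                     # core2 commits store [X], 2
    buffer.complete("X", memory["X"])   # line X is forwarded from core2

    for addr in buffer.loads_to_reissue():
        buffer.complete(addr, memory[addr])   # the reissued load sees fresh data

    return sum(entry.value for entry in buffer.entries)


print(run(snoop_enabled=False))   # prints 2, the illegal outcome
print(run(snoop_enabled=True))    # prints 3, a legal outcome
```

With the check enabled, the invalidation of line Y arrives while the older load from X is still pending, so the load from Y is marked and reissued after the stores have committed.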
If core1 never uses one or both of the values loaded, a load ordering violation may occur, but it can never be observed. Similarly, if the values stored by core2 to lines X and Y are the same, a load ordering violation may occur, but it is impossible to observe in this example, where only the sum is printed. However, even in these cases, core1 would still unnecessarily reissue the violating load and replay all of its dependencies.
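The equal-values case can be checked with a few lines of Python. This is a small sketch with a hypothetical helper, `sums`, that compares the sum produced by the stale load from Y against the sums that ordered loads can produce, assuming the same initial state of zero for both locations.

```python
def sums(store_y, store_x):
    """Return (legal sums, sum produced by the stale load from Y)."""
    # Ordered loads can see (X, Y) as: neither store, only the Y store, or both.
    legal_pairs = [(0, 0), (0, store_y), (store_x, store_y)]
    # The violation pairs the new value of X with the stale initial value of Y.
    violating_pair = (store_x, 0)
    return {x + y for x, y in legal_pairs}, sum(violating_pair)


for store_y, store_x in [(1, 2), (1, 1)]:
    legal, violating = sums(store_y, store_x)
    print(store_y, store_x, sorted(legal), violating, violating in legal)
```

With the original values 1 and 2, the violating sum 2 is not among the legal sums, so the violation is visible. When both stores write 1, the violating sum is 1, which is also a legal sum, so the printed result cannot reveal the violation.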