Globally Invisible load instructions
Can some load instructions never be globally visible, due to store-to-load forwarding? To put it another way: if a load instruction gets its value from the store buffer, it never has to read from the cache.
Since it is generally stated that a load is globally visible when it reads from the L1D cache, the ones that do not read from L1D should be globally invisible.

Sweepings answered 30/5, 2018 at 16:56 Comment(14)
You might want to clarify what you mean by global visibility of loads, and/or clarify if there is any underlying question that prompted this one. For example, it would help answerers if you explained what the difference between a globally visible load and one that is not means to you.Lyautey
It might be useful to know that terms like global visibility (e.g., of stores) are helpful abstractions to understand the guaranteed ordering and visibility properties of an architecture, but at the uarch level or even electrically many operations that are conceptually globally visible never actually become visible to any other agent. Consider, for example, a CPU that gets a cache line in the M state, and makes many writes and reads to it before eventually relinquishing the line. None of those accesses, other than the stores that affect the final state of the cache line ...Lyautey
... will ever be "observed" or "visible" by other agents: indeed, no evidence of them ever left the local CPU! Now in principle those intermediate stores could have been observed if an invalidation came in at the right time, but by construction one didn't. Still, we might say those intermediate store became visible at some point, e.g., when they were written to L1 (since at this point an RFO from another core will see the write as the line is written back). I'm not aware of a similar concept for loads, however - which is why clarification could be useful.Lyautey
As your answer suggested, it was for semantic clarity. I happened to see an answer from PeterCordes on a different thread where he suggested that a load is globally visible when it reads from L1D. Now these discussions have got me to a state where I have more questions on load-load reordering.Sweepings
here [ stackoverflow.com/questions/38034701/… ] is the post I was referring to.Sweepings
Thanks joz! It's often good to include a quote from the answer that caused you to have another question, so responders can have some context. I understand now what prompted the question. I think Peter just omitted discussing SLF since the question was about the L1, but I think I would have written those two sentences without reference to load visibility, something like: "Loads get their value from a globally visible store via the L1 cache, or via store-forwarding from a local store. MFENCE eliminates the second case by waiting for local stores to become GV before reading the L1."Lyautey
@BeeOnRope: I believe SFENCE; LFENCE in that order can also help avoid store-load forwarding, instead of MFENCE.Sweepings
No, sfence does more or less nothing, and lfence is just an execution barrier, but doesn't affect the store buffer, so after those finish, you'll still (probably) have entries in the store buffer. sfence and lfence are more or less irrelevant for memory ordering with WB memory, except for weakly ordered stuff like NT stores. lfence kind of has a "second life" as an execution barrier: useful for serializing execution, e.g., for benchmarking, nothing to do with memory ordering.Lyautey
I think it is better to raise this in a separate thread. As per Intel SDM 11.5, SFENCE drains store buffers. That makes me believe SFENCE; LFENCE could be used.Sweepings
You're the one who raised it, if you want to start another thread, go ahead, but I'll reply to your comment here. Section 11.5 in which volume? What is the title of the section? Anyways, the manual is clear that sfence doesn't drain store buffers (it may evict write combining buffers but that's another thing entirely). For example in 11.4.4.3 Vol 1 it says Note that the sequences LFENCE;SFENCE and SFENCE;LFENCE are not equivalent to MFENCE because neither ensures that older stores are globally observed prior to younger loads. I.e., it doesn't drain the store buffers. You can test it too.Lyautey
Sorry, it is section 11.10 in Volume 3. I see similar threads on the subject. I do not have comment privileges on other threads yet :). If I start another thread, it would be a duplicate of [those](stackoverflow.com/questions/37452772/… )Sweepings
Interesting, indeed it's right there: It also insures that the contents of the store buffer are always drained to memory in the following situations ... (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores. I'm pretty sure it has got to be wrong though (e.g., obsolete and not updated), although I suppose it's possible that it does drain the store buffer, but doesn't prevent store-load reordering in general because later loads can still pass it? On AMD it executes 4 per cycle so we can be quite sure it isn't draining the store buffer!Lyautey
Which is why I feel, if the manual is right, SFENCE + LFENCE can also handle the store-load forwarding case that we were dealing with earlier. I am not saying that SFENCE+LFENCE is equivalent to MFENCE in general, but in this case, perhaps both sufficeSweepings
That would be very interesting if true, but I strongly doubt it. If sfence really drained the store buffer, then it seems to me that sfence; lfence would act as an "mfence", since all earlier loads must complete before the lfence, and if the store buffer is empty after it, it ends up being just as effective as an mfence, blocking both store-load reordering (the "store buffer" reordering) and the SLF reordering. Certainly an interesting idea though!Lyautey

The concept of global visibility for loads is tricky, because a load doesn't modify the global state of memory, and other threads can't directly observe it.

But once the dust settles after out-of-order / speculative execution, we can tell what value the load got if the thread stores it somewhere, or branches based on it. This observable behaviour of the thread is what's important. (Or we could observe it with a debugger, and/or just reason about what values a load could possibly see, if an experiment is difficult.)


At least on strongly-ordered CPUs like x86, all CPUs can agree on a total order of stores becoming globally visible, updating the single coherent+consistent cache+memory state. On x86, where StoreStore reordering isn't allowed, this TSO (Total Store Order) agrees with program-order of each thread. (I.e. the total order is some interleaving of program order from each thread.) SPARC TSO is similarly strongly ordered.

(Correctly observing the global order of your own stores relative to other stores requires mfence or similar: otherwise store-forwarding means you can see your own stores right away, before they become visible to other cores. x86 TSO is basically program-order plus store-forwarding.)

(For cache-bypassing stores, global visibility is when they're flushed from private write-combining buffers into DRAM. Intel Line Fill Buffers or any equivalent private write-combining mechanism where store data is still not visible to other CPUs is effectively part of the store buffer for our reordering purposes.)

On a weakly-ordered ISA, threads A and B might not agree on the order of stores X and Y done by threads C and D, even if the reading threads use acquire-loads to make sure their own loads aren't reordered. i.e. there might not be a global order of stores at all, let alone having it not be the same as program order.

The IBM POWER ISA is that weak, and so is the C++11 memory model (Will two atomic writes to different locations in different threads always be seen in the same order by other threads?). But the mechanism in practice on POWER is that (retired aka graduated) stores become visible to some other cores before they become globally visible by committing to L1d cache. Cache itself really is coherent even in POWER systems, like all normal CPUs, and allows sequential-consistency to be recovered with barriers. These multiple-order effects only happen due to SMT (multiple logical CPUs on one physical CPU) providing a way to see stores from other logical cores without going through cache.

(One possible mechanism is letting other logical threads snoop non-speculative stores from the store buffer even before they commit to L1d, only keeping not-yet-retired stores private to a logical thread. This could reduce inter-thread latency slightly. x86 can't do this because it would break the strong memory model; Intel's HT statically partitions the store buffer when two threads are active on a core. But as @BeeOnRope comments, an abstract model of what reorderings are allowed is probably a better approach for reasoning about correctness. Just because you can't think of a HW mechanism to cause a reordering doesn't mean it can't happen.)

Weakly-ordered ISAs that aren't as weak as POWER (in practice and/or on paper) still do reordering in the local store buffer of each core, if barriers or release-stores aren't used, though. On many CPUs there is a global order for all stores, but it's not some interleaving of program order. OoO CPUs have to track memory order so a single thread doesn't need barriers to see its own stores in order, but allowing stores to commit from the store buffer to L1d out of program order could certainly improve throughput (especially if there are multiple stores pending for the same line, but program order would evict the line from a set-associative cache between each store. e.g. a nasty histogram access pattern.)


Let's do a thought experiment about where load data comes from

The above is still only about store visibility, not loads. Can we explain the value seen by every load as being read from global memory/cache at some point (disregarding any load-ordering rules)?

If so, then all the load results can be explained by putting all the stores and loads by all threads into some combined order, reading and writing a coherent global state of memory.

It turns out that no, we can't: the store buffer breaks this. Partial store-to-load forwarding gives us a counter-example (on x86, for example). A narrow store followed by a wide load can merge data from the store buffer with data from the L1d cache from before the store becomes globally visible. Real x86 CPUs actually do this, and we have real experiments to prove it.

If you only look at full store-forwarding, where the load only takes its data from one store in the store buffer, you could argue that the load is delayed by the store buffer. i.e. that the load appears in the global total load-store order right after the store that makes that value globally visible.

(This global total load-store order isn't an attempt to create an alternative memory-ordering model; it has no way to describe x86's actual load ordering rules.)


Partial store-forwarding exposes the fact that load data doesn't always come from the global coherent cache domain.

If a store from another core changes the surrounding bytes, an atomic wide load could read a value that never existed, and never will exist, in the global coherent state.

See my answer on Can x86 reorder a narrow store with a wider load that fully contains it?, and Alex's answer for experimental proof that such reordering can happen, making the proposed locking scheme in that question invalid. A store and then a reload from the same address isn't a StoreLoad memory barrier.

Some people (e.g. Linus Torvalds) describe this by saying the store buffer isn't coherent. (Linus was replying to someone else who had independently invented the same invalid locking idea.)

Another Q&A involving the store buffer and coherency: How to set bits of a bit vector efficiently in parallel?. You can do some non-atomic ORs to set bits, then come back and check for missed updates due to conflicts with other threads. But you need a StoreLoad barrier (e.g. an x86 lock or) to make sure you don't just see your own stores when you reload.


Proposed definition: A load becomes globally visible when it reads its data. Normally from L1d, but the store buffer or MMIO or uncacheable memory are other possible sources.

This definition agrees with x86 manuals which say that loads aren't reordered with other loads. i.e. they load (in program order) from the local core's view of memory.

The load itself can become globally visible independently of whether any other thread could ever load that value from that address.

Although perhaps it would make more sense not to talk about "global visibility" of cacheable loads at all, because they're pulling data from somewhere, not doing anything with a visible effect. Only uncacheable loads (e.g. from an MMIO region) should be considered visible side-effects.

(On x86, uncacheable stores and loads are very strongly ordered, so store-forwarding to an uncacheable load is, I think, impossible. Unless maybe the store was done via a WB mapping of the same physical page that the UC load is accessing.)

Milker answered 31/5, 2018 at 6:29 Comment(65)
BTW, this answer isn't x86-specific. I'm just using x86 as an example, because I know how it works in detail. I think the key points all apply to other architectures, and the fact that even strongly-ordered x86 has these reorderings (but doesn't allow LoadLoad reordering) makes it a good example.Milker
To be clear, I don't think you need to invoke partial store forwarding to show how the strong model is broken: you don't even need to invoke store forwarding! Just having a store buffer already causes store-load reordering, by delaying stores, even without SLF. Then adding store forwarding (without any partial stuff) adds yet another type of reordering: where loads may see their own stores out of order with respect to the global order (this doesn't fit neatly in the 2x2 LoadLoad-type reorderings: you need a separate rule for it).Lyautey
@BeeOnRope: For a while I thought that a store + reload from the same address would end up effectively being a StoreLoad barrier (enforced by memory-order mis-speculation pipeline stalls where needed), because of the wording in Intel's manuals, and because I "knew" that x86 doesn't allow loads to reorder with other loads. More importantly, it's partial SLF that breaks my idea of there being a total order of loads and stores that all cores agree on, so I thought that was an interesting thing to think about in terms of what load global visibility means. See the end of this answer.Milker
Again, I don't think you need to use partial SLF to break the total order, regular SLF does that just fine (the example which shows that in the Intel manual is 8.2.3.5 and uses only regular non-partial SLF and is pretty much called "store forwarding can happen"). So you either consider that this particular rule is an exception to the "no load-load" reordering part, or you consider the stores reordered instead and the loads still in their program order, but the two ways of thinking about them are equivalent.Lyautey
In the abstract, it's probably easier to think of it this way: there is a global total store order. This is, for example, the order always observed by an agent that is not making any stores. Now every thread also has a program order of loads. Each load, in order, either (a) receives its value in a consistent way from the total store order or (b) receives its value from a local store. By "consistent way" I just mean that if one load receives its value from T100 (arbitrary label with higher numbers meaning later), the next load will receive its value from T100 or later.Lyautey
Also, all of the various no-reordering rules apply to this "alignment" of the per-CPU program read order (PPRO). So for example, a load followed by a store which ends up at T200 in the GO, will receive its value from earlier than T200. No LoadLoad as above means loads take their values (if from the GO) from non-decreasing store labels. No StoreStore just means that all stores get a GO label consistent with their program order relative to other stores.Lyautey
So all the 2x2 reordering rules are about case (a), reading from the GO - but in case (b) you use a different rule, which is that you are allowed to read your local store out of order, in particular with a later label than surrounding loads that used the GO. Note that you don't need to try to create a global order of loads here: just the natural per-CPU order. It's only meaningful to say that one load came before another if it took its value from a store with an earlier label.Lyautey
The above is all assuming aligned single-size loads, but it extends naturally to the overlapping cases: here you can have (a) and (b) occur in the same load. The part coming from (a) will still be atomic, but overall, sure, you might see a value that never existed in the GO, because part of your read came from the local order. In a way this is no more weird than two separate loads from the two parts: you would again see the same apparent mis-ordering relative to the GO. So while I think partial store forwarding is interesting, I don't see it as doing anything fundamentally different than normal SLF.Lyautey
@BeeOnRope: sorry, I didn't fully read all of your comments, but I think you missed the key point that I wasn't claiming that the total order of loads and stores interpretation included the standard x86 rule against any LoadLoad reordering. i.e. that if you use that model, your reordering rules have to include the store buffer. That's why it takes partial SLF to break it. Like I commented under your answer just now.Milker
@PeterCordes " all CPUs can agree on a single global total order of stores and loads by all CPUs that write or read the single coherent global state of memory. I.e. all the load results can be explained by putting all the loads and stores by all threads into some order." In other words, a sequentially consistent view at one particular memory address. Which is what a fully coherent cache would ensure, thus making the cache invisible to the programmer. Then it is true for any architecture regardless of how weak its memory model is.Sweepings
@PeterCordes I believe this is your point. If I take any cache line and represent the data present in the cache line as C(t1), C(t2), ..., C(tn) at times t1, t2, ..., tn, any aligned load instruction would get its value from only one of those cache states. For example, a 64-bit load would get the entire 64 bits from either C(t1) or C(t2), but it would never get, say, 32 bits from C(t1) and the other 32 bits from C(t2). Partial store-to-load forwarding breaks, or at least appears to break, the above assumption. A followup question: how about unaligned loads that span multiple cache lines?Sweepings
@joz: A sequentially-consistent view of memory would imply that the global total order was consistent with program-order of every thread, i.e. just interleaving them somehow, rather than shuffling them together. I was proposing fully coherent, but not seq cst. And a very weak memory model might allow thread A and thread B to disagree about which order two stores by a third thread (or by two other threads) happened in, even if the reading threads used barriers / acquire-loads for both reads. This would violate the global total order model.Milker
@joz: A cache-line split load or other non-atomic load is exactly equivalent to two atomic loads in unknown order, then merge. Any reordering would be possible.Milker
@PeterCordes Please note that I said sequentially consistent view at one particular memory address. I did not meant multiple memory locations. Since stores to the same address would not be reordered within a thread, I went ahead with the term sequentially consistent if you look at only one memory locationSweepings
@PeterCordes - I see, I didn't actually understand your answer then (I didn't, for example, know what you meant by the "How can you tell the difference?" part - the difference between what?). So I think I understand now: you are, above the first ---, talking about a model with a single global order of loads and stores: this order may be more or less constrained depending on the allowed reorderings (e.g., no StoreStore reorder means stores appear in program order, same for LoadLoad and the other 2), right? Then you say that partial SLF is the way to show this isn't true, right?Lyautey
@PeterCordes I think I agree with Bee here, but I'm not sure I understand your answer. "Except that's not true, the store buffer breaks this model" But why did you make that statement? x86 is not sequentially consistent anyway. "According to that interpretation/definition of load order" Not sure interpretation of what. The second half of your answer looks good to me.Jilli
I think I see what you're saying now - when you discuss this global load+store order, you aren't making it respect the various reordering rules. So even on x86, with no LoadLoad reordering, you allow loads to appear out of program order in this model, right? Then yes, in that case, you could "explain" SLF via load-load reordering. In fact, you can explain any series of values observed by any processor, since with arbitrary load-load reordering each CPU can pick any value it wants without restriction from the store part of the global order!Lyautey
The problem is that model is way too weak: it has way more flexibility than the actual model allows and so you can "disprove" it simply by noting that most of the "negative" (can't happen) litmus tests in the Intel or AMD manuals (for example, it's not restricted to x86) disprove it, and that you couldn't program against this model. You need a way to restrict the ordering so that it incorporates all the interesting restrictions in the memory model, which in x86 is "reorderings other than StoreLoad are banned, but store-forwarding is allowed".Lyautey
@Lyautey How can SLF allow load-load reordering?Jilli
@HadiBrais - see the examples I mentioned on your answer. Neither allowed result can be explained by loads occurring in-order against a global total store order. That's true even with StoreStore reordering. So you need some rule to explain that result (this is what Intel got wrong for many years, and how their CPUs allowed reorderings that their manual didn't!). That rule could be saying that LoadLoad reordering is in fact possible (kind of what Peter is doing above the first break, IIUC) - but this is now too weak and allows many reorderings that do not occur in practice.Lyautey
So the solution is to have narrower rule that explicitly treats stores from the same CPU differently: take their values in-order from the global order or from the local store buffer (forwarding). This second case causes the load to appear to have executed out of order with respect to surrounding loads that used the GO, but this is restricted to the store forwarding case. In practice, many concurrent algorithms are not affected by store forwarding, so it is important that the effect is narrowed like this. BTW, I really recommend the x86-TSO paper. It covers this in detail.Lyautey
@Lyautey Hmm, this seems to be something I don't have a solid understanding of. I need to do more research on that.Jilli
@HadiBrais - you're in good company! The paper I linked and its predecessor show how both Intel and the academics who were trying to understand this got it wrong for years (the paper above is a follow-on to an earlier paper, x86-CC, which didn't capture the semantics correctly). One could argue that Intel is still a bit unclear about the store-forwarding part of their memory model, since they say, for example: Any two stores are seen in a consistent order by processors other than those performing the stores but then they don't detail how it works for those performing the stores!Lyautey
Later on in 8.2.3.5 they do clarify it a bit: The memory-ordering model allows concurrent stores by two processors to be seen in different orders by those two processors; specifically, each processor may perceive its own store occurring before that of the other. (emphasis mine). It's not clear that the emphasized part is normative though: one might assume it's just an artifact of this particular test (to be clear, I believe it is guaranteed to work like that: cores performing the stores will only perceive a re-ordering that makes their store appear earlier, never later).Lyautey
@Lyautey Well, it does sound a like a guarantee to me too.Jilli
@HadiBrais - yes, I think it's a guarantee that is allowed - but I mean it doesn't guarantee a different case can't occur: e.g., that of each processor observing its own store after the other. Indeed, all of the "positive" cases (cases which end in "<final state...> is possible") just show one possible re-ordering, they don't exclude others. Only the negative cases do that. AFAIK we don't have a negative test for this case.Lyautey
@BeeOnRope: ok, got back to this and finished my edit. I think it's clearer now what parts are fact and what point I'm trying to make about where load data comes from, and why partial SLF is interesting here and supports my description in the last section. And BTW, are any ISAs so weakly-ordered that threads A and B might not agree on the order of stores X and Y done by threads C and D? Or do weakly-ordered ISAs still always have a global order, even if it doesn't match program order? Or is there a division between weak and very-weak ISAs? If you don't know, I may ask a new question.Milker
@PeterCordes - your last question is a bit tricky to answer. Do you mean stores to different locations X and Y? I'd assume yes, but if you meant to the same location then I'm not aware of any such ISA, since it wouldn't really be coherent: different orders to the same location would presumably mean either that the "final permanent value" for that location would be disagreed on forever by C and D, which basically means memory is non-coherent, or the almost equally weird effect that one of the threads would see the writes in both orders, i.e., see one write twice: e.g., WA, WB, WA.Lyautey
Let's assume you are talking about different locations though. It's still difficult to get at a precise question about whether an "underlying" total order exists (which I think is what you are asking). Your scenario is basically the IRIW test. So are you asking if threads C and D will always agree on a global order without memory barriers anywhere in A-D? The answer to that is trivially no since load-load reordering can make them disagree (even if there is an "underlying" total order). What about the scenario where you can insert whatever barriers you want? I think in that case ...Lyautey
... you can always insert enough barriers to get full sequential consistency, including seeing IRIW in a consistent order on any sane ISA. If it were not possible, I think some basic concurrent idioms would be impossible to implement (basically if you can't recover SC for arbitrary sequences of reads and writes you are screwed). Maybe you mean something more subtle, like "if you only add barriers on the reading threads so they observe the underlying global order (if one exists)" will they see a consistent order? Or maybe you are asking if the cache subsystem is always sequentially coherent?Lyautey
@BeeOnRope: Yes, I meant with barriers on the reading side, with thread A and thread B both doing acquire-loads of two cache lines. Those locations are stored to by two different threads (one store each), so no store barriers will make any difference and there's no synchronization between the two threads doing the stores. And yes of course I'm talking about a platform where you can use barriers to get seq-cst or acq-rel.Milker
@PeterCordes - sorry for the confusion about the store barrier: I was talking about IRIW (which only has one store per thread), but I was also thinking about another example where T1 makes two writes, and then the question is whether T2 and T3 always see those writes in a consistent order (which may be the reverse of program order in T1), and whether such is possible without any barriers on the writing side. In fact, the IRIW example isn't that good for answering this question: we know that on any platform that allows recovery of SC (all of the interesting ones) ...Lyautey
... you can prevent IRIW reordering, and the barriers must go between the reads (there is nowhere else for them to go, after all, since they should go between pairs of memory accesses), and so a platform with only one type of barrier will necessarily either (a) not need them at all (e.g., x86) or (b) need them between both reads. In the latter case, how can you distinguish between a non-SSO (single-store order) and an SSO implementation with load-load reordering? The one barrier kills load-load reordering, so it's hard to say if the IRIW effect was due to load-load or non-SSO.Lyautey
Here SSO is a term I made up: "Single Store Order" to refer to what I think you are asking about: SSO platforms have a single total global order of stores, and CPUs reading from this order all agree on the order - but the order isn't necessarily consistent with program order of stores on each thread (e.g., the local store buffers may not commit in order). The question now is what litmus test would reveal the difference? IRIW doesn't seem like a good one, w/o looking at detailed barrier semantics. The answer seems to be that yes POWER (in practice) and ARM (in theory) are both non-SSO.Lyautey
As it turns out, academia already apparently has a name for this: "multiple-copy atomic" means the same thing as SSO. Basically, in a model that breaks down the hardware ordering model into two parts - the local CPU, which might do various local re-orderings, and a memory subsystem that the local CPU reads and writes to - this concept only deals with the latter, and means (just like SSO) that the memory subsystem proceeds through a series of states (where each state reflects 1 or more new writes) atomically: so all threads agree on this order when reading from the memory subsystem.Lyautey
ARM (architecturally, perhaps actually) and POWER (actually) are not multi-copy atomic (MCA) in this sense, so if one directly maps the abstract concepts to actual hardware (CPU -> core, memory subsystem -> cache & memory + coherence protocol) one might imagine that it must mean that their cache coherence protocol isn't sequentially coherent, i.e., uses some weakening of MESI. As it turns out, it seems like this is not the case: the non-SSO/MCA behavior seems to arise from local effects across hardware threads. So the (actual) cache subsystem is still doing its thing ...Lyautey
... but from the point of view of two logical (which is all ISA models and software memory models care about, generally) CPUs C and D, writes from other CPUs may appear in an inconsistent order (i.e., no SSO), and this happens because logical CPUs may see writes from sibling hardware threads (same core) earlier than those from other physically distinct cores. Basically it makes the "senior store" case relevant versus the "junior store": I believe on IBM chips that HW threads can snoop committed-but-not-globally-visible stores from their siblings.Lyautey
This has the interesting effect that memory barriers on the read side can still solve this problem "locally" (without off-core coherence), if you clarify "locally" to mean "within the physical core" rather than "within the logical CPU" (the latter concept doesn't really make sense from a hardware perspective), since the "fix" is for the barrier to just flush the store buffer (across all sibling threads), since that will prevent the store-forwarding of cross-thread senior stores. Ultimately, this means that the answer to your question falls kind of in the middle:Lyautey
From an ISA/software perspective, it seems that these platforms are non-SSO, since even if you prevent local read-reordering (e.g., with an lwsync - which is an execution fence like lfence - or a load-load address dependency), you can see the same two stores in different orders across threads. However, the actual hardware (at least in POWER) mechanism still ends up being core-local, and so the barriers can still be implemented at the core level against a "plain MESI" sequentially coherent memory subsystem. One could certainly imagine a world where you got non-SSO behavior from an effect ...Lyautey
... that wasn't core-local: e.g., some buffering or queuing of MESI protocol messages at the socket level, or allowing the cache coherence protocol to reorder requests. This is also allowed by the POWER and ARM ISAs, but I'm not aware of it occurring and the paper I'll link in a moment seems to think it doesn't. You'd still be required to be able to get back to SC, so at least some barriers of this type would need to be non-local: interacting with the cache coherence protocol to prevent the reordering, and thus could take hundreds rather than tens of cycles.Lyautey
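To make the multi-copy-atomicity question concrete, here is a quick brute-force check (Python sketch; all names are invented) that enumerates every sequentially consistent interleaving of the IRIW litmus test (two writers to independent locations, two readers reading them in opposite orders) and confirms that the "readers disagree on the store order" outcome has no SC explanation. That is exactly why observing it on a machine (as POWER can) demonstrates non-SSO/non-MCA behavior:

```python
# Enumerate all sequentially consistent interleavings of IRIW:
#   W1: x=1 | W2: y=1 | RA: r1=x; r2=y | RB: r3=y; r4=x
# and collect the possible (r1, r2, r3, r4) outcomes.
from itertools import permutations

ops = [("W1", 0), ("W2", 0), ("RA", 0), ("RA", 1), ("RB", 0), ("RB", 1)]

def sc_outcomes():
    results = set()
    for order in permutations(range(6)):
        # Keep per-thread program order for the two readers.
        if order.index(2) > order.index(3):  # RA's two loads
            continue
        if order.index(4) > order.index(5):  # RB's two loads
            continue
        mem = {"x": 0, "y": 0}
        regs = {}
        for i in order:
            tid, _ = ops[i]
            if tid == "W1":
                mem["x"] = 1
            elif tid == "W2":
                mem["y"] = 1
            elif i == 2:
                regs["r1"] = mem["x"]
            elif i == 3:
                regs["r2"] = mem["y"]
            elif i == 4:
                regs["r3"] = mem["y"]
            else:
                regs["r4"] = mem["x"]
        results.add((regs["r1"], regs["r2"], regs["r3"], regs["r4"]))
    return results

# The non-MCA outcome: RA sees x's write first, RB sees y's write first.
print((1, 0, 1, 0) in sc_outcomes())  # False: no SC interleaving explains it
```

The check formalizes the point above: even with all local read-reordering forbidden, the outcome (1, 0, 1, 0) cannot be produced by any single total order of writes, so hardware that produces it is not multi-copy atomic.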
Most of what I know about ARM and POWER comes from this paper, which I recommend highly. It covers both platforms in great detail, with a mixed focus on ISA-guaranteed (i.e., abstract) behavior and a description of the hardware that in practice might lead to these re-orderings, and it includes very useful litmus tests. In the past I was kind of against mixing hardware-level reordering talk (e.g., talk of store buffers, forwarding, out-of-order execution) with the abstract ISA-guaranteed hardware memory model, since I thought that if the important part ...Lyautey
... was the latter, focusing on the former was just a distraction that could lead you to wrong conclusions and (especially) to arguments about behaviors that weren't going to be visible anyway (i.e., arguing about what hardware really does in order to determine if some code is safe). After finally sorting out how x86 works, and reading this paper with respect to ARM and POWER, I think an abstract hardware model is actually quite useful: especially with complex barrier types, it's often much easier to explain the behavior in terms of some type of abstract machine than it is to try to ...Lyautey
... describe all the possible re-orderings totally abstractly, as evidenced even by the ongoing confusion over SLF on x86, and over whether the absence of LoadLoad reordering could lead to a way to implement a StoreLoad barrier. If you instead describe the x86 model as a sequentially consistent memory subsystem acted on by CPUs that have store buffering and store forwarding, everything is clear (at least for WB memory :))! It doesn't actually matter if hardware really works that way: x86, for example, doesn't just get StoreLoad because of the store buffer: it also may simply execute loads earlier than stores in an ...Lyautey
OoO way, but the effect is (I think) indistinguishable from the store buffering delaying of loads. This "abstract hardware" model really starts to shine when talking about ARM and POWER, which are much weaker, and have a lot of barrier types. You can actually understand the various barrier effects (lwsync is an execution barrier, isb stops speculation across branches, etc) in terms of the execution of this abstract machine, which is often much easier than trying to put it all in a mathematical model (accurate but hard to reason about) or in a litmus-test based model (possibly incomplete).Lyautey
@joz: BTW, I made a major update to my answer to avoid the confusion the first version created.Milker
@BeeOnRope: Thanks for all your comments. I was going to post a new question so you could turn that into an answer, but there's already an abstract C++11 version of my exact question: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. The accepted answer says POWER can do that reordering with both threads using acquire loads (in opposite orders), so we can rule out loadload reordering within the readers.Milker
@PeterCordes - well, I'd argue that your question was about hardware, and that question is about the C++ memory model, which you can just answer with reference to the language in the spec, regardless of what happens on hardware. But the answer goes above and beyond by explaining that this reordering can happen on POWER, consistent with my current understanding, so it's probably as close to a clear version of your question as we are going to get.Lyautey
@BeeOnRope: yeah. I started typing up an answer there which explains the possible HW mechanism (stores becoming visible to other threads on the same core before becoming globally visible), as a point of interest because the very weak C++ model allows it. I'll move some of this answer there and link to it, because the off-topic stuff about that got too large. (Or if you want to write up your comments as an answer, let me know.)Milker
Sounds good. I don't have plans to turn this into an answer; it was more an investigation I wanted to do anyway to clear some things up for myself. BTW, the OP found an interesting quote in Vol 3 which says that sfence flushes the store buffers, which is news to me! The quote could just be wrong, or maybe it really does. Of course, flushing the buffers itself doesn't do much without other control over OoO execution, but combined with lfence it implies a full fence I think (which is one reason it doesn't seem likely it actually flushes the buffers).Lyautey
@PeterCordes: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? Don't you think this is possible with DEC Alpha as well?Sweepings
@joz: IDK. I don't think any implementations had SMT, so although on paper the ISA looks like it allows it, probably no hardware did it. I should maybe edit that to make it clearer that ARM is just an example, and POWER+ARM isn't meant to be an exhaustive list. Alpha still uses MESI so there's only one global domain, I think. We're talking about threads using load fences to make sure their loads sample their coherent view of memory in a known order, so extra-weak load-ordering rules for the unfenced case is irrelevant.Milker
@PeterCordes: I was referring to different physical cores and loads not using fences. Based on the explanation of how DEC Alpha can appear to reorder dependent loads, it occurred to me that multi-copy-atomicity is not maintained in Alpha (without fences, of course) even if the CPU executes the load instructions in order.Sweepings
@PeterCordes: Alpha still uses MESI so there's only one global domain, I think. As I understand, for DEC Alpha, this global domain argument is valid only with proper use of fences. To put it another way, Alpha guarantees multi-copy-atomicity only with use of fences.Sweepings
Ok, sure, but that's just local load reordering. That Q&A was asking about independent loads anyway, so most weakly-ordered ISA allow that local load reordering. Without load fences, you can't tell the difference between local load reordering vs. global store-order differences. It seems you're not saying anything more interesting than "Alpha needs fences when reading stuff", which is already a well-known fact. So do other ISAs. For independent loads, Alpha is not special in this regard.Milker
@PeterCordes: I was trying to go to the root cause of the reordering. Typically, reorders happen because the CPU actually does the 'load' out of order in the pipeline. But in case of Alpha, even if the CPU somehow happen to do the load in-order without fences, it is still possible that the loads might appear as out of order due to the cache coherence problem. I am trying to focus more on the exact 'point of reorder' - within pipeline, or writes to L1D or cache coherence happening out of order.Sweepings
@joz: ok sure, it's interesting to learn about some of the different mechanisms that can cause memory-reordering on real microarchitectures, if you're curious about HW. To write correct programs, assume anything's possible unless (in asm) you have an ISA guarantee about ordering, or in C++, what the spec says. But yes, on the few Alpha models with dependent-load reordering because of split cache banks or whatever it was, I think you'd call it a local load-reordering effect that happens in L1d. I don't know of any CPUs that do value-prediction, but that can make weird stuff happen, too!Milker
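The Alpha behavior under discussion is the classic message-passing litmus test (P0: data=1; flag=1; P1: r1=flag; r2=data, with an address dependency between the two loads in the real test). A quick enumeration (Python sketch, names invented) confirms that no sequentially consistent interleaving yields r1==1, r2==0; so when an old Alpha produced that outcome via its split cache banks, it was a genuinely non-SC effect even if the core issued the loads in order:

```python
# SC interleavings of message-passing:
#   P0: data=1 ; flag=1        P1: r1=flag ; r2=data
# ops: 0: data=1 | 1: flag=1 | 2: r1=flag | 3: r2=data
from itertools import permutations

def sc_mp_outcomes():
    outs = set()
    for order in permutations(range(4)):
        if order.index(0) > order.index(1):   # P0 program order
            continue
        if order.index(2) > order.index(3):   # P1 program order
            continue
        data = flag = 0
        r1 = r2 = None
        for op in order:
            if op == 0:
                data = 1
            elif op == 1:
                flag = 1
            elif op == 2:
                r1 = flag
            else:
                r2 = data
        outs.add((r1, r2))
    return outs

print(sorted(sc_mp_outcomes()))            # [(0, 0), (0, 1), (1, 1)]
print((1, 0) in sc_mp_outcomes())          # False: forbidden under SC
```

Seeing the flag but stale data is therefore a local reordering effect in L1d (split banks servicing the two loads at different times), not something any single-copy SC memory could produce.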
@PeterCordes Yes, exactly. Coming to value prediction, do you agree with this answer? (stackoverflow.com/questions/16983321/…)Sweepings
@joz: it doesn't explain any mechanism by which that could happen. But sure, value prediction is one way a core could reorder a dependent load and guess that it knew a useful value to use for the load result of a load whose address input wasn't ready yet. It's not a very good answer, but yeah memory ordering (including at compile time) doesn't have to respect causality.Milker
@PeterCordes At least on strongly-ordered CPUs like x86, all CPUs can agree on a total order of stores becoming globally visible --> Except writing cores that could SLF. True only if an mfence is usedMichele
@DanielNitzan: As discussed in another comment, that statement should be understood to imply that any readers control for local reordering if they want to observe their own stores in the global order. IDK if I want to edit this in every answer I've ever said it in, especially when this answer already goes into detail on the effects of store forwarding. Did you find my answer misleading, or are you worried about other future readers?Milker
@PeterCordes Yes, there are couple more answers with this statement. I pointed this out for future readers, but it might just be a nitpick :)Michele
@DanielNitzan: In this answer, I decided it makes more sense than anywhere else to be explicit about this nit. Since store-forwarding is kind of the whole point of the answer, connecting that TSO paragraph with the later points is a good thing. If you're going to comment on future questions, maybe leave a link to this or How does Intel X86 implements total order over stores for future readers as part of the comment..Milker
@DanielNitzan: Maybe even phrase it in a way that doesn't imply the answer needs changing, so it works as a note to future readers on its own, especially where it's not buried too deep in a comment thread.Milker
@PeterCordes makes total sense, will do !Michele
@DanielNitzan: Thanks. And if you see anything that really does need editing, please do let me know in a comment (and/or leave a suggested edit). I just wanted to avoid having to make more edits than necessary. :PMilker
J
2

Let me expand the question a little bit and discuss the correctness aspect of implementing store-load forwarding. (The second half of Peter's answer directly answers the question I think).

Store-load forwarding changes the latency of the load, not its visibility. Unless it got flushed due to some misspeculation, the store is eventually going to become globally visible anyway. Without store-load forwarding, the load has to wait for all conflicting stores to retire. Then the load can fetch the data normally.

(The exact definition of a conflicting store depends on the memory ordering model of the ISA. In x86, assuming the WB memory type, which allows store-load forwarding, any store that is earlier in program order and whose target physical memory location overlaps that of the load is a conflicting store).

Although if there is any concurrent conflicting store from another agent in the system, that might actually change the value loaded because the foreign store may take effect after the local store but before the local load. Typically, the store buffer is not in the coherence domain, and so store-load forwarding may reduce the probability of something like that happening. This depends on the limitations of the store-load forwarding implementation; there are usually no guarantees that forwarding will happen for any particular load and store operations.

Store-load forwarding may also result in global memory orders that would not have been possible without it. For example, in the strong model of x86, store-load reordering is allowed, and together with store-load forwarding it may allow each agent in the system to view all memory operations in different orders.

In general, consider a shared memory system with exactly two agents. Let S1(A, B) be the set of possible global memory orders for the sequences A and B with store-load forwarding and let S2(A, B) be the set of possible global memory orders for the sequences A and B without store-load forwarding. Both S1(A, B) and S2(A, B) are subsets of the set of all legal global memory orders S3(A, B). Store-load forwarding can make S1(A, B) not be a subset of S2(A, B). This means that if S2(A, B) = S3(A, B), then store-load forwarding would be an illegal optimization.
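As a sketch of how forwarding can enlarge the outcome set (i.e., put an order into S1(A, B) that is not in S2(A, B)), here is a toy abstract machine (Python; an illustration, not a model of any real CPU) with a per-CPU FIFO store buffer and a forwarding toggle. For the litmus test P0: x=1; r1=x; r2=y and P1: y=1; r3=y; r4=x (test 8.2.3.5 in the Intel SDM, n6 in the x86-TSO paper), the outcome r1=1, r2=0, r3=1, r4=0 is reachable only with forwarding enabled; without forwarding, a load to a buffered address is assumed to stall until the store drains:

```python
# Explore every event interleaving of a 2-CPU machine where each CPU has a
# private FIFO store buffer. Events: execute the next instruction, or drain
# the oldest buffered store to memory. With forwarding=False a load whose
# address is in its own store buffer must wait for the drain.
from collections import deque

P0 = [("st", "x", 1), ("ld", "r1", "x"), ("ld", "r2", "y")]
P1 = [("st", "y", 1), ("ld", "r3", "y"), ("ld", "r4", "x")]

def outcomes(forwarding):
    # state: (pc0, pc1, sb0, sb1, mem, regs), all hashable tuples
    init = (0, 0, (), (), (("x", 0), ("y", 0)), ())
    seen, work, finals = {init}, deque([init]), set()
    while work:
        pc0, pc1, sb0, sb1, mem, regs = work.popleft()
        if pc0 == len(P0) and pc1 == len(P1) and not sb0 and not sb1:
            finals.add(tuple(sorted(regs)))
            continue
        nexts = []
        for c, pc, sb, prog in ((0, pc0, sb0, P0), (1, pc1, sb1, P1)):
            if pc < len(prog):
                op = prog[pc]
                if op[0] == "st":
                    nexts.append((c, pc + 1, sb + ((op[1], op[2]),), mem, regs))
                else:
                    _, reg, var = op
                    buffered = [v for a, v in sb if a == var]
                    if buffered:
                        if forwarding:  # forward youngest buffered value
                            nexts.append((c, pc + 1, sb, mem,
                                          regs + ((reg, buffered[-1]),)))
                        # else: the load stalls until the store drains
                    else:
                        nexts.append((c, pc + 1, sb, mem,
                                      regs + ((reg, dict(mem)[var]),)))
            if sb:  # drain the oldest buffered store into memory
                a, v = sb[0]
                nmem = tuple((k, v if k == a else old) for k, old in mem)
                nexts.append((c, pc, sb[1:], nmem, regs))
        for c, pc, sb, nmem, nregs in nexts:
            ns = (pc, pc1, sb, sb1, nmem, nregs) if c == 0 else \
                 (pc0, pc, sb0, sb, nmem, nregs)
            if ns not in seen:
                seen.add(ns)
                work.append(ns)
    return finals

n6 = tuple(sorted([("r1", 1), ("r2", 0), ("r3", 1), ("r4", 0)]))
print(n6 in outcomes(forwarding=True))    # True: in S1
print(n6 in outcomes(forwarding=False))   # False: not in S2
```

This is the concrete instance of S1 not being a subset of S2: each CPU reads its own store early via forwarding, before either store is globally visible, so the two CPUs end up disagreeing about which store happened first.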

Store-load forwarding may change the probability of each global memory order to occur because it reduces the latency of the load.

Jilli answered 30/5, 2018 at 23:57 Comment(15)
The bolded part seems definitely wrong. A processor which admits re-orderings due to store-forwarding will certainly exhibit memory orders that are possible on that system but impossible on a stricter system without store-forwarding. As a fairly trivial example, consider two systems with fully coherent caches, one with a store buffer and store forwarding, and one without either. Neither system reorders load or store execution relative to each other. The second system will behave as sequentially consistent; the first will not, and will have many more possible memory orders.Lyautey
For example, in the first system, the "Dekker algorithm failure" litmus test of 8.2.3.4 in Vol 3, Intel SDM is possible on the first system (as it is on x86), but not on the second system. In this test, each thread writes to a distinct memory location, and then reads from the memory location written by the other thread. In a sequentially consistent system that reordering is not possible. Store forwarding and store buffer absolutely affect the possible reorderings, which is why systems like x86 are often described semi-formally as "total store order with store buffering (forwarding implied)".Lyautey
@Lyautey Let's keep the discussion focused on store-load forwarding (SLF) only. Consider two systems, one that uses SLF and one that doesn't. SLF has no effect on the example from 8.2.3.4, so I'm not following what you're saying. Note that the question is only about SLF.Jilli
So you only want to compare a system with SLF and a store buffer against another with a store buffer but no SLF? You can't "just" talk about SLF, since the possible reorderings come from both SLF and the presence of a store buffer (and also other sources), so talking about SLF in complete isolation is meaningless. Anyway, even in that comparison, SLF causes additional reordering versus that which comes purely from a store buffer (see test 8.2.3.4).Lyautey
@Lyautey It seems to me that the question is specifically about the effects of SLF on memory ordering. I've not carefully thought about what happens when we compare with store buffer vs. without store buffer. You might be right about that. But my answer intentionally does not discuss that. "SLF causes additional reordering versus that which comes purely from a store buffer." How? Can you provide an example?Jilli
Yes, the example is: without store-to-load forwarding (and assuming a strongish memory model that doesn't allow load-load reordering), two loads from any CPU will see stores to those locations in a consistent order. With store forwarding, a CPU that wrote one of those two locations can see its own store artificially early. There are many litmus tests that cover this. 8.2.3.5 (sorry, meant to paste that above, not 8.2.3.4, but was switching between mac & linux and got the copy key mixed up :)) in the Intel manual is one where each CPU sees its own load as coming first.Lyautey
@Lyautey I've got to go man, sorry. Let's continue the discussion later. I've not read your last comment or answer yet.Jilli
Other examples involve only one CPU doing forwarding and this CPU seeing stuff out of order with respect to what other CPUs see, but it's the same basic thing. The reordering caused by store-forwarding is confusing, because it doesn't fit into the usual 2x2 matrix of StoreLoad type orderings: for example, Intel says there is no load-load reordering, but the reordering caused by SLF looks exactly like load-load reordering (but you could also just say the stores were locally reordered).Lyautey
@Lyautey Your example seems to be consistent with everything written in my answer. Maybe I didn't understand it. Even if inconvenient, let's be very precise here. Let's use symbols to denote memory accesses and memory ordering.Jilli
Just look at the example 8.2.3.5 in the Intel guide or example n6 in the x86-TSO. Those are both examples caused by store-to-load forwarding. This is a reordering that wouldn't occur in most designs that were the same but didn't have SLF. It is a direct contradiction of the bolded part. That aside, maybe you should make your bolded part more precise: if you mean that "there exists a theoretical processor design without SLF which could exhibit the same reorderings that exist on a design with SLF", then sure - anything is possible!Lyautey
I could certainly hand-wave a CPU design that did something weird, including purposely reordering stuff to satisfy any claim. In general, as a practical matter though, the 8.2.3.5 and n6 cases are directly caused by store-to-load forwarding.Lyautey
@Lyautey Nice. Well, it's time for me to fix it.Jilli
BTW, I think the allowed state in 8.2.3.5 is missing the condition r1 == 1 and r3 == 1 in addition to the r2 == 0 and r4 == 0 condition. The prose before and after the example implies it, and the example is kind of useless without it (it becomes the same as 8.2.3.4 except with an additional load whose result is never examined). The n6 test in the paper doesn't have this problem.Lyautey
@Lyautey I think they only say that r1 == 1 and r3 == 1 in that particular example. But in general, the load may get a value written by some other processor, not necessarily the one written by a store that immediately precedes it.Jilli
Well they don't say r1 == 1 and r3 == 1 at all unless my viewer is rendering it wrong (what do you see?). Of course other orderings are possible, including cases where the value read is not the value written in the immediately preceding instruction, but those aren't too interesting.Lyautey
K
2

A load is dispatched from the RS (Reservation Station) and goes through the AGU (Address Generation Unit) to the load buffer entry that was allocated for the corresponding ROB (Reorder Buffer) entry at the allocate stage. When the load buffer entry was allocated, it was coloured with the most recent SBID (store buffer ID) at that time. Coloured means the entry number (aka. ID) of the most recent store in the store buffer is inserted into the load buffer entry. The store buffer comprises the SAB (Store Address Buffer) and SDB (Store Data Buffer); each store has an entry in both (because each store is 2 uops, usually microfused) and they both have the same index (entry no aka. SBID).

I think once the address is valid, the valid bit in the entry then gets set, meaning they are ready to dispatch (and is cleared when the data is eventually written back to the ROB).

There is also a speculative memory disambiguation predictor, which may be involved in setting the valid bit to indicate that the load is predicted not to alias with any stores between the SBID it is coloured with and the tail-pointer store in the store buffer (store address in the SAB and the data in the SDB). If it is predicted to alias, or actually does alias, it does speculative store-to-load forwarding using the data in the SDB and inserts the data in the load buffer, and the load is completed in the LB (Load Buffer) but does not retire from the LB. (To check for actual aliasing, it searches the store buffer for an address, uses the bitmask in the SAB to determine whether the entry can satisfy the load (the bitmask states the privilege level of the bytes, supervisor / non-supervisor), and uses the size implied by the opcode to get the range of addresses being stored to by the store operation; if the load can be satisfied, it reads from the SDB entry.) Store-to-load forwarding ensures that reads can never be reordered with older writes to the same location, because the read will always use store-to-load forwarding. I think all store addresses before an LFENCE's SBID need to be calculated before making a prediction on a load after an LFENCE.
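The search-and-forward decision described above can be sketched as a toy model (Python; the names, sizes, and the linear youngest-first scan are all invented for illustration, as real hardware uses CAM structures and checks privilege bitmasks as well): walk from the load's coloured SBID backwards, forward only when a store's byte range fully covers the load's, and fall back to a slow path on partial overlap.

```python
# Toy store-buffer search: youngest-first scan of stores older than the
# load's colour; forward only on full coverage of the load's byte range.
def can_forward(store_addr, store_size, load_addr, load_size):
    """True if the store's bytes fully cover the load's bytes."""
    return (store_addr <= load_addr and
            load_addr + load_size <= store_addr + store_size)

def search_store_buffer(entries, colour_sbid, load_addr, load_size):
    """entries: list of (sbid, addr, size, data). Returns the forwarded
    bytes, 'partial' (overlap without coverage: slow path), or None
    (no hit: read from L1d)."""
    for sbid, addr, size, data in sorted(entries, reverse=True):
        if sbid > colour_sbid:
            continue                     # younger than the load: ignore
        overlap = not (addr + size <= load_addr or
                       load_addr + load_size <= addr)
        if not overlap:
            continue
        if can_forward(addr, size, load_addr, load_size):
            off = load_addr - addr
            return data[off:off + load_size]
        return "partial"                 # overlapping but not covering
    return None

# 8-byte store at 0x100; a 4-byte load at 0x104 is fully covered.
buf = [(3, 0x100, 8, b"\x01\x02\x03\x04\x05\x06\x07\x08")]
print(search_store_buffer(buf, 3, 0x104, 4))   # b'\x05\x06\x07\x08'
```

With a colour SBID of 2, the same load would skip the sbid-3 store entirely and read from L1d; a 4-byte load at 0x106 would overlap without being covered and take the slow path.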

If it is not predicted to alias, the load is dispatched (loads are always dispatched in strict order with respect to other loads, unless the load has a non-temporal hint or is to USWC (Uncacheable Speculative Write Combining memory type) memory, although, unlike stores, it doesn't know whether or not it is USWC at this stage). The load goes to the dTLB (data TLB) / L1d (L1 data cache) in parallel.

At any time, when a store address completes in the SAB with any SBID less than or equal (taking wrap-around into account) to the coloured SBID of the load in question, it can invalidate the memory disambiguation prediction that was made, and the pipeline is flushed, because the pipeline is now either using stale data stored before the store it should have performed store-to-load forwarding with, or using false store-to-load forwarding data from a store that it actually had no dependency on.

When the data is loaded into the designated physical destination register, the data becomes valid in the ROB. When the data in the ROB is valid and a retirement pointer is pointing to the entry, the load is no longer speculative and acquires a senior bit. The load can then retire from (be removed from) the LB if a bit is set that indicates all stores between the SAB tail pointer and the coloured SBID have had their addresses calculated. Unless it is a senior load instruction, in which case it can now execute, since it is senior and has retired from the ROB.

LFENCE is dispatched to the load buffer and only executes (is sent to the L1d cache) when all previous uops have retired from the ROB and all previous load instructions have retired from the ROB+LB (given the instruction-stream serialising properties it is claimed to have, it is probably retired in a cycle on its own, rather than in the same cycle as 1 or 2 other instructions before it in the ROB). Load instructions retire when the ROB tells them they can retire (no longer speculative), the data fetched is valid, and the load is no longer memory-speculative. LFENCE dispatches when it is at the tail of the load buffer and ROB (it cannot retire until all read buffers are globally visible; I think this means it makes sure that any senior load instructions, i.e. instructions that execute after retirement from the ROB once they get marked as senior, such as PREFETCH, have allocated read buffers. Regular loads allocate read buffers and read their data, which becomes valid in the load buffer, before they can be retired.)
Globally visible in this case means all previous read LFBs (Line Fill Buffers) have received globally visible notifications from the ring for the line (which could come before the read response containing the data, or could be packaged into the read response, which may mean that it has to wait for all reads to complete as opposed to just being acknowledged). (Of course, instructions that have already retired from the MOB (Memory Order Buffer) are globally visible, as their data has returned, but senior load instructions might not have allocated read buffers yet or had them acknowledged to be globally visible.) This is similar to the definition of globally visible stores, where in response to an RFO (Read For Ownership), the global observation for the LFB likely comes in the notification that the core has permission (exclusive access) on the line and other cores have been invalidated, which will come before the actual data for the line to write is returned to the core, assuming that this will always be written back before responding to a snoop where it loses permission on the line. When LFENCE dispatches, the L1d cache treats it as a nop and it completes, retires in the ROB, becomes senior, i.e. is removed from the LB, and uops before it in the load buffer that were prevented from dispatching to the L1d cache are now allowed to be dispatched.

Global visibility of loads does affect the cache coherence state of other cores, which I think is why LFENCE requires loads to be globally visible. A load miss in the core goes to the LLC (Last Level Cache), which has a snoop filter showing which other cores own the line. If one or more cores own the line, then it needs to downgrade those cores to an S state and cause any modified data to be written back. The data written to the LLC can then be returned to the requesting core with an S state and a globally visible notification. If a load miss in the core instead misses the LLC, the LLC might send a globally visible notification immediately while sending the request to the home agent to fetch it from memory (or, if it is a multi-socket system, the LLC has to wait for acknowledgement from the home agent that it does not need to snoop other cores before it can send the globally observable notification to the core).

I think a senior load is a load that is no longer speculative and is waiting for the data to be returned and become valid, or it is already valid so instantly retires, whereas a senior load instruction is an instruction that dispatches after it has been retired from the ROB.

Kirkpatrick answered 29/1, 2021 at 6:8 Comment(62)
What do SAB/SBD/SBID stand for? Also any further reading the coloring algorithm to invalidate mispredicted loads? Aka what does "SBID it is coloured with and the tail pointer store in the SAB/SBD" mean?Popery
@Popery I need to add references to this. I made the answer more legible for now. This model is almost complete but there are still unknowns and it is still a work in progress, a combination of many patents and brainstorming; the only thing left to do is microbenchmark or ask people at Intel. I should clarify what is known for certain and what part is guesswork. A lot of these details come straight from P6 patents, but I essentially extrapolate and write in the context of roughly the Sandy Bridge client architecture.Kirkpatrick
Two other questions. Re: "loads are always dispatched in strict order with respect to other loads" are you saying if we have load AB, and address of A is not ready but address of B is ready B will not start checking dTLB / L1 until address of A is ready? Second. Re: "when store addresses complete in the SAB with any SBID less than or equal to the coloured SBID" what if the load is predicted to alias? IIRC it will serialize but not pipeline flush.Popery
@Popery okay I just read the answer and it took me a while to work out what I meant. I don't know what I'm going to get stuck on when reading the notes until I actually forget and need the notes myself, so only then do I know how the notes should have been... 1) wait 2) if it's predicted to alias with a store and it doesn't actually alias with the store, then uops that use the loaded data are falsely using the forwarded data from a store, so the pipeline needs to be flushed and the instructions after it reexecuted with the correct dataKirkpatrick
The flush may actually just be a ROClear and restart from the IDQ if the uops are still in the IDQ – I saw some patent on that; otherwise, it flushes and resteers the whole pipeline and restarts at the 'next IP' stage. I don't think it can restart from the IQ anymore now that some of the uops can also be from the uop cache.Kirkpatrick
Also you should just Google search -> tools ->verbatim and then type 'Intel patent <acronym>' and it will bring up the relevant patents for any acronym or acronym combination. 1) you're asking if it doesn't check TLB until all load addresses before it are ready and have been dispatched. This is correct. It would stall, it doesn't jump the pipeline. There's another part to this answer here for stores: https://mcmap.net/q/14956/-how-do-the-store-buffer-and-line-fill-buffer-interact-with-each-otherKirkpatrick
The benefit of the RS is that there could be a very long execution (like a load that misses the cache) that still has an entry, but operations before it can execute and retire from the RS and remain in the ROB waiting for 2nd retirement stage, meaning further instructions can execute. When a load dispatches from the RS it goes to the AGU, then the load buffer and prediction logic sends it to the TLB/L1d and the lookup happens in parallel (unlike with stores). When the data returns, it frees the entry in the RS, but not the ROB/LB. ROB/LB resources are retired when the load becomes seniorKirkpatrick
The uops may not dispatch from the RS in program order, because RS scheduler uses pLRU I think, but the LB entries were allocated in order by the allocator, so the logic that deals with the load buffer will enforce the correct order of load buffer entries entering the dTLB/L1d. I say benefit of the RS, but it's not a benefit, it's just one of the reasons why it pays off to have a smaller RS and not a 1:1 extension to the ROB.Kirkpatrick
Just thinking about it also, not all uops in the ROB actually go to the RS, zero idioms for instance, nop, sig event uops from the front end, and I think INTn is just an event uop which triggers a microcode routine. Also, only so many uops can be executing at one time anyway, so it makes sense to make the RS smaller. In the RS, the entries are marked with the ports they need to be dispatched on when they are allocated. I think each port has a scheduler.Kirkpatrick
Loads definitely go out of order on modern x86. The MOB enforces that the out of orderness isn't actually visible (e.g. that load-load reordering is enforced) by tracking the program ordering of the loads and taking a nuke if an invalidate is received that would make a mis-ordering visible. @PoperyLyautey
@Lyautey are you saying that the loads AB case the order than A and B go to L1 / dTLB can infact be reordered or something else?Popery
@Popery No he's saying that loads execute out of order (in the AGU) but because the corresponding entries in the MOB (SAB/SDB/LB) are allocated (and filled in with the ROB entry no.) by the allocator where the uops are in program order, the MOB knows the true order of the loads for when they're actually dispatched to the memory hierarchy, because they're in sequential LB entries. It's possible that LB entries are tagged with the destination PRF number and not the ROB entry (unlike fence or stores, which have no dest register) but then you'd need to have ROB no.tagged to PRF regs, ,seems a wasteKirkpatrick
@Popery - yes, that's what I'm saying. The loads can execute out of order, including the path to memory. That is, in the AB case the B load can have obtained its value ("completed") from somewhere in memory before A even starts. As long as the B line stays in the L1D until the load for A is complete, this reordering is not observable. I.e. B went first but we can guarantee its value did not change until A completed so there is no way to tell. This type of reordering is key to high performing strongly ordered CPUs like x86.Lyautey
@BeeonRope In your post on memory disambiguation did you run any tests to see if you could trick the predictor into pairing loads with the wrong stores? Aka LOAD0, STORE1, STORE0 where LOAD0 and STORE0 alias but STORE1 is independent. Did LOAD0 take data from STORE1 then machine clear? If not any idea of the mechanism that pairs loads to stores. Your writeup made it sound like loads alone are just predicted to alias/not-alias in that determined if they could hoist or not, nothing about pairing w/ stores.Popery
@Popery - I don't recall doing that (but it's possible I did). You are right that I treated the predictors as applying to loads, i.e., "can this load be hoisted" rather than to load-store pairs, i.e., "which stores might this load hit". It is possible the predictor can also predict pairs, or there is a separate predictor for this, since it would improve performance to have this information and also open up opportunities such as "memory renaming" which in fact does appear in Ice Lake.Lyautey
The usual mechanism that pairs stores to loads is the store buffer itself. I.e., loads "search" the store buffer for the newest-but-still-older-than-the-load store which the load "hits" (i.e., its bytes overlap the load's). If there's a hit, the store is forwarded to the load (or if the hit is only partial, a slower path has to be taken).Lyautey
Now, the addresses for all preceding stores might not be known, so it might not be possible to determine, at the moment the load is ready to execute, whether there is a hit against an older store. The safe approach is to wait until there are no such stores left (all addresses are known), and the aggressive approach is to speculate that there are no hits and execute the load right now. Making that decision is where the predictor comes in.Lyautey
The way this "store buffer search" happens in practice is also interesting: it's going to be too slow to search entry by entry for a matching address so in practice there are some CAM-like structures, and to make this cheaper only a subset of the address is used, so you can get false hits and all sorts of other stuff. Check out the "Speculative Store Bypass" paper which describes in some detail how the store buffer is searched and ways to fool it.Lyautey
@Lyautey I see, so in essence if a store is predicted to alias, whether you get a machine clear should be based on which is ready first, the address for the store or the data for the store? Or is this path optimized s.t. even if the address is ready it's still going to speculatively take data from the matched store? But something about this feels weird to me. I can buy that in state 2) when the predictor is in use loads are actively being matched to stores, but state 3) seems like the fallback when the predictor has failed, so it wouldn't make sense to still allow machine clears on bad pairing.Popery
So with the following setup I am able to consistently get ~5 machine clears: STORE0, STORE1, LOAD0, NOP256, STORE1, STORE0, LOAD0. AFAICT this should be machine clear per iteration if state 3) wasn't serializing and instead just had the load take from predicted store.Popery
@Popery - there's a lot there and I'm not sure I got all of it. My writeup suggests that the predictor is trying to predict whether given loads will need forwarding, so I'm not sure what you mean by "if a store is predicted to forward". The model I lay out there predicts on a per-load basis, not a load-store basis, whether forwarding will occur. It doesn't try to predict which store will forward to the load, only if the load will be subject to forwarding at all.Lyautey
You mention store data and address, and so there are a few scenarios. If the store address is ready before the load address, no aliasing prediction is needed at all: both of the addresses are known so the need to forward or not can be determined "exactly" (as I mentioned, there are also cases of false forwarding e.g., because this determination uses a subset of bits, but this is different from what the predictor is trying to determine). This is sometimes called "4k aliasing". The use of the term aliasing is quite confusing here because it \Lyautey
it is used both to refer to the case where a store (correctly) forwards to a load (the load aliases the store), and where you get false store forwarding, e.g., 4k aliasing, where the CPU initially thinks the store forwards to the load, based on a partial address match or address hash, but it actually does not forward. The predictors I'm talking about really deal with the first case and I didn't look into 4k aliasing much. I'll try to use "forward" rather than "alias" to avoid the ambiguity.Lyautey
Anyway, if the address of the store is known, the predictor is out of the picture: forwarding can be determined exactly (although it might get it wrong initially due to 4k or other false aliasing scenarios as mentioned). Here the forwarding occurs immediately if the store data is ready as well (this is the fastest scenario: a full store->load roundtrip takes only 3 cycles on SKL).Lyautey
If any older store address is not known (technically, older than the load but newer than any address-known store that does forward), then the load may or may not forward. Deciding whether it forwards or not is the job of the predictor. Note that the predictor doesn't really need to know anything about the stores! It just needs to know "hey, has the load at this address been observed to forward in the past"? If the predictor says "sometimes forwards" (even if this is very weak: forwarding 1 out of 1024 times is enough!), the load will wait. You don't need to predict which store it \Lyautey
forwards from, because you wait until the addresses are known and then the SB tells you as part of its usual operation which store forwards. Of course, you could imagine a more sophisticated predictor that also identified the likely forwarding store: this might let you forward earlier: when only that store was ready, you can forward, even if there are still earlier address unknown stores. Or, you could forward when the data was ready, even if the address wasn't. I didn't really find any evidence for or against that, since I wasn't looking for it.Lyautey
So to be clear, in state (2) I wasn't suggesting the predictor is matching loads to stores (not sure if that's what you were suggesting). In fact, it's the opposite: state 2 is saying that some loads are OK to execute now without worrying about address-unknown stores because they historically haven't required forwarding at all. Not sure I can answer the rest of your observations since they seem to stem from a misunderstanding of how I suggest the states work. Let me know if this makes any sense.Lyautey
I think all loads receive a prediction if there are unknown addresses before the corresponding store in the store buffer (I.e. before the SBID coloured in the load buffer, probably wraps around right up until the entry after it), otherwise no prediction needs to be made because all store addresses before it are known, so it either forwards using the most recent instance of that load address in the store buffer, or not at all if that load address isn't in the store buffer at all. It may receive a prediction instantly after the AGU, or when it's about to be dispatched to the TLB and abortsKirkpatrick
So the SBID coloured on the load entry is the SBID of the most recent store allocated by the allocator in the store buffer at the time of the load entry allocation. The allocator is the one that colours the load entry with this SBID when it allocates it. This is then used later on by the predictor as described. There might be stores up to and including store with this SBID which haven't even been dispatched from the RS yet, or might still be in the AGU. If so, a store buffer search by the predictor will reveal this and therefore the predictor needs to make a prediction on the loadKirkpatrick
These loads will be marked as speculative whereas ones where no prediction needed to be made will not be marked as speculative. Like non-speculative loads, speculative predicted-aliasing loads will then directly take the data out of the store buffer, whereas (like non-speculative) speculative predicted non-aliasing loads proceed to the TLB. If the store data for the predicted aliased store is not ready then it probably stalls and then doesn't have to flush if the prediction is invalidated in that time. I'd suspect predicted aliasing SBIDs are also marked on the load buffer entriesKirkpatrick
And then, when all store addresses complete between the coloured SBID inclusive and the predicted aliasing SBID inclusive, AND the predicted aliasing SBID indeed has the same address as the one in the load buffer entry, then the prediction is verified and the load becomes non-speculative, but is still speculative in the overall instruction flow (the definition of speculative I use in the answer), so really I mean memory-speculative, so there has to be another bit on the entry for thatKirkpatrick
The bit goes to 0 when it is valid, and then there's another bit for invalidation which is set by store addresses completing on the AGU. When store addresses complete from the AGU, the SBID of its store buffer entry is simply compared to the stated SBID range. If the address is in it, invalidate. If the address is not the same as the colored SBID, invalidate and trigger flush if the load is complete. If there are no more addresses left to complete in the SBID range, validate by removing speculative bit.Kirkpatrick
Also, the flush will be from (and possibly including) the load entry's ROB entry, and the instruction will resume (or restart, if it is also flushed). Now store to load forwarding starts again on that load entry, and it will of course use the new store buffer entry closer to it if that's what caused the invalidation, or it might now be going to the TLB if there are no SBIDs with that store address including and before the coloured SBID. It's unlikely that another prediction will need to be made at this point, as store addresses have had time to complete.Kirkpatrick
My guess would be that the load is flushed from the ROB and completely restarts at the IDQ or IP fetch. All store addresses that complete are compared against load buffer entry SBID ranges that are marked as memory-speculative and flush if it is marked as complete as well, i.e. it has completed speculatively. Actually you probably don't need to have an invalidate bit, just flush from that point. Also store addresses completing fill in their data into any entries that are predicted to alias with that SBID when the data wasn't readyKirkpatrick
I'm not sure how the entry gets validated when all store addresses in the range have completed addresses. I'm trying to think of some easy logic to that. Perhaps there's a check over the range every time a completed store address from the AGU hits the range. You still need a memory-speculative bit, because even though the main speculative bit cleared means it is not memory-speculative, you need store addresses completing in the AGU to only trigger flushes on entries that a memory disambiguation prediction was made on. It's implicit from the 2nd SBID but this might be SBID 0, so you need an extra bitKirkpatrick
Also maybe the load buffer is dynamically hooked up to the marked store buffer entries in the range, and like a state machine, when all store addresses in the range complete, it becomes valid. You've got to think about how that could be engineered in an electronic circuit, and think about the case when a store address completes in the same cycle that the store buffer is being searched when the prediction is made in the first place and its SBID is the only one in the range (coloured and predicted SBIDs are the same) of the load entry. What happens then.Kirkpatrick
@Lyautey Sorry, "store is predicted to forward" -> "load is predicted to forward". But re: "It doesn't try to predict which store will forward to the load, only if the load will be subject to forwarding at all". Yes that is what I thought. But that seems to run contrary to what @LewisKelsey wrote in answer to my above question so I am trying to figure out how that fits with what you had written. (Sorry for the lack of clarity!)Popery
@Popery - I think what Lewis said and what I'm saying are consistent. To be clear, Lewis is using "prediction" in a broader sense: not just to encompass the prediction of the memory disambiguation predictor that I discuss in my post, but also the result of searching the store buffer. The latter can also be called a prediction in a sense since it is non-exact (you can get false positives due to incomplete address matching and false negatives because of address-unknown stores).Lyautey
So what I am saying (and I think was Lewis is saying) is that first the "may alias?" disambiguation predictor is checked, keyed by the load IP. This predictor returns a result of either "doesn't alias 99.9% of the time" or "aliases 0.1% of the time or more" (note how conservative it is, and this is why it doesn't make sense for it to be tracking load-store pairs: it predicts "may alias" even in cases where the load rarely actually aliases, like when it aliases 1% of the time).Lyautey
If this predictor returns "this load doesn't alias" then the load can proceed even if there are prior address-unknown stores, and maybe you don't even search the store buffer at all (to save power). That's what Lewis is saying in the "setting of the valid bit to indicate that it is predicted to not alias..." part. Now if this predictor returns "may alias (at least 0.1% chance)", you need to search the store buffer. This may return a hit on the load address (Lewis describes this in detail), and this hit can also be seen as a prediction, since the hit is not exact (see the Spectre paper \Lyautey
I mentioned earlier for how the "loose net" and "fine net" etc operate). So even though the load gets data forwarded from the store, this might turn out to be wrong and a nuke is taken. Otherwise, there may be no hit, in which case the load would stall until there are no longer address-unknown stores that might forward to this load. There are also stall scenarios when there is a forwarding hit but the data isn't ready, etc...Lyautey
I hadn't seen that Lewis had also added a bunch of comments, but I think we are mostly in agreement. There might be some small details which we might not agree on (e.g., I'm not sure that forwards are only speculative when the predictor has allowed a load to hoist above address-unknown stores, because even a "hit" STLF is speculative in the sense that the matching is not exact). I'm not sure that matters anyway, b/c what use is this speculative bit? @PoperyLyautey
@Lyautey & LewisKelsey I really appreciate the time you guys spent on helping me figure this stuff out and sorry for the confusion earlier! Thank you!Popery
@Lyautey a byproduct of this might be that for say memcpy where there is false 4k aliasing (but no actual aliasing), if you could find a way so that the store addresses wouldn't be ready yet when the loads that alias with them search the SB, you could potentially bypass the stall (assuming the loads were predicted not to alias). Not sure of a practical way to implement it though.Popery
for the false-aliasing case I was able to get this to work. But it seems that if there is no false-aliasing the overhead required to delay AGU ends up becoming a bottleneck.Popery
@Noah, I am not sure that would work. Normally if a load goes ahead of an address-unknown store, and then the store address later resolves and is found to forward, you would take a machine clear (nuke) which is very costly. I guess it would be the same for false 4k aliasing: why would the store address coming second avoid the false aliasing? It all comes down to when the "deep" check is made that compares the full PA: unless the store-hit-load scenario goes right to the deep check I don't think this would work (it would be worse).Lyautey
One approach to avoiding 4k aliasing in memcpy is detecting it is likely to happen (based on the relative src/dst offsets) and then running the copy backwards. Another approach is unrolling the copy loop so much, loading into N registers then N stores, that the loads get "ahead" of the stores (modulo 4k) which also fixes it.Lyautey
@Lyautey Hmm, I was seeing a perf improvement with it in the aligning case, made a quick repo here. The way LewisKelsey made it sound was that if the load didn't find any stores to match with it would then use the aliasing prediction. If the prediction is to not alias and since 4k aliasing is not real aliasing, then there should be no machine clear and the load should be free to execute out of order. I do agree though that generally there are better solutions (I generally don't use backward though because it gets worse prefetch behavior).Popery
I think it is highly CPU dependent. The example in the README was done on my Tigerlake. As opposed to backward I think the best way to avoid 4k aliasing though is something along the lines of pipelining the loop as follows: load4567; loop_start; load0123; store4567; check_bounds_exit_to_store0123; load4567; store0123; check_bounds_exit_to_store4567. It works particularly well if there is computation for checking the loop boundary with say strcpy. And has better prefetch behavior if I understand Hadi's Answer Here.Popery
Maybe the check at load graduation is different than the check when it enters the LB and uses CAM to check the SB. Since the cost of a false-alias from CAM after graduation is a machine nuke (not just serialization) it would make sense to actually check the full address of the matches.Popery
"4k aliasing is not real aliasing, then there should be no machine clear" - that's what is not clear. If it was as simple as "not real aliasing" why would there be a problem in any scenario regarding 4k aliasing? Still you are right that the situation is asymmetric: normally 4k aliasing causes a stall of the load, not necessarily too serious, but a nuke is worse. So maybe before taking the nuke you do the full check, I'm not sure. It is certainly an interesting idea to delay the store address to avoid 4k aliasing.Lyautey
Yes, you can adjust the loads and stores which is what I was suggesting in "another approach" - however, this only goes so far (at least if I understand what you are suggesting): if you do 8 loads before 8 stores it works to avoid 4k aliasing if the store offset is up to 8 elements ahead of the load, but what if it is more?Lyautey
@Lyautey were you able to reproduce the benchmark? I found the amount of delay necessary was super volatile. Probably is different if you are on something pre-ICL with only 4-uop pipeline width (probably don't need one of the and statements). But re: "what if it is more?" well the same way you check for aliasing and if you have it go backwards, you check if (dst - src) % 4096 < 128, if so go to 8x pipelined mode, else do normal memcpy. It's not a generic memcpy, it's just imo the best way to handle the aliasing case.Popery
Volatile relative to length (Aka I needed less ALU for copy length = 4096 than copy length = 2048), not volatile in the sense that a working configuration was not reproducible on a given machine. But only played around with this on my Tigerlake and imagine different tuning will be necessary for different CPUs. One thing though is that the 4k aliasing case is easy to handle for the DRAM limited memcpy with page interleaving so should not actually depend heavily on the exact model of a specific micro-arch because L1/L2 times are standardized to clock cycles.Popery
@Lyautey Not seeing any machine nukes when running this benchmark. I imagine due to the speedup there must be some cases where the loads are going out of order with the stores. I would imagine from a design perspective, machine nukes are very expensive so it would make sense to actually verify the full address returned by CAM. Whereas serializing a single load might be considered an acceptable cost of the imprecision.Popery
@Popery - thanks for the info, yeah I guess it makes sense to do a full verify here. It would suck to take the nuke then find out it was false aliasing. About "what if there are more", I was thinking of the scenario where the dest is far enough ahead of the source that you can't buffer the difference in registers, yet maybe you'd still get 4k aliasing at that distance. E.g., 16 vectors ahead. If the stores commit "quickly" it seems this wouldn't be a problem: you'd have only a few stores in the store buffer, but if they start to back up there could be ~50 stores sitting around potentially \Lyautey
causing 4k aliasing? I did run into something like this in the past, when I was using this same approach to avoid 4k aliasing (buffer more loads in the case 4k aliasing was likely to happen) but I ended up just changing the buffer allocations to avoid the problem that way so never really needed to fully flesh out the "4k aliasing avoidance" stuff. I haven't run your benchmark yet...Lyautey
@Popery - I ran your benchmark. The 4k avoidance "worked" sometimes (it was intermittent) but was still slower than the version w/o avoidance. See this gist, for some more comments. There is a perf event that shows you the 4k aliasing count as shown there. The IPC of these tests is high, close to 4, so it is possible the ALIAS_SAFE = 1 case gets limited by some throughput limit, or a store limit (tiger lake can do 2 stores/cycle, while this SKL only one).Lyautey
@Lyautey hmm, would have to guess it's too much ALU. On SKL with only 1 load port OOE should be able to start the subsequent loads before the stores even get to the AGU. On the other hand maybe the 4-uop execution width makes ALU the bottleneck for the SKL version. Obviously this idea is not practical to implement in a real memcpy, but thought it was a cool concept nonetheless. Re: "I was thinking of the scenario where the dest is far enough ahead of the source". If (dst - src) % 4096 is large enough that you don't have enough registers then you won't get any 4k aliasing.Popery
You only need to be able to handle the edge case where 4k aliasing would be an issue, which should really only be occurring when it is feasible to pipeline (16 vectors gets you 512 bytes ahead. Generally I see 4k aliasing dying down around 256-384). Also generally think pipelining is the preferred solution just because according to Hadi's Research prefetchers favor forward access patterns.Popery
"then you won't get any 4k aliasing" - how do you know though? As I mentioned, you get aliasing when there are stores in the store buffer which alias in their bottom bits. Since the store buffer has 40+ entries, that's a lot of possible stores in the SB, more than the number of ymm registers. Now maybe for whatever reason the store buffer doesn't end up backing up (it almost certainly doesn't in your test b/c it's just repeatedly copying the same 4k region so draining will be quick as it's all L1D hits). For larger regions the SB may back up.Lyautey
I'm not saying you're wrong, just that it is plausible that 4k aliasing can happen over relatively long distances because the SB is large.Lyautey
