How do the store buffer and Line Fill Buffer interact with each other?

I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. It discusses how the Line Fill Buffer can leak data. The question "About the RIDL vulnerabilities and the 'replaying' of loads" discusses the micro-architectural details of the exploit.

One thing that isn't clear to me after reading that question is why we need a Line Fill Buffer if we already have a store buffer.

John McCalpin discusses how the store buffer and Line Fill Buffer are connected in "How does WC-buffer relate to LFB?" on the Intel forums, but that doesn't really make things clearer to me.

For stores to WB space, the store data stays in the store buffer until after the retirement of the stores. Once retired, data can be written to the L1 Data Cache (if the line is present and has write permission), otherwise an LFB is allocated for the store miss. The LFB will eventually receive the "current" copy of the cache line so that it can be installed in the L1 Data Cache and the store data can be written to the cache. Details of merging, buffering, ordering, and "short cuts" are unclear.... One interpretation that is reasonably consistent with the above would be that the LFBs serve as the cacheline-sized buffers in which store data is merged before being sent to the L1 Data Cache. At least I think that makes sense, but I am probably forgetting something....

I've just recently started reading up on out-of-order execution so please excuse my ignorance. Here is my idea of how a store would pass through the store buffer and Line Fill Buffer.

  1. A store instruction gets scheduled in the front-end.
  2. It executes in the store unit.
  3. The store request is put in the store buffer (an address and the data)
  4. An invalidate read request is sent from the store buffer to the cache system
  5. If it misses the L1d cache, then the request is put in the Line Fill Buffer
  6. The Line Fill Buffer forwards the invalidate read request to L2
  7. Some cache receives the invalidate read and sends its cache line
  8. The store buffer applies its value to the incoming cache line
  9. Uh? The Line Fill Buffer marks the entry as invalid


Questions

  1. Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?
  2. Is the ordering of events correct in my description?
Hoffert answered 9/4, 2020 at 20:34 Comment(5)
An LFB can be tracking an incoming cache line, not just a store. An LFB buffers between the L1d and the L2 or off-core. The store buffer buffers between execution and L1d (or off-core for NT stores). Some of the description of having data in an LFB waiting to merge with an RFO result doesn't fully make sense; we're not sure CPUs actually do anything like that. i.e. Dr. Bandwidth's mental model (at the time he wrote that specific post) might not match reality there. @ BeeOnRope, @ HadiBrais, and I have debated what does/doesn't make sense for that in previous SO Q&As, IIRCMcchesney
@PeterCordes Since each store is preceded by an RFO and since data from upper levels is stored in the LFBs, isn't it possible that the SB "writes" into the relative LFB? I.e. not using it as a temporary buffer while doing the RFO but writing into it after the RFO has brought data into it. Now, if the line the store would go to is already in EX state then I'm not sure an LFB is allocated. That seems a waste w.r.t. writing in the data lines directly but maybe the cache CAM doesn't allow for partial writes. Anyway, do we already have a canonical answer of SB <-> LFB interaction here?Commemorate
@MargaretBloom: IIRC, the main difficulties with this idea of committing from the SB into an LFB before it's architecturally allowed (memory ordering) to commit to L1d is that multiple stores to the same line lose memory-ordering info relative to each other (and anything else). We must maintain in-order stores even for code that alternates stores to two different lines. In Exclusive or Modified state there's no reason to expect an LFB to be involved in committing from SB to L1d, and before we reach that state it needs to stay in the SB for ordering. IDK if we have a canonical Q&A.Mcchesney
@PeterCordes Why would we want to commit stores to LFB before it's architecturally allowed? I was reasoning about the possibility of the SB to write to the LFB after the RFO brought the line into the LFB and before saving it in the cache's CAM. So this all happens after the core is sure the store is architecturally allowed.Commemorate
@MargaretBloom: Oh, now I see what you were saying. That sounds plausible and would be legal because the RFO is finished; we just have to make sure the store data shows up when responding to other cores. We already want to make sure we get a chance to commit at least one store before giving up the line again. So yes maybe we save on cache write ports by committing pending store(s) from the head of the SB into the LFB as the data arrives, maybe while the cache is indexing the right set/way to store the LFB. We do know that NT stores can write straight into an LFB, not cache, they're connectedMcchesney

Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?

The store buffer is used to track stores, in order, both before they retire and after they retire but before they commit to the L1 cache[2]. The store buffer is conceptually a totally local thing which doesn't really care about cache misses. The store buffer deals in "units" of individual stores of various sizes. Chips like Intel Skylake have store buffers of 50+ entries.

The line fill buffers primarily deal with loads and stores that miss in the L1 cache. Essentially, the LFBs are the path from the L1 cache to the rest of the memory subsystem, and they deal in cache-line-sized units. We don't expect the LFB to get involved if the load or store hits in the L1 cache[1]. Intel chips like Skylake have far fewer LFB entries, probably 10 to 12 (testing points to 12 for Skylake).
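To make the difference in granularity concrete, here is a small sketch. It is only an illustration, not a model of real hardware: the store list and the `line_address` helper are made up for the example, and the 64-byte line size is the usual value on current x86 chips.

```python
LINE_SIZE = 64  # bytes per cache line (typical for current x86 chips)


def line_address(addr: int) -> int:
    """Return the address of the 64-byte line that contains addr."""
    return addr & ~(LINE_SIZE - 1)


# Hypothetical store buffer contents in program order: (address, size in bytes).
# Each entry is an individual store of some size.
store_buffer = [(0x1000, 4), (0x1004, 8), (0x1038, 1), (0x2040, 8)]

# An LFB tracks whole lines, so these four stores touch only two distinct lines.
lines_needed = sorted({line_address(addr) for addr, _size in store_buffer})
```

Four store buffer entries here map to just two lines, which is part of why far fewer LFB entries than store buffer entries can be enough.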

Is the ordering of events correct in my description?

Pretty close. Here's how I'd change your list:

  1. A store instruction gets decoded and split into store-data and store-address uops, which are renamed and scheduled, and a store buffer entry is allocated for them.

  2. The store uops execute in any order or simultaneously (the two sub-items can execute in either order depending mostly on which has its dependencies satisfied first).

    1. The store data uop writes the store data into the store buffer.
    2. The store address uop does the V-P translation and writes the address(es) into the store buffer.
  3. At some point when all older instructions have retired, the store instruction retires. This means that the instruction is no longer speculative and the results can be made visible. At this point, the store remains in the store buffer and is called a senior store.

  4. The store now waits until it is at the head of the store buffer (it is the oldest uncommitted store), at which point it will commit (become globally observable) into the L1, if the associated cache line is present in the L1 in MESIF Modified or Exclusive state (i.e. this core owns the line).

  5. If the line is not present in the required state (either missing entirely, i.e. a cache miss, or present but in a non-exclusive state), permission to modify the line, and sometimes the line data as well, must be obtained from the memory subsystem: this allocates an LFB for the entire line, if one is not already allocated. This is a so-called request for ownership (RFO), which means that the memory hierarchy should return the line in an exclusive state suitable for modification, as opposed to a shared state suitable only for reading (this invalidates copies of the line present in any other private caches).

    An RFO to convert Shared to Exclusive still has to wait for a response to make sure all other caches have invalidated their copies. The response to such an invalidate doesn't need to include a copy of the data because this cache already has one. It can still be called an RFO; the important part is gaining ownership before modifying a line.

  6. In the miss scenario the LFB eventually comes back with the full contents of the line, which is committed to the L1, and the pending store can now commit[3].
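Steps 4 to 6 can be sketched as a toy model. This is a deliberately simplified illustration under several assumptions: `ToyCore`, its method names and the event log are all hypothetical, only senior (retired) stores are modelled, coherence state is reduced to a per-line letter, and the 12-entry LFB count is just the Skylake estimate mentioned above.

```python
from collections import deque

LINE_SIZE = 64


class ToyCore:
    """Toy model: in-order commit of senior stores, with RFOs tracked in LFBs."""

    def __init__(self, l1_states, num_lfbs=12):
        self.l1 = dict(l1_states)    # line address -> 'M', 'E' or 'S'
        self.store_buffer = deque()  # senior stores, oldest first
        self.lfbs = set()            # line addresses with an RFO in flight
        self.num_lfbs = num_lfbs
        self.log = []

    def retire_store(self, addr):
        self.store_buffer.append(addr)

    def step(self):
        """Try to commit the store at the head of the store buffer."""
        if not self.store_buffer:
            return
        line = self.store_buffer[0] & ~(LINE_SIZE - 1)
        if self.l1.get(line) in ('M', 'E'):
            # Step 4: the line is owned, so the store commits into L1.
            self.l1[line] = 'M'
            self.store_buffer.popleft()
            self.log.append(('commit', line))
        elif line not in self.lfbs and len(self.lfbs) < self.num_lfbs:
            # Step 5: not owned, so allocate an LFB and send an RFO.
            self.lfbs.add(line)
            self.log.append(('rfo', line))
        # Otherwise the head store simply waits for the RFO (or a free LFB).

    def rfo_complete(self, line):
        """Step 6: the line arrives with ownership and is installed in L1."""
        self.lfbs.discard(line)
        self.l1[line] = 'E'
        self.log.append(('fill', line))


core = ToyCore({0x1000: 'E', 0x2000: 'S'})
for addr in (0x1008, 0x2010, 0x3000):
    core.retire_store(addr)

core.step()                # hits an Exclusive line: commits
core.step()                # Shared line: RFO sent
core.step()                # still waiting, nothing happens
core.rfo_complete(0x2000)
core.step()                # now commits
core.step()                # line missing entirely: RFO sent
core.rfo_complete(0x3000)
core.step()                # commits
```

In this sketch the store to `0x3000` cannot start its RFO until the store to `0x2010` has committed, which is exactly the serialisation that RFO prefetch (discussed below) would relax.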

This is a rough approximation of the process. Some details may differ on some or all chips, including details which are not well understood.

As one example, in the above order, the store miss lines are not fetched until the store reaches the head of the store queue. In reality, the store subsystem may implement a type of RFO prefetch where the store queue is examined for upcoming stores and if the lines aren't present in L1, a request is started early (the actual visible commit to L1 still has to happen in order, on x86, or at least "as if" in order).

So the request and LFB use may occur as early as when step 3 completes (if RFO prefetch applies only after a store retires), or perhaps even as early as when 2.2 completes, if junior stores are subject to prefetch.
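As a rough sketch of what such an RFO prefetch could look like, the function below scans the queued stores and picks the lines worth requesting early. Everything here is hypothetical (the function name, its parameters and the policy of taking lines in queue order); real hardware heuristics are not documented.

```python
LINE_SIZE = 64


def rfo_prefetch_candidates(store_queue, owned_lines, in_flight, free_lfbs):
    """Pick lines to request early for queued stores.

    store_queue: store addresses, oldest first.
    owned_lines: lines already in L1 in Modified or Exclusive state.
    in_flight:   lines that already have an LFB allocated.
    free_lfbs:   number of LFBs still available.
    """
    requests = []
    for addr in store_queue:
        if len(requests) >= free_lfbs:
            break
        line = addr & ~(LINE_SIZE - 1)
        if line in owned_lines or line in in_flight or line in requests:
            continue
        requests.append(line)
    return requests


queue = [0x1000, 0x1008, 0x2040, 0x3000, 0x2048, 0x4000]
early = rfo_prefetch_candidates(queue, owned_lines={0x1000},
                                in_flight={0x3000}, free_lfbs=2)
```

Only the requests are issued early in this sketch; the commits themselves would still happen one at a time from the head of the queue.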

As another example, step 6 describes the line coming back from the memory hierarchy and being committed to the L1, then the store commits. It is possible that the pending store is actually merged instead with the returning data and then that is written to L1. It is also possible that the store can leave the store buffer even in the miss case and simply wait in the LFB, freeing up some store buffer entries.
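The merge possibility amounts to overlaying the store's bytes on the returning line before it is written to L1. A minimal sketch, assuming a 64-byte line and a hypothetical helper name:

```python
LINE_SIZE = 64


def merge_store_into_line(line_data: bytes, offset: int, store_data: bytes) -> bytes:
    """Overlay store_data on the returning line at the given byte offset."""
    if len(line_data) != LINE_SIZE:
        raise ValueError("expected a full 64-byte line")
    if offset < 0 or offset + len(store_data) > LINE_SIZE:
        raise ValueError("store must fit within the line")
    merged = bytearray(line_data)
    merged[offset:offset + len(store_data)] = store_data
    return bytes(merged)


# A 4-byte store at offset 8 merged into a line of zeros returned by the RFO.
merged_line = merge_store_into_line(bytes(LINE_SIZE), 8, b"\xde\xad\xbe\xef")
```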


[1] In the case of stores that hit in the L1 cache, there is a suggestion that the LFBs are actually involved: that each store actually enters a combining buffer (which may just be an LFB) prior to being committed to the cache, such that a series of stores targeting the same cache line get combined in the cache and only need to access the L1 once. This isn't proven but in any case it is not really part of the main use of LFBs (more obvious from the fact we can't even really tell if it is happening or not).

[2] The buffers that hold stores before and after retirement might be two entirely different structures, with different sizes and behaviors, but here we'll refer to them as one structure.

[3] The described scenario involves the store that misses waiting at the head of the store buffer until the associated line returns. An alternate scenario is that the store data is written into the LFB used for the request, and the store buffer entry can be freed. This potentially allows some subsequent stores to be processed while the miss is in progress, subject to the strict x86 ordering requirements. This could increase store MLP.

Briannebriano answered 10/4, 2020 at 15:27 Comment(11)
You said that step 4 (send an invalidate request) happens later, when the store is ready to commit. The concepts of retire/commit are new to me. Is this the right sequence of events: 1. The store uop executes in the store execution unit 2. It gets placed in the store buffer 3. The store uop is in the retirement unit's Reorder Buffer (ROB) until it is known to be not-speculative 4. The store buffer sends the invalidate read request (this may take some time, but since the store buffer keeps track of the req, the store doesn't have to wait around) ...to be continued..Ventose
Steps 5-7 in my question happen. Then the store buffer applies its value and thus it commits.Ventose
@DanielNäslund - I've created my own list, take a look and see if makes sense. WRT to your question, I believe the store buffer entry is actually allocated at rename, which happens even before execution (the uops enter the scheduler at rename). The buffer entry is basically empty at this point and then separate "address" and "data" uops fill those into the buffer entry when they execute. After retirement, one could think of the store buffer operating in order: stores are committed to L1 one at a time in the order they appear in the source (this is a requirement of the strong memory ...Briannebriano
ordering on x86, where stores to WB memory are forbidden from reordering). However, there may be optimization to that simple one-at-a-time model in that the store system may "look ahead" to pending stores and start getting those lines early. So the miss is not necessarily handled at an exactly specified moment, but rather a range of times which may also depend on the specific CPU, heuristics/predictors checking whether RFO prefetch has been helping out in practice, etc.Briannebriano
I added a bit to entry 5 about RFOs for lines that were present but shared (not owned exclusively)Mcchesney
Thanks @PeterCordes, forgot about that scenario, and now I added even a bit more.Briannebriano
Do we know if a cache-line split store still only needs one SB entry that just takes extra cycles to commit? As you say, it's totally local and doesn't really care about L1d cache misses so this seems possible. I guess this could be tested with an experiment where the SB filling prevented OoO exec of some dep chains before/after a block of stores. If so, I wonder if we could test for coalescing in the SB by using split stores to get work into the SB faster than it can commit to hot L1d lines, mixing split stores with a pair of dword stores to halves of a qword.Mcchesney
@PeterCordes - I would guess it only takes one SB entry: since those are allocated at rename, before you know it is split and it just needs to hold the data and address and then at commit the split is dealt with? It would be very easy to test based on a modification of robsize test 33 (just change the stores to be split).Briannebriano
@BeeOnRope: yeah, SB allocation was another part of my thinking that I forgot to type out. I wondered if page-split stores might be so extra expensive because of maybe having to alloc another SB entry, or just what the mechanism is; maybe each entry has room for two physical addresses in case of a page split?Mcchesney
@PeterCordes - good point about the two physical addresses. Let me modify test 33...Briannebriano
Asked over here.Briannebriano

When the uops reach the allocator, in the PRF + Retirement RAT scheme (SnB onwards), the allocator consults the front end RAT (F-RAT) when necessary for a rename for the ROB entries (i.e. when a write to an architectural register (e.g. rax) is performed) it assigns to each uop at the tail pointer of the ROB. The RAT keeps a list of free and in use physical destination registers (pdsts) in the PRF. The RAT returns the physical register numbers to be used that are free and then the ROB places those in the respective entries (in the RRF scheme, the allocator provided to the RAT the pdsts to be used; the RAT was unable to select because the pdsts to be used were inherently at the tail pointer of the ROB). The RAT also updates each architectural register pointer with the register it assigned to it i.e. it now points to the register that contains the most recent write data in the program order. At the same time, the ROB allocates an entry in the Reservation Station (RS). In the event of a store it will place a store-address uop and a store-data uop in the RS. The allocator also allocates SDB (store data buffer) / SAB (store address buffer) entries and only allocates when all the required entries in the ROB / RS / RAT / SDB / SAB are available

As soon as these uops are allocated in the RS, the RS reads the physical registers for its source operands and stores them in the data field and at the same time checks EU writeback busses for those source PRs (Physical Registers) associated ROB entries and the writeback data as they are being written back to the ROB. The RS then schedules these uops for dispatch to the store-address and store-data ports when they have all their completed source data.

The uops are then dispatched -- the store address uop goes to the AGU and the AGU generates the address, transforms it to a linear address and then writes the result into the SAB. I don't think a store requires a PR at all in the PRF+R-RAT scheme (meaning that a writeback doesn't need to occur to the ROB at this stage) but in the RRF scheme ROB entries were forced to use their embedded PR and everything (ROB / RS / MOB entries) were identified by their PR nos. One of the benefits of the PRF+R-RAT scheme is that the ROB and hence the maximum number of uops in-flight can be expanded without having to increase the number of PRs (as there will be instructions that do not require any), and everything is addressed by ROB entry nos in case the entries do not have identifying PRs.

The store data goes directly through the store converter (STC) to the SDB. As soon as they are dispatched, they can be deallocated for reuse by other uops. This prevents the much larger ROB from being limited by the size of the RS.

The address then gets sent to the dTLB and it then stores the physical tag output from the dTLB in the L1d cache's PAB.

The allocator already allocated the SBID and the corresponding entries in the SAB / SDB for that ROB entry (STA+STD uops are microfused into one entry), which buffer the results of the dispatched execution in the AGU / TLB from the RS. The stores sit in the SAB / SDB with a corresponding entry with the same entry no. (SBID), which links them together, until the MOB is informed by the retirement unit of which stores are retirement ready i.e. they are no longer speculative, and it is informed upon a CAM match of a ROB entry retirement pointer pointing to a ROB index/ID that is contained in the SAB / SDB entry (in a uarch that can retire 3 uops per cycle, there are 3 retirement pointers that point to the 3 oldest unretired instructions in the ROB, and only the ROB ready bit patterns 0,0,1 0,1,1 and 1,1,1 permit retirement pointer CAM matches to go ahead). At this stage, they can retire in the ROB (known as 'retire / complete locally') and become senior stores and are marked with a senior bit (Ae bit for STA, De bit for STD), and are slowly dispatched to the L1d cache, so long as the data in the SAB / SDB / PAB is valid.

The L1d cache uses the linear index in the SAB to decode a set in the tags array that would contain the data using the linear index, and in the next cycle uses the corresponding PAB entry with the same index value as the SBID to compare the physical tag with the tags in the set. The whole purpose of the PAB is to allow for early TLB lookups for stores to hide the overhead of a dTLB miss while they are doing nothing else waiting to become senior, and to allow for speculative page walks while the stores are still actually speculative. If the store is immediately senior then this early TLB lookup probably doesn't occur and it is just dispatched, and that is when the L1d cache will decode the tag array set and look up the dTLB in parallel and the PAB is bypassed. Remember though that it can't retire from the ROB until the TLB translation has been performed because there might be a PMH exception code (page fault, or access / dirty bits read need to be set while performing the page walk), or an exception code when the TLB needs to write through access / dirty bits it sets in a TLB entry. It is entirely possible that the TLB lookup for the store always occurs at this stage and does not perform it in parallel with the set decode (unlike loads). The store becomes senior when the PA in the PAB becomes valid (valid bit is set in the SAB) and it is retirement ready in the ROB.
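The reason the set can be decoded from the linear address before the physical tag is available can be shown with a little arithmetic. This sketch assumes a 32 KiB, 8-way L1d with 64-byte lines and 4 KiB pages (the usual figures for recent Intel cores); the function names and the example addresses are made up for illustration.

```python
LINE_SIZE = 64
NUM_WAYS = 8
CACHE_SIZE = 32 * 1024
NUM_SETS = CACHE_SIZE // (NUM_WAYS * LINE_SIZE)  # 64 sets
PAGE_SIZE = 4096


def set_index(addr: int) -> int:
    """Set index: address bits 6..11 under the assumed geometry."""
    return (addr // LINE_SIZE) % NUM_SETS


def physical_tag(paddr: int) -> int:
    """Tag: the physical address bits above the set index."""
    return paddr // (LINE_SIZE * NUM_SETS)


# Hypothetical linear address and its translation; both share the page offset 0xa40.
linear = 0x7ffd12345a40
physical = 0x1c3a40

index_from_linear = set_index(linear)
index_from_physical = set_index(physical)
tag = physical_tag(physical)
```

Because `LINE_SIZE * NUM_SETS` equals the page size here, the index bits lie entirely within the page offset, so translation does not change them and only the tag compare has to wait for the physical address.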

It then checks the state of the line. If it is a shared line, the physical address is sent to the coherence domain in an RFO (always RFO for writes), and when it has ownership of the line, it writes the data into the cache. If the line isn't present then an LFB is allocated for that cache line and the store data is stored in it and a request is sent to L2, which will then check the state of the line in L2 and initiate a read or RFO on the ring IDI interface.

The store becomes globally visible when the RFO completes and a bit in the LFB indicates it has permission to write the line meaning that the LFB will be written back coherently upon the next snoop invalidation or eviction (or in the event of a hit, when the data is written to the line). It is not considered globally visible when it is just written to the LFB before the fetching of the line in the event of a miss goes ahead (unlike senior loads that do retire on a hit or when a LFB is allocated by the L1d cache), because there may be other RFOs initiated by other cores which might reach the LLC slice controller before the request from the current core, which would be a problem if an SFENCE on the current core had retired based on this version of 'globally visible' retirement -- at least this provides a synchronisation guarantee for inter-processor interrupts. Globally visible means the moment from which a load on any other core will read the stored data, not a moment after which other cores may still briefly read the old value. Stores are completed by the L1d cache upon allocation of a LFB (or when they are written to the line in the event of a hit) and retire from the SAB / SDB. When all previous stores have retired from the SAB / SDB, this is when store_address_fence (not store_address_mfence) and its associated store_data_fence can be dispatched to the L1d. It is more practical for LFENCE to serialise the ROB instruction stream as well, whereas SFENCE/MFENCE do not because it would potentially cause a very long delay in the ROB for global visibility and is not necessary, unlike senior loads which retire instantly, so it makes sense why LFENCE was the fence chosen to serialise the instruction stream at the same time. SFENCE/MFENCE do not retire until all LFBs that were allocated become globally visible.

A line fill buffer can be in 1 of 3 modes: read, write or write combining. The purpose of the write line fill buffer I think is to combine multiple stores to the same line's data into the LFB and then when the line arrives, fill in the non-valid bits with the data fetched from L2. It may be at this stage that it is considered completed and therefore the writes are satisfied in bulk and a cycle earlier, rather than waiting for them to be written into the line. As long as it is now guaranteed to be written back to the cache in response to the RFO of another core. The LFB will probably remain allocated until it needs to be deallocated, allowing for slightly faster satisfying of subsequent reads and writes to the same line. A read line buffer can service read misses a few cycles quicker, because it is instantly available in the line fill buffer but takes longer to write it to the cache line and then read from it. A write combining buffer is allocated when the memory is a USWC type and allows writes to be satisfied immediately and flushed to a MMIO device all at once rather than having multiple core->PCIe transactions and having multiple PCIe transactions. The WC buffer also allows speculative reads from the buffer. Typically speculative reads are not allowed on UC memory because the read could change the state of the MMIO device but also the read/write takes so long that by the time it completes, it will no longer be speculative and is therefore perhaps not worth the extra traffic? An LFB is likely VIPT (VIPT/PIPT are the same on Intel, and the V is a linear address on Intel); I suppose it could have the physical and the virtual tag to eliminate further TLB lookups, but would have to negotiate something for when the physical page migrates to a new physical address.
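The write-mode behaviour I am guessing at above (combine stores in the LFB, then fill the non-valid bytes when the line arrives) can be sketched as follows. This is a speculative toy model, not a description of confirmed hardware: `ToyWriteLFB` and its methods are hypothetical names, and validity is tracked per byte purely for simplicity.

```python
LINE_SIZE = 64


class ToyWriteLFB:
    """Toy write-mode LFB: collects store bytes, then fills gaps from L2 data."""

    def __init__(self):
        self.data = bytearray(LINE_SIZE)
        self.valid = [False] * LINE_SIZE  # one valid flag per byte

    def write(self, offset: int, store_data: bytes) -> None:
        """Combine a store into the buffer; later stores overwrite earlier ones."""
        if offset < 0 or offset + len(store_data) > LINE_SIZE:
            raise ValueError("store must fit within the line")
        for i, byte in enumerate(store_data):
            self.data[offset + i] = byte
            self.valid[offset + i] = True

    def fill(self, line_from_l2: bytes) -> bytes:
        """When the line arrives, fill only the bytes no store has written."""
        if len(line_from_l2) != LINE_SIZE:
            raise ValueError("expected a full 64-byte line")
        for i in range(LINE_SIZE):
            if not self.valid[i]:
                self.data[i] = line_from_l2[i]
                self.valid[i] = True
        return bytes(self.data)


lfb = ToyWriteLFB()
lfb.write(0, b"\x11\x22")   # first store covers bytes 0..1
lfb.write(1, b"\x33")       # younger store overwrites byte 1
completed_line = lfb.fill(bytes([0xAA]) * LINE_SIZE)
```

Note that this sketch ignores the memory-ordering concern raised in the comments under the question: combining stores in the buffer loses their relative order, which is one reason it is unclear whether real CPUs do this before the RFO completes.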

Pavyer answered 26/1, 2021 at 8:51 Comment(4)
LFENCE also has to order any potentially weakly-ordered loads (like SSE4.1 movntdqa), and one easy way to do that is to make sure all previous loads are retired. (Because on x86, that means fully completed, not still waiting for a value.) It seems possible that Intel was considering the idea of weakening the x86 memory model when they introduced LFENCE, or something like that.Mcchesney
@PeterCordes I thought only NT stores are weakly ordered. I will be writing an answer on the model I have for the load hardware at some point if I can find a relevant question.Pavyer
movntdqa loads from WC memory are weakly ordered. But unlike NT stores, movntdqa doesn't override the memory-type attribute. I wrote Non-temporal loads and the hardware prefetcher, do they work together? a while ago; I have a half-finished update pointing out that NT-aware HW prefetchers would be needed for that to possibly work, and there's a conflict between truly bypassing cache for loads vs. maintaining coherency (or extra HW needed). Also note that regular stores to WC memory are also weakly ordered, they're exactly like NT stores.Mcchesney
@PeterCordes Oh yes I remember now, I think I mentioned it in an answer I wrote on write combining buffers a while ago and there was something relating to that that I didn't resolve but might be able to now. It was also discussed here: https://mcmap.net/q/15076/-what-is-the-meaning-of-quot-non-temporal-quot-memory-accesses-in-x86Pavyer
