There are a number of buffers in the L1 cache.
This patent gives the following buffer types:
- Snoop buffers (buffers that service M/E state snoops from other cores (read / RFO))
- Writeback buffers (buffers that service M state evictions from L1)
- Line fill buffers (buffers that service cacheable load/store L1 misses)
  - Read buffers (service L1 read misses of cacheable temporal loads)
  - Write buffers (service L1 write misses of cacheable temporal stores)
  - Write combining line fill buffers (not sure, appears to be the same thing as a write combining dedicated buffer in this patent)
- Dedicated buffers (buffers that service uncacheable loads/stores and are 'dedicated' for the purpose of fetching from memory and not L2 (but still pass the request through L2), and don't fill the cache line)
  - Non write combining dedicated buffers (services UC loads/stores and WP stores)
  - Write combining dedicated buffers (services USWC loads/stores)
The patent suggests these can all be functions of the same physical buffer, or they can be physically separate, with a set of buffers for each function. On Intel, the 12 LFBs on Skylake might be all there are, with the logical functions shared between them using a type or state field. In some embodiments, the line fill buffers can also handle USWC loads/stores. In some embodiments, dedicated buffers can handle cacheable non-temporal (NT) loads/stores that miss L1 (such that they do not 'fill' the L1d cache, like the name implies, taking advantage of the NT hint to prevent cache pollution).
'Write combining buffer' here implies USWC memory / non-temporality and inherent weak ordering and uncacheability, but the actual words 'write combining' do not imply any of these things, and could just be a concept on its own where regular write misses to the same store buffer are squashed and written into the same line fill buffer in program order. A patent suggests such functionality, so it is probable that regular temporal write buffers that aren't marked WC have a combining functionality. Related: Are write-combining buffers used for normal writes to WB memory regions on Intel?
The x86-64 optimisation manual states (massive giveaway):
On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.
Store ordering and visibility are also important issues for write combining. When a write to a write-combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining, there will be no RFO since the line is not cached, and there is no such delay.
This is blatant evidence of the term 'write combining buffer' being used to describe regular write buffers that have purely the combining ability, where strong ordering is maintained. We also now know that it's not just non-temporal stores to any memory that allocate write combining buffers, but all writes (because non-temporal stores do not issue RFOs). The buffer is used to combine writes while an RFO is taking place so the stores can be completed and store buffer entries can be freed up (possibly multiple if they all write to the same cache line). The invalid bits mark the parts of the buffer that have not been written, which are filled from the cache line when it arrives in E state. The LFB could be dumped to cache as soon as the line is present in cache, with all later writes to the line going directly to the cache line, or it could remain allocated to speed up further reads/writes until a deallocation condition occurs (e.g. it needs to be used for another purpose or an RFO arrives for the line, meaning it needs to be written back to the line).
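As a rough illustration of the combining behaviour the manual describes, here is a minimal Python sketch of one buffer. The class and method names are my own (hypothetical), a per-byte mask is assumed for the valid/invalid tracking, and nothing here is taken from the patent or from real hardware.

```python
LINE_SIZE = 64  # bytes per cache line


class FillBuffer:
    """Toy model of one write-combining line fill buffer (hypothetical names)."""

    def __init__(self, line_addr):
        self.line_addr = line_addr
        self.data = bytearray(LINE_SIZE)
        # Assumed per-byte mask: True once a store has written that byte.
        self.written = [False] * LINE_SIZE

    def store(self, offset, payload):
        """Combine a store into the buffer while the RFO is still outstanding."""
        if offset < 0 or offset + len(payload) > LINE_SIZE:
            raise ValueError("store crosses the cache line")
        for i, byte in enumerate(payload):
            self.data[offset + i] = byte
            self.written[offset + i] = True

    def fill(self, returned_line):
        """Merge the line returned by the RFO into the bytes not yet written."""
        if len(returned_line) != LINE_SIZE:
            raise ValueError("expected a full line")
        for i in range(LINE_SIZE):
            if not self.written[i]:
                self.data[i] = returned_line[i]
        return bytes(self.data)


lfb = FillBuffer(line_addr=0x1000)
lfb.store(0, b"\xAA\xBB")  # two stores to the same line share one buffer
lfb.store(8, b"\xCC")
merged = lfb.fill(bytes([0x11] * LINE_SIZE))
```

The stored bytes survive the fill and every other byte comes from the returned line, which is the 'combined with the unmodified bytes in the returned line' step in the quote.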
So it seems like nowadays, all buffers can be any type of logical buffer and all logical write buffers are write-combining buffers (unless UC), and the cache type determines the way the buffer is treated in terms of weak/strong ordering and whether RFOs are performed or whether it is written back to the cache. The cache type in the LFB comes either from the TLB (which acquires the cache type from the PMH, which analyses the PTE, PAT MSRs and MTRR MSRs and calculates the final cache type), or from the SAB (Store Address Buffer) after buffering the result of a speculative TLB lookup.
So now there are 6 types of buffers:
- Write combining LFB (WB write miss / prefetch)
- Read LFB (read miss / prefetch from anywhere other than UC and USWC)
- Write combining dedicated buffer (WP write, WT write miss, USWC read/write, NT read/write to anywhere other than UC)
- Dedicated buffer (UC read/write)
- Snoop buffer
- Eviction writeback buffer
These buffers are indexed by physical address and are scanned in parallel with the L1 cache and, if they contain valid data, can satisfy read/write hits faster and more efficiently until they are deallocated when a deallocation condition occurs. I think the '10 LFBs' value refers to the number of buffers available for the first 2 purposes. There is a separate FIFO queue for L1d writebacks.
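To make the list above concrete, here is a small Python sketch that maps an access to one of the first four logical buffer types. It is only my reading of the list: the function name, the string labels and the argument names are hypothetical, UC- is assumed to behave like UC, and the NT hit cases discussed later are deliberately ignored.

```python
def logical_buffer(mem_type, is_write, l1_hit=False, non_temporal=False):
    """Pick a logical buffer type for an access, or None if no buffer is needed.

    mem_type is one of "UC", "UC-", "USWC", "WT", "WP", "WB".
    """
    if mem_type in ("UC", "UC-"):
        return "dedicated"        # UC read/write (UC overrides the NT hint)
    if mem_type == "USWC" or non_temporal:
        return "wc_dedicated"     # USWC read/write, NT read/write
    if is_write and mem_type == "WP":
        return "wc_dedicated"     # WP writes are uncacheable
    if l1_hit:
        return None               # cacheable hit, served by the L1d itself
    if is_write:
        # WT write misses only update memory; WB write misses wait on an RFO.
        return "wc_dedicated" if mem_type == "WT" else "wc_lfb"
    return "read_lfb"             # cacheable read miss
```

For example, `logical_buffer("WB", True)` gives `"wc_lfb"` and `logical_buffer("WT", True)` gives `"wc_dedicated"`.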
Let's not forget the cache type order of precedence:
- UC (Intel E bit)
- USWC (PAT)
- UC (MTRR)
- UC (PAT)
- USWC (MTRR) (if combined with WP or WT (PAT/MTRR): either a logical AND or illegal, which defaults to UC)
- UC- (PAT)
- WT WP (PAT/MTRR) (combining MTRRs in this rank results in the logical AND of the memory types; combining MTRR and PAT in this rank results in a logical AND on Intel and is illegal on AMD, giving UC)
- WB (PAT/MTRR)
MTRR here includes the default type where a range is not mapped by an MTRR. MTRR is the final type that results from the MTRRs having resolved any conflicts or defaults. Firstly, defaults are resolved to UC and rank the same as any UC MTRR, then any MTRRs that conflict are combined into a final MTRR. Then this MTRR is compared with the PAT and the E bit and the one with the highest precedence becomes the final memory type, although in some cases, they are an illegal combination that results in a different type being created. There is no UC- MTRR.
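The ranking above can be sketched as a lookup, assuming the MTRRs have already been resolved to a single type. This is a simplification of my list rather than the vendor tables: the names are hypothetical, the USWC (MTRR) plus WT/WP (PAT) case is modelled as UC (one reading of the note in the list), and the WT/WP mix within the same rank is left unmodelled because it differs between Intel and AMD.

```python
# (source, type) pairs in the order of the list above; lower index wins.
# "E" stands for the E bit entry at the top of the list.
PRECEDENCE = [
    ("E", "UC"),
    ("PAT", "USWC"),
    ("MTRR", "UC"),
    ("PAT", "UC"),
    ("MTRR", "USWC"),
    ("PAT", "UC-"),
]
LOW_RANKS = {"WT": 6, "WP": 6, "WB": 7}


def rank(source, mem_type):
    """Position of a (source, type) pair in the precedence list."""
    if (source, mem_type) in PRECEDENCE:
        return PRECEDENCE.index((source, mem_type))
    if mem_type not in LOW_RANKS:
        raise ValueError(f"{mem_type} is not valid for {source}")
    return LOW_RANKS[mem_type]


def effective_type(mtrr, pat, e_bit_uc=False):
    """Combine the resolved MTRR type with the PAT type (simplified)."""
    if e_bit_uc:
        return "UC"
    if mtrr == "USWC" and pat in ("WT", "WP"):
        return "UC"  # assumption: treated as the 'defaults to UC' case
    r_mtrr, r_pat = rank("MTRR", mtrr), rank("PAT", pat)
    if r_mtrr == r_pat and mtrr != pat:
        raise NotImplementedError("WT/WP mix is vendor specific, not modelled")
    return mtrr if r_mtrr < r_pat else pat
```

A UC MTRR under a USWC PAT entry comes out as USWC, while a USWC MTRR under a UC PAT entry comes out as UC, matching the order of the first five entries.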
Description of cache types (temporal):
- UC (Strong Uncacheable). Speculative reads and write combining are not allowed. Strongly ordered.
- UC- (Weak Uncacheable) the same as UC except it is a lower precedence UC for the PAT
- USWC (Uncacheable Speculative Write Combining) speculation and write combining are allowed. Reads and writes are not cached. Both reads and writes become weakly ordered with respect to other reads and writes.
- WT (Write Through) reads are cacheable and behave like WB. WT writes that hit the L1 cache update both the L1 cache and external memory at the same time, whereas WT writes that miss the L1 cache only update external memory. Speculative reads and write combining are allowed. Strongly ordered.
- WP (Write Protect) reads are cacheable and behave like WB. Writes are uncacheable and cause lines to be invalidated. Speculative reads are allowed. Strongly ordered.
- WB (Write Back) everything is allowed. Strongly ordered.
Description of cache types (non-temporal):
- NT UC no difference (UC overrides)
- NT USWC no difference to USWC I think
- NT WT I would think this behaves identically to NT WB. Seems so.
- NT WP I'm not sure if WP overrides NT hint for writes only or reads as well. If it doesn't override reads, then reads presumably behave like NT WB, most likely.
- NT WB In the patent at the top of the answer, NT reads can hit L1 cache and it uses a biased LRU policy that reduces pollution (which is something like forcing the set's tree PLRU to point to that way). Read misses act like USWC read misses and a write combining dedicated buffer is allocated and it causes any aliasing lines in LLC or other cores or sockets to be written back to memory before reading the line from memory and reads are also weakly ordered. It is implementation specific as to what happens on modern Intel CPUs for NT WB reads -- the NT hint can be completely ignored and it behaves like WB (see full discussion). Write hits in L1 cache in some implementations can merge the write with the line in the L1 with a forced PLRU such that it is evicted next (as WB), alternatively a write hit causes an eviction and then a write combining dedicated buffer is allocated as if there were a miss, which is written back as USWC (using WCiL(F)) on the deallocation condition. Write misses allocate a dedicated write combining buffer and it is written back to memory as USWC when deallocated, but if that miss results in a L2 hit, the write combining buffer is written to L2 immediately or on a deallocation condition and this either causes an immediate eviction from L2 or it forces the PLRU bits so it is the next eviction. Further reads/writes to the line continue to be satisfied by the buffer until it is deallocated. NT Writes are weakly ordered. A Write hit in L1/L2 that isn't in an M/E state may still result in a WiL to invalidate all other cores on the current and other sockets to get the E state, otherwise, it just invalidates the line and when the USWC store is finally made, the LLC checks to see if any other cores on the current or a remote socket need to be invalidated.
If a full USWC store (opcode WCiLF) hits in the LLC cache, the Cbo sends IDI invalidates (for some reason invalidate IDI opcode (as part of egress request in the IPQ logical queue of the TOR) sent by Cbo is undocumented) to all cores with a copy and also always sends a QPI InvItoE regardless of whether there is a LLC miss or not, to the correct home agent based on SAD interleave rules. The store can only occur once all cores in the filter have responded to the invalidation and the home agent has also; after they have responded, the Cbo sends a WrPull_GO_I (which stands for Write Pull with globally observed notification and Invalidate Cache Line) of the data from L2 and sends the data to home. If a partial USWC store WCiL hits in the LLC cache, the same occurs, except if the line is now modified in the LLC slice (from a SnpInv it sent instead of an invalidate if the line was only present in one core -- I'm guessing it does do this and doesn't just send plain invalidates for WCiL like it does for WCiLF) or was modified in the LLC all along, the Cbo performs a WBMtoI/WbMtoIPtl to the home agent before performing a write enable bit writeback WcWrPtl for the USWC store. PATs operate on virtual addresses, so aliasing can occur, i.e. the same physical page can have multiple different cache policies. Presumably, WP write and UC read/write aliasing also has the same behaviour, but I'm not sure.
The core superqueue is an interface between L2 and L3. The SQ is also known as the 'off core requests buffer' and an offcore request is any request that has reached the SQ. Although, I believe entries are allocated for filling the L2 on a L1 writeback, which isn't really a 'request'. It therefore follows that OFFCORE_REQUESTS_BUFFER.SQ_FULL can happen when the L1D writeback pending FIFO requests buffer is full, suggesting that another entry in the SQ cannot be allocated if that buffer is full, suggesting that entries are allocated in the SQ and that buffer at the same time. As for a LFB, on a L2 hit, the data is provided directly to the LFB; otherwise, on a miss, it allocates a SQ entry and the data is provided to the LFB once the fetched data from both 32B IDI transactions has been written into the SQ. A further L2 miss can hit the SQ and is squashed to the same entry (SQ_MISC.PROMOTION).
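The allocate / squash / fill behaviour I describe for the SQ can be sketched as a toy queue. All names are hypothetical, the capacity is an assumed parameter rather than a figure from the text, and the two 32B halves are assumed to arrive in order.

```python
class SuperQueue:
    """Toy model of L2 misses being tracked in the superqueue (hypothetical)."""

    def __init__(self, capacity=16):
        self.capacity = capacity  # assumed size, only for illustration
        self.entries = {}         # line address -> 32-byte halves received so far

    def l2_miss(self, line_addr):
        """Handle an L2 miss for a line; later misses to it share the entry."""
        if line_addr in self.entries:
            return "promoted"     # squashed onto the existing entry
        if len(self.entries) >= self.capacity:
            return "full"         # no entry can be allocated
        self.entries[line_addr] = []
        return "allocated"

    def idi_data(self, line_addr, half):
        """Record one 32B transfer; return the 64B line once both have arrived."""
        halves = self.entries[line_addr]
        halves.append(half)
        if len(halves) < 2:
            return None
        del self.entries[line_addr]  # entry is freed, line goes to the LFB
        return halves[0] + halves[1]
```

With a capacity of 1, a second miss to the same line reports `"promoted"` while a miss to a different line reports `"full"` until both halves of the first line have arrived.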
An RFO intent begins at the store buffer and if it hits the L1d cache in an M or E state, the write is performed and the RFO ends. If the line is in an I state, a LFB is allocated and the RFO propagates to L2, where it can be satisfied there if present in an M or E state (when a M line is written back to L2, it becomes an M state there with respect to L3). If it is an I state / not present, it is allocated in the SQ and an RFO or ItoM packet propagates to the corresponding LLC slice Cbo that handles the address range. The Cbo slice then invalidates other cores, using the snoop filter, which involves sending invalidate requests to cores (or snoop invalidates (SnpInv), if it is only present in one core -- which get the data as well, because the Cbo does not know whether this is modified or not). The Cbo waits until it receives acknowledgements of the invalidation from the cores (as well as the data if modified). The Cbo then indicates to the SQ of the requesting core that it now has exclusive access. It likely acknowledges this early because the Cbo may have to fetch from the memory controller, therefore it can acknowledge early that the data is not present in any other core. The SQ propagates this information to the L1d cache, which results in a globally observed bit being set in the LFB and the senior store can now retire from the SAB/SDB to free up its entry. When the data eventually arrives, it is propagated to the LFB, where it is merged into the invalid bits and then it is written to the cache upon a deallocation condition for that address or due to LFB resource constraints.
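The walk described above can be written out as a short trace function. This is only a sketch of my description, with hypothetical names and step labels; it treats an S state in L1 like a miss, which is an assumption since an LFB may or may not be allocated in that case.

```python
def rfo_path(l1_state, l2_state):
    """List the steps an RFO for a WB store goes through.

    States are MESI letters: "M", "E", "S" or "I".
    """
    steps = ["store buffer"]
    if l1_state in ("M", "E"):
        steps.append("write L1d")  # ownership already held, RFO ends here
        return steps
    steps.append("allocate LFB")   # assumption: S is handled like I here
    if l2_state in ("M", "E"):
        steps.append("satisfied by L2")
        return steps
    steps.append("allocate SQ entry")
    steps.append("request to LLC slice Cbo")
    steps.append("Cbo invalidates other cores")
    steps.append("globally observed bit set in LFB")  # senior store can retire
    steps.append("data merged into LFB")
    return steps
```

`rfo_path("E", "I")` stops after the L1d write, while `rfo_path("I", "I")` runs all the way out to the Cbo and back.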
If a WB line is present in L1 but in an S state, it may or may not allocate a LFB to merge stores before the line can be written to. If it is invalid / not present in L1, an LFB is allocated to merge stores. Then, if the line is present in L2 but is in an S state, a WiL packet is sent to the LLC slice (it only needs to invalidate other cores). It then informs the SQ of the requesting core that it now can transition it to an E state. This information is propagated to the L1d cache where the LFB can now be merged into the cache before a deallocation condition occurs for that address or LFB resource constraints.
ItoM is used instead of an RFO when it's assumed that the full line is going to be written to so it doesn't need a copy of the data already in the line, and it already has the data if it's in any other state (S, E, M). A theoretical StoI i.e. a WiL is the same thing as an RFO, same for E, all except for I, where ItoM and RFO differ in that the LLC doesn't need to send the data to the core for an ItoM. The name emphasises only the state changes. How it knows the whole line is going to be written to by stores I don't know. Maybe the L1d cache can squash a bunch of sequential senior stores in the MOB all at once while it allocates a LFB, because the RFO is sent immediately upon allocation I thought (and then retires them all once the RFO arrives). I guess it has some further time for stores to arrive in the LFB (L2 lookup) before the opcode has to be generated. This also might be used by rep stos.
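Putting the opcode choice into code, as I understand it: the function below picks the request the core would send to the LLC slice from the line's state in L2. The function name and the full_line_write flag are hypothetical, and how the core actually decides that a full line will be written is exactly the part I don't know.

```python
def offcore_opcode(l2_state, full_line_write=False):
    """Choose the request sent to the LLC slice for a store, or None.

    l2_state is a MESI letter; full_line_write is an assumed hint that
    every byte of the line is about to be overwritten.
    """
    if l2_state in ("M", "E"):
        return None   # ownership already held, nothing leaves the core
    if l2_state == "S":
        return "WiL"  # data is present, other copies only need invalidating
    if l2_state == "I":
        # ItoM skips the data transfer because the whole line is overwritten.
        return "ItoM" if full_line_write else "RFO"
    raise ValueError(f"unknown state {l2_state!r}")
```

So `offcore_opcode("I")` gives `"RFO"`, `offcore_opcode("I", full_line_write=True)` gives `"ItoM"`, and `offcore_opcode("S")` gives `"WiL"`.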
I'm assuming RFO IDI packets don't need to distinguish between demand lock RFO, prefetch RFO, demand regular RFO (non-prefetch), to correspond with the Xeon 5500 core events, but might for priority purposes (prioritise demand traffic over prefetch), otherwise only the core needs to know this information, this is either encoded in an RFO or there are separate undocumented opcodes. PrefRFO is sent by the core for prefetching into LLC.
L1i ostensibly lacking fill buffers implies the main benefit of the fill buffer is a location to store and combine stores and have store buffer entries free up more quickly. Since L1i does not perform any stores, this isn't necessary. I would have thought that it does have read LFBs still so that it can provide miss data while or before filling the cache, but subsequent reads are not sped up because I think the buffers are PIPT and their tags are scanned in parallel with the cache. Read LFBs would also squash reads to point to the LFB and prevent multiple lookups, as well as prevent the cache from blocking by tracking current misses in the LFBs MSHRs, so it's highly likely this functionality exists.