Does software prefetching allocate a Line Fill Buffer (LFB)?

I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.

A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFBs per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is about 75 ns (plausible), and each transfer is 128 bytes (a 64 B cache line plus its hardware-prefetched twin), this limits bandwidth per core to 10 * 128 B / 75 ns ≈ 17 GB/s. Benchmarks such as single-threaded STREAM confirm that this is roughly right.
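For concreteness, the arithmetic is just Little's Law rearranged; a small sketch with the figures above (the 75 ns latency is an assumed value):

    #include <stdio.h>

    int main(void) {
        /* Little's Law: concurrency = bandwidth * latency, rearranged for bandwidth. */
        double latency_ns   = 75.0;   /* assumed round-trip DRAM latency           */
        double lfbs         = 10.0;   /* outstanding L1 misses per core            */
        double bytes_per_tx = 128.0;  /* 64 B line + its hardware-prefetched twin  */

        double gbps = lfbs * bytes_per_tx / latency_ns;   /* bytes/ns == GB/s */
        printf("max per-core bandwidth: ~%.0f GB/s\n", gbps);
        return 0;
    }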

The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the prefetch instructions (issued via the _mm_prefetch() intrinsic) themselves consume LFBs, so they too are subject to the same limits. Hardware prefetches don't touch the LFBs, but also will not cross page boundaries.
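For reference, a minimal sketch of how the intrinsic maps to those instructions (what each hint actually does per cache level is microarchitecture dependent):

    #include <immintrin.h>

    /* The second argument of _mm_prefetch() selects the instruction emitted;
       which cache levels each hint fills varies by microarchitecture. */
    void prefetch_hints(const char *p)
    {
        _mm_prefetch(p, _MM_HINT_T0);   /* PREFETCHT0  - all cache levels              */
        _mm_prefetch(p, _MM_HINT_T1);   /* PREFETCHT1  - L2 and outward                */
        _mm_prefetch(p, _MM_HINT_T2);   /* PREFETCHT2  - L3 and outward per the docs,
                                           often same as T1 in practice               */
        _mm_prefetch(p, _MM_HINT_NTA);  /* PREFETCHNTA - non-temporal, minimize pollution */
    }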

But I can't find any of this documented anywhere. The closest I've found is a 15-year-old article that mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFBs are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.

So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?

Merthiolate asked 19/10, 2013 at 22:54 Comment(3)
Very interesting read. Did you manage to get further with this question? Some observations. I reach 16GB/s per core on a Sandy Bridge laptop (21GB/s peak). Also, I have tested a Nehalem E7-4870 with a peak bandwidth of 34GB/s per socket. On a single core I am able to achieve only 5GB/s, although there are 10 fill buffers on both architectures. Using your model, this would suggest a 3 times larger latency of 200ns, which I think is not realistic. Latencies should be comparable on the two architectures.Trimester
@angainor: Many-core Xeons do have much higher latency in the uncore, and consequently much lower single-core L3/memory bandwidth than laptop/desktop chips (see the "latency bound platforms" section of this answer for details + links). Being multi-socket hurts too: it has to snoop the cache of the other socket before loading from local DRAM. Where Xeons are great is aggregate memory bandwidth from many threads.Broth
I think Skylake has bumped the number of LFBs from 10 to 12.Incurable

Based on my testing, all types of prefetch instructions consume line fill buffers on recent Intel mainstream CPUs.

In particular, I added some load & prefetch tests to uarch-bench, which use large-stride loads over buffers of various sizes. Here are typical results on my Skylake i7-6700HQ:

                     Benchmark   Cycles    Nanos
  16-KiB parallel        loads     0.50     0.19
  16-KiB parallel   prefetcht0     0.50     0.19
  16-KiB parallel   prefetcht1     1.15     0.44
  16-KiB parallel   prefetcht2     1.24     0.48
  16-KiB parallel prefetchtnta     0.50     0.19

  32-KiB parallel        loads     0.50     0.19
  32-KiB parallel   prefetcht0     0.50     0.19
  32-KiB parallel   prefetcht1     1.28     0.49
  32-KiB parallel   prefetcht2     1.28     0.49
  32-KiB parallel prefetchtnta     0.50     0.19

 128-KiB parallel        loads     1.00     0.39
 128-KiB parallel   prefetcht0     2.00     0.77
 128-KiB parallel   prefetcht1     1.31     0.50
 128-KiB parallel   prefetcht2     1.31     0.50
 128-KiB parallel prefetchtnta     4.10     1.58

 256-KiB parallel        loads     1.00     0.39
 256-KiB parallel   prefetcht0     2.00     0.77
 256-KiB parallel   prefetcht1     1.31     0.50
 256-KiB parallel   prefetcht2     1.31     0.50
 256-KiB parallel prefetchtnta     4.10     1.58

 512-KiB parallel        loads     4.09     1.58
 512-KiB parallel   prefetcht0     4.12     1.59
 512-KiB parallel   prefetcht1     3.80     1.46
 512-KiB parallel   prefetcht2     3.80     1.46
 512-KiB parallel prefetchtnta     4.10     1.58

2048-KiB parallel        loads     4.09     1.58
2048-KiB parallel   prefetcht0     4.12     1.59
2048-KiB parallel   prefetcht1     3.80     1.46
2048-KiB parallel   prefetcht2     3.80     1.46
2048-KiB parallel prefetchtnta    16.54     6.38

The key thing to note is that none of the prefetching techniques are much faster than loads at any buffer size. If any prefetch instruction didn't use the LFB, we would expect it to be very fast for a benchmark that fits into the level of cache it prefetches to. For example, prefetcht1 brings lines into the L2, so for the 128-KiB test we might expect it to be faster than the load variant if it doesn't use LFBs.
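For reference, the kernels behind these numbers are essentially of the following shape (a simplified sketch, not the actual uarch-bench source):

    #include <immintrin.h>
    #include <stddef.h>

    /* Touch one 64-byte line per iteration over a buffer of the given size,
       either with a demand load or with a software prefetch. The accesses
       are independent, so the limit is how many can be in flight at once. */
    void loads_kernel(const char *buf, size_t bytes)
    {
        volatile char sink = 0;
        for (size_t off = 0; off < bytes; off += 64)
            sink = buf[off];                         /* "parallel loads" variant */
        (void)sink;
    }

    void prefetcht1_kernel(const char *buf, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += 64)
            _mm_prefetch(buf + off, _MM_HINT_T1);    /* "parallel prefetcht1" variant */
    }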

More conclusively, we can examine the l1d_pend_miss.fb_full counter, whose description is:

Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.

The description already indicates that SW prefetches need LFB entries and testing confirmed it: for all types of prefetch, this figure was very high for any test where concurrency was a limiting factor. For example, for the 512-KiB prefetcht1 test:

 Performance counter stats for './uarch-bench --test-name 512-KiB parallel   prefetcht1':

        38,345,242      branches                                                    
     1,074,657,384      cycles                                                      
       284,646,019      mem_inst_retired.all_loads                                   
     1,677,347,358      l1d_pend_miss.fb_full                  

The fb_full value is more than the number of cycles, meaning that the LFB was full almost all the time (it can be more than the number of cycles since up to two loads might want an LFB per cycle). This workload is pure prefetches, so there is nothing to fill up the LFBs except prefetch.

The results of this test also contradict the claimed behavior in the section of the manual quoted by Leeor:

There are cases where a PREFETCH will not perform the data prefetch. These include:

  • ...
  • If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.

Clearly this is not the case here: the prefetch requests are not dropped when the LFBs fill up, but are stalled like a normal load until resources are available (this is not an unreasonable behavior: if you asked for a software prefetch, you probably want to get it, perhaps even if it means stalling).

We also note the following interesting behaviors:

  • It seems like there is some small difference between prefetcht1 and prefetcht2 as they report different performance for the 16-KiB test (the difference varies, but is consistently different), but if you repeat the test you'll see that this is more likely just run-to-run variation as those particular values are somewhat unstable (most other values are very stable).
  • For the L2 contained tests, we can sustain 1 load per cycle, but only one prefetcht0 prefetch. This is kind of weird because prefetcht0 should be very similar to a load (and it can issue 2 per cycle in the L1 cases).
  • Even though the L2 has ~12 cycle latency, we are able to fully hide it with only 10 LFBs: we get 1.0 cycles per load (limited by L2 throughput), not the 12 / 10 == 1.2 cycles per load that we'd expect (best case) if the LFBs were the limiting factor (and the very low values for fb_full confirm it). That's probably because the 12-cycle latency is the full load-to-use latency all the way to the execution core, which includes several cycles of additional latency (e.g., L1 latency is 4-5 cycles), so the actual time spent in the LFB is less than 10 cycles.
  • For the L3 tests, we see values of 3.8-4.1 cycles, very close to the expected 42/10 = 4.2 cycles based on the L3 load-to-use latency. So we are definitely limited by the 10 LFBs when we hit the L3. Here prefetcht1 and prefetcht2 are consistently 0.3 cycles faster than loads or prefetcht0. Given the 10 LFBs, that equals 3 cycles less occupancy, more or less explained by the prefetch stopping at L2 rather than going all the way to L1.
  • prefetchtnta generally has much lower throughput than the others outside of L1. This probably means that prefetchtnta is actually doing what it is supposed to, and appears to bring lines into L1, not into L2, and only "weakly" into L3. So for the L2-contained tests it has concurrency-limited throughput as if it was hitting the L3 cache, and for the 2048-KiB case (1/3 of the L3 cache size) it has the performance of hitting main memory. prefetchnta limits L3 cache pollution (to something like only one way per set), so we seem to be getting evictions.

Could it be different?

Here's an older answer I wrote before testing, speculating on how it could work:

In general, I would expect any prefetch that results in data ending up in L1 to consume a line fill buffer, since I believe that the only path between L1 and the rest of the memory hierarchy is the LFB [1]. So SW and HW prefetches that target the L1 probably both use LFBs.

However, this leaves open the possibility that prefetches that target L2 or higher levels don't consume LFBs. For the case of hardware prefetch, I'm quite sure this is the case: you can find many references that explain that HW prefetch is a mechanism to effectively get more memory parallelism beyond the maximum of 10 offered by the LFBs. Furthermore, it doesn't seem like the L2 prefetchers could use the LFBs if they wanted to: they live in/near the L2 and issue requests to outer levels, presumably using the superqueue, and wouldn't need the LFBs.

That leaves the software prefetches that target the L2 (or higher), such as prefetcht1 and prefetcht2 [2]. Unlike requests generated by the L2, these start in the core, so they need some way to get from the core out, and this could be via the LFB. The Intel Optimization guide has the following interesting quote (emphasis mine):

Generally, software prefetching into the L2 will show more benefit than L1 prefetches. A software prefetch into L1 will consume critical hardware resources (fill buffer) until the cacheline fill completes. A software prefetch into L2 does not hold those resources, and it is less likely to have a negative performance impact. If you do use L1 software prefetches, it is best if the software prefetch is serviced by hits in the L2 cache, so the length of time that the hardware resources are held is minimized.

This would seem to indicate that software prefetches don't consume LFBs - but this quote only applies to the Knights Landing architecture, and I can't find similar language for any of the more mainstream architectures. It appears that the cache design of Knights Landing is significantly different (or the quote is wrong).


[1] In fact, I think that even non-temporal stores use the LFBs to get out of the execution core - but their occupancy time is short because as soon as they get to the L2 they can enter the superqueue (without actually going into L2) and then free up their associated LFB.

[2] I think both of these target the L2 on recent Intel, but this is also unclear - perhaps the t2 hint actually targets LLC on some uarchs?
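Regarding footnote [1], a minimal sketch of the "write the whole line with back-to-back NT stores" pattern (dst is assumed 16-byte aligned):

    #include <emmintrin.h>

    /* Fill one 64-byte cache line with four back-to-back 16-byte NT stores.
       Completing the whole line before touching another one is what lets the
       write-combining buffer (an LFB) hand off a single full-line transaction. */
    void nt_store_line(char *dst, __m128i v)   /* dst assumed 16-byte aligned */
    {
        _mm_stream_si128((__m128i *)(dst +  0), v);
        _mm_stream_si128((__m128i *)(dst + 16), v);
        _mm_stream_si128((__m128i *)(dst + 32), v);
        _mm_stream_si128((__m128i *)(dst + 48), v);
    }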

Incurable answered 24/12, 2017 at 8:21 Comment(17)
You think NT stores use L1D on the way to a LFB? Intel describes them as using an LFB as the write-combining buffer, so I think that means stores go straight into the LFB. I think load execution units are connected directly to the LFB as well as L1D to support early-restart when data arrives to satisfy a demand-load (also for movntdqa loads from USWC memory), so direct connection to store units seems likely, too.Broth
If you write an entire cache line with back-to-back NT stores like you're supposed to, then yeah the LFB occupancy could be short if it really hands off to the superqueue instead of going straight from the LFB to the memory controller. Otherwise the LFB stays occupied until other demand for LFBs flushes a partially-written LFB.Broth
@PeterCordes - were you referring to footnote 1? It was a typo - I had intended to write "use the LFBs" not "use the L1". Still the topic is interesting. I don't know what happens if, for example, you write only part of a cache line with NT stores and it eventually gets flushed: does it have to wait to read the remaining (unwritten) bytes so it can send down a full 64-byte line (since LFBs are usually moving full lines) or is there a special mode where you can effectively do a masked NT write of only some bytes in a line?Incurable
Yeah, I was. Good question. I forget what I've read about this. Presumably the memory system has some way to represent non-full-line stores for uncacheable regions, so it can probably do that unless that only works for power-of-2 sizes. IDK if the memory controller turns it into a RMW or if it can do a partial burst store to the DDRx DRAM.Broth
@PeterCordes I don't think there is any path direct from the LFB to the memory controller: I think those lines can only go "further" if they use the SQ: even at a physical layout level it would be awkward to have a whole separate path from L2 (the end of the LFB path) all the way to the memory controllers, when you have the SQ doing the same thing. So NT stores are special in this respect compared to normal load and store misses in L1 which all need to hold an LFB for the duration of the miss, and so the 10 LFBs are the limiting factor.Incurable
Since NT stores don't need to hold the LFB, other factors become the bottleneck (perhaps the SQ) leading to the odd situation where the pure write ("memset") bandwidth with NT stores can be higher than any other combination of reads and writes. You also see that pure-store workloads benefit the most from NT stores. For example, on my system I get about ~27 GB/s for memset but less than 12 GB/s for memcpy (meaning less than 24 GB/s of total bandwidth when you count the 1 read and 1 write in memcpy).Incurable
Relaying through the SQ makes sense. It makes sense to me that each core would have one "external" communication link, not separate physical links for L3 -> LFB (prefetchnta) and LFB <-> memory (NT store, and movntdqa from USWC memory) for those special instructions.Broth
@PeterCordes - FWIW, although the original version was a pure typo, I think NT stores do interact with the L1 in as much as a store to a location whose corresponding line is dirty in L1 needs to fill in any non-written bytes from L1 before flushing it. Yes, there are looser ordering guarantees wrt NT stores, but not so loose that they could just ignore data in the same line in L1 (and anyways they are documented to flush those lines). So probably on LFB allocation for an NT store the L1 is probed and if present the LFB is populated with the L1 contents, or perhaps it happens at flush.Incurable
Right, NT stores mixed with regular stores are still observed by the current thread to happen strictly in program order. IDK how much it sucks for performance to mix NT and non-NT stores to the same line. IDK if it can just move the L1D line into the LFB; you can't turn regular stores into NT stores by doing that on a Modified line; that could result in them being reordered. I wouldn't be surprised if dirty L1D data has to write-back before an LFB can be allocated for the NT store.Broth
OTOH, Intel's vol3 manual says something about not using NT stores on the same line as a semaphore, but it doesn't explain whether that's a performance or correctness issue, or under what conditions there could be a real problem. So possibly an NT store can turn an already-dirty L1D line into NT stores.Broth
Putting L1 data into the LFB doesn't necessarily seem problematic to me - you are totally inside the "private" cache coherence domain of the core here, and evictions of lines in different states happen basically at arbitrary times already. It is up to the cache coherency mechanism to make this all work, but I don't see a particular problem with loading up the LFB with L1 data in response to an NT store. Perhaps NT stores also go through the store buffer (that seems to be the easiest way to keep them in-order for the current core when mixed with regular stores).Incurable
Good point. I was thinking that normal write-back might hold onto the line in Modified state until it was actually written back, but that makes no sense. You want to free up the cache line ASAP to do whatever triggered the eviction. I guess LFBs are also part of the cache-coherency domain that includes all caches in all cores, at least when they're holding data being written back. Or does L2 allocate a line in M state first? But I've heard NT stores described as "not coherent". Maybe that just means "not ordered", or maybe LFBs holding NT data don't respond to RFOs from other cores.Broth
Yes, I had another reply that I didn't post but it was along the lines of: i.e., my belief is that the non-ordering of NT stores wrt to regular stores is mostly a "close to the core" issue regarding the way they sit in the write combining buffers (LFBs): they probably aren't probed by snoops from other cores, and they don't order themselves with the RFOs done by local stores (that aren't already locally M). Once the line becomes globally visible (e.g., hits the L3) I think it is coherent in the same way as other non-NT operations.Incurable
The primary basis for this belief is that MFENCE is capable of ordering local stores wrt NT stores, and MFENCE is a relatively cheap local operation. If NT stores were to escape the core in a totally unordered/not-coherent way, how could an MFENCE ever fix that? It seems "too late" if say NT stores occurred before an MFENCE, and if they occurred after, they would still need to be coherent to respect ordering with stores prior to the mfence. So I conclude they are "mostly coherent" with the not-coherent part being close to the core.Incurable
Good point about MFENCE / SFENCE. Yes, it pretty much has to be a local thing, that's what I was already thinking. But it's still a complication to have dirty non-NT data in the same LFB as NT-store data. If it's not already requesting write-back to L2, then maybe that LFB sits outside the coherency domain? I wouldn't be surprised if there's no support for having an LFB both coherent / ordered and waiting to accept a full buffer of NT data before flushing itself.Broth
Ignoring NT stores for a moment, I basically I think of the whole L1 + LFB + L2 as a single unit (the "private cache") for the purposes of concurrency. It is that thing as a whole which makes external requests to lines and which must respond to snoops from other cores. If a core has a line in the M state, and a snoop comes in from another core, and the line is "in the middle" of being evicted from L1 to L2, what happens? One option is that the line might only exist in the LFBs, which probably implies that the LFBs must be probed ...Incurable
... another option is that the line must stay in the L1 until it is committed to L2, which means the LFBs don't need to be probed. Other options include that the snoop just waits long enough that this transient condition finishes, or that the L2 includes enough info about what lines are in L1 so that this condition can be detected. In most of these cases, having an LFB sitting there with a mix of stores doesn't seem problematic: in option 1 the LFBs are probed, in option 2 non-NT data is still there in L1, so it works, etc. The timing-based ones don't work though!Incurable

First of all, a minor correction - read the optimization guide, and you'll note that some HW prefetchers belong to the L2 cache, and as such are not limited by the number of fill buffers but rather by the L2 counterpart (the superqueue).

The "spatial prefetcher" (the colocated 64B line you mention, completing the pair to a 128B chunk) is one of them, so in theory if you fetch every other line you'll be able to get a higher bandwidth (some DCU prefetchers might try to "fill the gaps" for you, but in theory they should have lower priority so it might work).
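A minimal sketch of that access pattern (whether it actually wins depends on those DCU prefetchers and on the spatial prefetcher behaving as hoped):

    #include <stddef.h>

    /* Demand-load only every other 64-byte line; if the spatial prefetcher
       completes each access to its 128-byte-aligned pair, the skipped lines
       arrive without occupying an extra fill buffer. */
    long sum_even_lines(const char *buf, size_t bytes)
    {
        long sum = 0;
        for (size_t off = 0; off < bytes; off += 128)   /* stride = two cache lines */
            sum += *(const long *)(buf + off);
        return sum;
    }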

However, the "king" prefetcher is the other guy, the "L2 streamer". Section 2.1.5.4 reads:

Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.

The important part is -

The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.

This 2:1 ratio means that for a stream of accesses that is recognized by this prefetcher, it would always run ahead of your accesses. It's true that you won't see these lines in your L1 automatically, but it does mean that if all works well, you should always get L2 hit latency for them (once the prefetch stream has had enough time to run ahead and mitigate L3/memory latencies). You may only have 10 LFBs, but as you noted in your calculation - the shorter the access latency becomes, the faster you can recycle them and the higher the bandwidth you can reach. This is essentially splitting the L1 <-- mem latency into parallel streams of L1 <-- L2 and L2 <-- mem.

As for the question in your headline - it stands to reason that prefetches attempting to fill the L1 would require a line fill buffer to hold the retrieved data for that level. This should probably include all L1 prefetches. As for SW prefetches, section 7.4.3 says:

There are cases where a PREFETCH will not perform the data prefetch. These include:

  • PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
  • An access to the specified address that causes a fault/exception.
  • If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.

...

So I assume you're right and SW prefetches are not a way to artificially increase your number of outstanding requests. However, the same explanation applies here as well - if you know how to use SW prefetching to access your lines well enough in advance, you may be able to mitigate some of the access latency and increase your effective BW. This however won't work for long streams for two reasons: 1) your cache capacity is limited (even if the prefetch is temporal, like t0 flavor), and 2) you still need to pay the full L1-->mem latency for each prefetch, so you're just moving your stress ahead a bit - if your data manipulation is faster than memory access, you'll eventually catch up with your SW prefetching. So this only works if you can prefetch all you need well enough in advance, and keep it there.
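To illustrate "well enough in advance": a sketch where a fixed prefetch distance tries to cover the memory latency with the work done on the intervening elements (the distance is an illustrative guess, not a tuned value):

    #include <immintrin.h>
    #include <stddef.h>

    /* Prefetch a fixed number of cache lines ahead of the element being
       processed, so that (hopefully) each line's miss latency is hidden
       behind the work on the intervening elements. */
    enum { PF_DISTANCE_LINES = 40 };   /* illustrative, not tuned */

    double process(const double *data, size_t n)   /* n = number of doubles */
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i % 8 == 0 && i + PF_DISTANCE_LINES * 8 < n)   /* once per 64 B line */
                _mm_prefetch((const char *)&data[i + PF_DISTANCE_LINES * 8], _MM_HINT_T1);
            acc += data[i];                                    /* the actual work */
        }
        return acc;
    }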

Showoff answered 20/10, 2013 at 18:5 Comment(2)
Thanks for the details! I agree that L2 and L3 Hardware Prefetch (HPF) do not use LFB's -- this is what gets us from 8 GB/s to 16 GB/s. I don't know if L1 HPF uses LFB's. Yes, Intel seems to document that NTA and T0 Software Prefetch do go through LFB to load L1. But what about T1 and T2? Seems like they wouldn't, yet benchmarks don't agree. Maybe there is another limiting factor on concurrency? Yes, the goal would be to use SWP to increase bandwidth --- but I haven't managed to get any significant benefit. The key would seem to be hinting a load from Mem to L3/L2 without using LFB's.Merthiolate
Regarding L1 HW prefetch - I see in the same document that one of the conditions for issuing them is that Not many other load misses are in progress. Perhaps this indicates that they also compete on the same LFBs, so this is used as means for stress reduction.Showoff
