Difference between PREFETCH and PREFETCHNTA instructions

The PREFETCHNTA instruction brings data from main memory toward the caches ahead of a demand load, like the other software-prefetch instructions, but instructions with an NT (non-temporal) suffix are supposed to bypass caches and avoid cache pollution.

So what does PREFETCHNTA do differently from the plain PREFETCH instructions?

Shellieshellproof answered 12/11, 2018 at 21:33 Comment(0)

prefetchNTA can't bypass caches, only reduce (not avoid) pollution. It can't break cache coherency or violate the memory-ordering semantics of a WB (Write-Back) memory region. (Unlike NT stores, which do fully bypass caches and are weakly-ordered even on normal WB memory.)

On paper, the x86 ISA doesn't specify how it implements the NT hint. http://felixcloutier.com/x86/PREFETCHh.html says: "NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution." How any specific CPU microarchitecture chooses to implement that is completely up to the architects.


prefetchNTA from WB memory (footnote 1) on Intel CPUs populates L1d normally, allowing later loads to hit in L1d normally (as long as the prefetch distance is large enough that the prefetch completes, and small enough that it isn't evicted again before the demand load). The correct prefetch distance depends on the system and other factors, and can be fairly brittle.

What it does do on Intel CPUs is skip non-inclusive outer caches. So on Intel before Skylake-AVX512, it bypasses L2 and populates L1d + L3. But on SKX it also skips L3 cache entirely because it's smaller and non-inclusive. See Do current x86 architectures support non-temporal loads (from "normal" memory)?

On Intel CPUs with inclusive L3 caches (which it can't bypass), it reduces L3 pollution by being restricted to prefetching into one "way" of the associative inclusive L3 cache. (Which is usually something like 16-way associative, so the total capacity that can be polluted by prefetchnta is only ~1/16th of total L3 size).


@HadiBrais commented on this answer with some info on AMD CPUs.

Instead of limiting pollution by fetching into only one way of the cache, apparently AMD allocates lines fetched with NT prefetch with a "quick eviction" marking. Probably this means allocating in the LRU position instead of the Most-Recently-Used position. So the next allocation in that set of the cache will evict the line.


Footnote 1: prefetchNTA from WC memory I think prefetches into an LFB (Line Fill Buffer), allowing SSE4.1 movntdqa loads to hit an already-populated LFB. (movntdqa loads from WC memory do work by pulling data into an LFB, according to Intel. That's how multiple movntdqa loads on the same "cache line" can avoid multiple actual DRAM reads or PCIe transactions). See also Non-temporal loads and the hardware prefetcher, do they work together? - no, not HW prefetch.

But note that movntdqa from WB memory is not useful. It just works like an ordinary load (plus an ALU uop for some reason).

Farinose answered 12/11, 2018 at 23:3 Comment(25)
Are you sure on SKX the instruction skips the L3? According to Section 7.3.2 of the Intel optimization manual, on Nehalem and later it may/must be fetched into the L3. I think the most important property of prefetchnta is not which cache level it gets fetched into, but rather that it's marked in the cache set for quicker eviction. This applies to most or all Intel and AMD processors that support the instruction. The level it gets fetched into is also important, but that is microarchitecture-dependent across Intel and AMD processors.Infallibilism
Fetching it into the L3 is an intuitive design because the L3 is much larger than the L1, especially when real programs may have low hit rates in the prefetched lines and main-memory latency is very high in modern systems.Infallibilism
According to the AMD optimization manual for the 17h family, Section 2.6.4, prefetchnta fetches the line into the L2 with a quick-eviction marking. But older AMD processors, like some older Intel processors, fetch into the L1.Infallibilism
In general, in any processor with an inclusive L3, it must at least be fetched into the L3. But it could also go into the L1 but not the L2, or the L2 but not the L1.Infallibilism
The third potential property of prefetchnta (other than quick eviction and selective cache level filling) is that the prefetched line may not be written back to another cache level (L3). This behavior is discussed in the AMD manual but not in the Intel manual.Infallibilism
@HadiBrais: my source for SKX behaviour of skipping L3 is Do current x86 architectures support non-temporal loads (from "normal" memory)?. SKX changed the L3 cache: it's smaller and no longer inclusive.Farinose
@PeterCordes, So if my system has a non-inclusive L3 cache but an inclusive L2 cache, prefetchnta would prefetch the data directly into the L2 cache, right?Shellieshellproof
@AbhishekNikam: Do you have an AMD CPU? Intel doesn't make CPUs with an inclusive L2, it's usually non-inclusive non-exclusive.Farinose
No @PeterCordes, that was a hypothetical statement just to make sure that I exactly understand what prefetchnta does.Shellieshellproof
@AbhishekNikam: oh, on paper there's no guarantee exactly what it does. All you get is what felixcloutier.com/x86/PREFETCHh.html says: "NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution." How any specific CPU microarchitecture chooses to implement that is completely up to the architects. AMD made a significantly different choice than Intel, apparently; using "quick eviction" marking instead of limiting to one way. (Probably allocating in the LRU position)Farinose
Regarding fetching into the L1, according to Section 7.3.2, this only happens on non-Xeon processors (like you said, skips the L2 and fetches into the L3 and L1, with fast replacement). But on Xeon processors, the line is only fetched into the L3 (with fast replacement), not in L1 or L2.Infallibilism
Yes, it makes complete sense: prefetchnta should be used when we want to avoid cache misses but will only use the cached data a few times. The implementation is completely vendor-dependent.Shellieshellproof
Doesn't SKX fall under the category of server (Xeon) processors? In that case, I think Bee's answer you linked conflicts with Section 7.3.2. Bee does not seem to be sure (from the wording in the answer) that the L3 is skipped on SKX. Although the manual could be wrong.Infallibilism
From the manual: "Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer microarchitectures: must fetch into 3rd level cache with fast replacement."Infallibilism
@HadiBrais: Yes, SKX = Skylake-SP = Skylake-AVX512. The manual is out of date. Mysticial's experiments on his SKX found that you get an L3 miss if the data is evicted from L1d before you get around to demand-loading it, unlike on earlier CPUs where you'd get an L3 hit. How to properly use prefetch instructions?. That makes perfect sense because SKX's L3 is not inclusive anymore.Farinose
Mysticial's comment doesn't say that the line will not get filled into the L3 on SKX. It just says that if it got evicted before being used, it will be evicted from all cache levels. This seems to suggest that it gets fetched into all cache levels, which doesn't sound right. The most recent version of the manual says that on Xeon processors the line must be filled into the L3 with fast replacement, but remains ambiguous about whether it's filled into the L1 or L2. Yes the L3 is not inclusive, but that doesn't necessarily mean that the manual is wrong or outdated.Infallibilism
Unless I'm missing something, I don't see anyone saying "my experiments confirm that the L3 is skipped on SKX" and I think the manual could be right. BTW, Mysticial's comment would not contradict the manual.Infallibilism
@HadiBrais: On re-reading what I linked, yeah it's not that clear. Maybe there was another discussion somewhere else with more definite results / better evidence for my interpretation of the data that I was remembering. Like maybe Non-temporal loads and the hardware prefetcher, do they work together?. I still think that SKX not allocating in L3 at all is the most likely interpretation of the data, but I agree something else might be possible.Farinose
My understanding is that on all Xeon processors since Nehalem, the line will definitely be fetched into the L3 with fast replacement but may or may not be fetched into the L1 or L2 (the decision could be made dynamically by the way). In fact, Mysticial's comments don't definitively say in which level(s) the line gets filled. Hopefully he will see these comments and clarify.Infallibilism
If you have an SKX we can easily test this as follows. First, disable all prefetchers, perform clflush on a specific cache line, execute an empty loop for about 20,000 iterations, then perform prefetchnta on that same line, then execute the same loop, then use a demand load to access the same line and measure the access latency. This would tell us the nearest cache level in which the line got filled. Then migrate the thread to a different physical core and perform a demand load to the same line and measure that latency. We can compare this latency to the L3 latency and...Infallibilism
...to the cache-to-cache or main memory latencies (the Intel Memory Latency Checker tool can be used to measure these). If it was close to the L3 latency, then the line was filled in the L3. If it was close to the C2C latency or main memory latency, then the line was not in the L3. This test would enable us to determine conclusively if the line was filled into the L3 and/or the L1.Infallibilism
@PeterCordes what does LFB stand for?Messily
@zerocool: Line Fill Buffer. A google search for x86 lfb finds a bunch of useful stuff. But thanks for pointing out that I missed defining the acronym, I do normally try to do that. Updated.Farinose
@PeterCordes do you know of any way to prefetch a code block into the L1 icache?Bohn
@Noah: There are a few ways, but perhaps not useful for overall performance. (But possibly for creating conditions for microbenchmarks as in Bring code into the L1 instruction cache without executing it - see my 2 answers there). The answers on How can I prefetch infrequently used code? can at best get code into L2 cache. (And prime the dTLB, not iTLB.)Farinose
