How to use software prefetch systematically?

After reading the accepted answer in When should we use prefetch? and the examples from Prefetching Examples?, I still have trouble understanding when to actually use prefetch. While those answers provide examples where prefetch is helpful, they do not explain how to discover such opportunities in real programs; it looks like random guessing.

In particular, I am interested in the Intel x86 instructions (prefetchnta, prefetcht2, prefetcht1, prefetcht0, prefetchw) that are accessible from C through GCC's __builtin_prefetch built-in. I would like to know:

  • How can I see that software prefetch can help for my specific program? I imagine that I can collect CPU profiling metrics (e.g. the number of cache misses) with Intel VTune or the Linux perf utility. In that case, which metrics (or relations between them) indicate an opportunity to improve performance with software prefetching?
  • How can I locate the loads that suffer the most from cache misses?
  • How can I see the cache level where the misses happen, to decide which prefetch (t0, t1, t2) to use?
  • Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch? As an example, assume that the following loop suffers from cache misses:
for (int i = 0; i < n; i++) {
   // some code
   double x = a[i];
   // some code
}

Should I place the prefetch before or after the load a[i]? How far ahead should it point, i.e. what m should I use in a[i+m]? Do I need to unroll the loop so that I only prefetch on cache line boundaries, or will the prefetch be almost free, like a nop, if the data is already in cache? Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?

Disposed answered 14/5, 2022 at 21:30 Comment(1)
Prefetching sometimes helps when there is enough processing of the fetched data to saturate the CPU pipeline, like a long chain of ALU operations, so that the CPU may not see the upcoming load in its instruction window. Prefetching gives the CPU a hint to load the next cache line in parallel. Don't expect huge gains; something like 10% at most.Parvati

How can I see that software prefetch can help for my specific program?

You can check the proportion of cache misses. perf or VTune can be used to get this information thanks to hardware performance counters. You can get the list of counters with perf list, for example. The list depends on the target processor architecture, but there are some generic events, for example L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. The raw number of cache misses is not very useful unless you also get the number of loads/stores; there are generic counters for that too, like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, for the L2 there are no generic counters (at least on Intel processors) and you need to use architecture-specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get the overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. Good documentation can be found here.

When the proportion of misses is high, you should try to optimize the code, but this is only a hint. In fact, your application can have a high overall hit rate and still suffer many cache misses in the critical parts (or at critical times), so the interesting misses can be lost among all the others. This is especially true for L1 cache references, which are far more numerous in scalar code than in SIMD code. One solution is to profile only specific portions of your application and use your knowledge of it to investigate in the right direction. Performance counters are not really a tool to automatically find issues in your program, but a tool to help you validate or disprove hypotheses and to give hints about what is happening. They give you evidence to solve a mysterious case, but it is up to you, the detective, to do all the work.

How can I locate the loads that suffer the most from cache misses?

Some hardware performance counters are "precise", meaning that the instruction that generated the event can be located. This is very useful since you can tell which instructions are responsible for most of the cache misses (though the attribution is not always exact in practice). You can use perf record + perf report to get this information (see the previous tutorial for more details).

Note that cache misses can have many causes, and only a few of them can be addressed with software prefetching.

How can I see the cache level where the misses happen, to decide which prefetch (t0, t1, t2) to use?

This is often difficult to choose in practice and very dependent on your application. Theoretically, the number is a hint telling the processor the expected level of temporal locality of the target cache line (e.g. whether to fetch it into the L1, L2 or L3 cache). For example, if you know that the data will be read and reused soon, it is a good idea to put it in the L1. However, if the L1 is heavily used and you do not want to pollute it with data used only once (or rarely), it is better to fetch the data into lower cache levels. In practice, it is a bit more complex since the behaviour may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.
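
For reference, here is a minimal sketch of how GCC's built-in exposes these hints. Its signature is __builtin_prefetch(addr, rw, locality); the x86 instructions named in the comments are what GCC typically emits for each locality value, but the exact cache behaviour is microarchitecture-dependent, so check the generated assembly for your target:

#include <stddef.h>

/* Sketch: the third argument of __builtin_prefetch is the temporal-locality
 * hint (0..3); the second one is 0 for a read, 1 for a write. */
void prefetch_hints_demo(const double *reused, const double *streamed, size_t i)
{
    __builtin_prefetch(&reused[i], 0, 3);   /* high temporal locality: usually prefetcht0  */
    __builtin_prefetch(&reused[i], 0, 2);   /* moderate locality:      usually prefetcht1  */
    __builtin_prefetch(&reused[i], 0, 1);   /* low locality:           usually prefetcht2  */
    __builtin_prefetch(&streamed[i], 0, 0); /* use-once data:          usually prefetchnta */
}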

An example of usage can be found in this question: software prefetching was used to avoid a cache-thrashing issue with some specific strides. This is a pathological case where the hardware prefetcher is not very useful.

Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch?

This is clearly the trickiest part. You should issue the prefetch early enough for the latency to be significantly hidden, otherwise the instruction is useless and can actually be detrimental. Indeed, the instruction takes some space in the program, needs to be decoded, and uses a load port that could otherwise execute other (more critical) load instructions. However, if the prefetch is issued too early, the cache line can be evicted again before it is used and will need to be reloaded...

The usual solution is to write code like this:

for (int i = 0; i < n; i++) {
   // some code
   const size_t magic_distance_guess = 200;
   __builtin_prefetch(&a[i + magic_distance_guess]);
   double x = a[i];
   // some code
}

where magic_distance_guess is a value generally set based on benchmarks (or a very deep understanding of the target platform, though practice shows that even highly skilled developers often fail to find the best value).

The problem is that the latency is very dependent on where the data is coming from and on the target platform. In most cases, developers cannot really know exactly when to do the prefetching unless they work on a single given target platform. This makes software prefetching tricky to use and often detrimental when the target platform changes (one also has to consider the maintainability of the code and the overhead of the instruction). Not to mention that the built-ins are compiler-dependent, the prefetching intrinsics are architecture-dependent, and there is no standard, portable way to use software prefetching.

Do I need to unroll the loop so that I only prefetch on cache line boundaries, or will the prefetch be almost free, like a nop, if the data is already in cache?

Yes, prefetch instructions are not free, so it is better to issue only one per cache line (additional prefetches of the same cache line are useless).
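
For example, here is a minimal sketch of that unrolled pattern, assuming 64-byte cache lines and 8-byte doubles; the prefetch distance is a placeholder that would still need tuning on the target machine:

// One prefetch per 64-byte cache line (8 doubles), so the loop is unrolled by 8.
// PREFETCH_DIST is only a guess that must be tuned per platform; on x86,
// prefetching a little past the end of the array is harmless because the
// prefetch instruction does not fault.
#define DOUBLES_PER_LINE 8
#define PREFETCH_DIST (16 * DOUBLES_PER_LINE)

void process(const double *a, int n) {
    int i = 0;
    for (; i + DOUBLES_PER_LINE <= n; i += DOUBLES_PER_LINE) {
        __builtin_prefetch(&a[i + PREFETCH_DIST]);   // one prefetch per line
        for (int j = 0; j < DOUBLES_PER_LINE; j++) {
            double x = a[i + j];
            // some code using x
            (void)x;
        }
    }
    for (; i < n; i++) {   // scalar tail for the remaining elements
        double x = a[i];
        (void)x;
    }
}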

Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?

This is very dependent on the target platform. Modern mainstream x86-64 processors execute instructions out of order and in parallel, and they analyze a pretty large window of instructions. They tend to execute loads as early as possible so as to avoid misses, and they are often very good at that job.

In your example loop, I expect the hardware prefetcher to do a very good job, and software prefetching will likely be slower on a (relatively recent) mainstream processor.


Software prefetching was useful a decade ago, when hardware prefetchers were not very smart, but they tend to be very good nowadays. Additionally, it is often better to guide the hardware prefetchers than to use software prefetch instructions, since the former have a lower overhead. This is why software prefetching is discouraged (e.g. by Intel and most developers) unless you really know what you are doing.

Greenhead answered 14/5, 2022 at 23:5 Comment(7)
HW prefetchers can be disabled via MSRs (for the whole core). Usually only useful for microbenchmark experiments, although perhaps some very specific workloads benefit from selectively disabling some of them. (e.g. possibly the L2 spatial prefetcher if most accesses don't have spatial locality outside the single cache line. Especially if it's causing destructive interference between independent pairs of threads sharing adjacent cache lines.)Antimonic
SW prefetch is usually not terrible if you do it carefully, and can actually be a bit faster if you have the prefetch distance tuned right for the machine you're running on, with its current load conditions. I definitely don't recommend it for looping over a contiguous array, but with an unrolled loop so you're only executing one prefetch per line it's not much overhead. And I've seen a few examples of people claiming a minor speedup like 10% IIRC on modern CPUs like I think Zen, maybe also some Intel.Antimonic
But yeah, the key point here is that SW prefetch is almost never the right answer to lots of cache misses looping over an array. It can at best speed things up a few percent by helping the HW prefetchers start in the next page at page boundaries, or not at all. Or at worst, turns many misses into hits by slowing down your loop enough to not ask for data before it arrives. Much much better is cache-blocking if you can manage it. See What Every Programmer Should Know About Memory?Antimonic
@PeterCordes Indeed for disabling the prefetchers, though it requires root privileges. Good to know. I removed this part in the answer. For SW prefetch, my experience with it is not great. Most of the time it resulted in no performance change or even a slow-down. The very rare case where it was faster (only with NTA so far) required very fine tuning of the code, which was in the end not portable anymore... Not to mention the performance improvement is generally very small (except in very rare pathological cases like the provided one).Imbecilic
Yeah, not something I've tried to play with, since tuning is brittle. But HW prefetchers aren't as magical as we might wish. e.g. they stop at the end of a 4k region of physical addresses. (L2 never even sees virtual addresses, and contiguous virtual might not be contiguous physical). The "next page prefetcher" Intel bragged about being new in IvyBridge is I think just a TLB prefetch, not triggering prefetch of the data. So out-of-order exec of loads early enough is still important to get stuff started across a 4k boundary. Fortunately with AVX doing lots per uop, CPU sees far.Antimonic
@PeterCordes When I saw that Intel developers themselves encourage people not to use SW prefetches but instead to tune the code to work with the HW prefetcher, I mostly stopped trying to use SW prefetching. A good example was 2D matrix transposition, which indeed benefits a lot from cache-blocking (and register tiling). I also tried to check whether the use of SW prefetching in libraries like Numpy and some research projects was justified, but on my (i5-9600KF & older IvyBridge) machines I saw no performance decrease from removing them in the benchmarks made so far.Imbecilic
I've seen GCC use SW prefetch with an AMD tuning, I think, but not Intel. Maybe only Bulldozer-family, though, not Zen. Current GCC 12.1 -march=bdver3 will still use SW prefetchnta 400 bytes ahead (godbolt.org/z/4475Mf7f7), but I don't know how justified that is. -march=znver2 doesn't try to SW prefetch. So yeah, maybe just older AMD that didn't have smart enough HW prefetch.Antimonic

How to use software prefetch systematically?

The quick answer is: don't.

As you correctly analyzed, prefetching is a tricky and advanced optimisation technique that is not portable and rarely useful.

You can use profiling to determine which sections of code form a bottleneck, and use specialized tools such as Valgrind (cachegrind) to try to identify cache misses that could potentially be avoided using compiler built-ins.

Don't expect too much from this, but do profile the code to concentrate your optimization efforts where they can be useful.

Remember also that a better algorithm can beat an optimized implementation of a less efficient one for large datasets.

Flap answered 14/5, 2022 at 21:43 Comment(3)
It's not always a better algorithm. I've seen dramatic performance gains by inverting the nesting order of 2D array index loops.Masjid
@meaning-matters: Doing operations in a different order could still be considered an algorithmic change, even if it's motivated by cache access pattern. By contrast, just adding prefetching is purely a micro-optimization that doesn't need any work to prove correctness is maintained. Loop inversion does, even though in some cases it's something an optimizing compiler can do for you. (But don't count on it; most compilers that look for loop inversion only do it to "break" / "defeat" one of the SPECcpu2006 benchmarks, and may not find it in cases that aren't almost identical.)Antimonic
@PeterCordes Loop-inversion, among other optimisations, is not what people generally mean when they say that one should choose the most optimal algorithm. Your comment could therefore be seen as a micro-optimisation of my statement ;-)Masjid

As mentioned in other answers, managing SW prefetching requires a significant amount of manual effort and is difficult to generalize across different systems and workloads. HW prefetchers on modern CPUs have made significant progress and can recognize many memory access patterns.

Although it is somewhat old, [1] from 2012 extensively discusses HW and SW prefetching, including your questions. The authors claim that SW prefetching is suitable for scenarios such as short arrays and continuous or irregular reads.

Interestingly, there are still many access patterns in modern systems that HW prefetchers cannot recognize well, such as pointer chasing. If the workload is a multi-get (a batch of independent lookups) or the task has some computational latency to overlap with, you can use SW prefetch to hide memory access latency. For example, [2, 3] present a relatively general design that uses coroutines to overlap computation with SW prefetching so as to hide data read latency.
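
As an illustration of the multi-get idea (without coroutines), below is a hypothetical sketch of group prefetching over a batch of independent hash-table lookups. The Table/Key/Value/Node types and the hash() and lookup_in_bucket() functions are placeholders rather than a real API; the point is only that prefetching the whole batch up front lets the misses overlap instead of stalling one lookup at a time:

struct Key   { unsigned long k; };
struct Value { long v; };
struct Node;                                      /* opaque bucket node (placeholder) */
struct Table { const struct Node **buckets; unsigned nbuckets; };

unsigned hash(const struct Key *key);                                           /* placeholder */
struct Value lookup_in_bucket(const struct Node *head, const struct Key *key);  /* placeholder */

enum { BATCH = 8 };

void multi_get(const struct Table *t, const struct Key *keys, struct Value *out, int n) {
    for (int base = 0; base < n; base += BATCH) {
        int end = (base + BATCH < n) ? base + BATCH : n;
        const struct Node *heads[BATCH];

        /* Stage 1: compute every bucket head of the batch and prefetch it. */
        for (int i = base; i < end; i++) {
            heads[i - base] = t->buckets[hash(&keys[i]) % t->nbuckets];
            __builtin_prefetch(heads[i - base], 0, 3);
        }
        /* Stage 2: revisit each bucket; by now it is hopefully already in cache. */
        for (int i = base; i < end; i++)
            out[i] = lookup_in_bucket(heads[i - base], &keys[i]);
    }
}

The coroutine-based designs in [2, 3] achieve a similar overlap without restructuring the code into such explicit stages.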

The advantages of SW prefetching increase further when accessing memory that is not as fast or as bandwidth-rich as DRAM. Additionally, HW prefetchers may even hurt the performance of components other than the cache because of incorrectly prefetched data.


  • [1] Lee, Jaekyu; Kim, Hyesoon; Vuduc, Richard (2012). "When Prefetching Works, When It Doesn't, and Why". ACM Transactions on Architecture and Code Optimization (TACO) 9(1): 1-29.
  • [2] He, Yongjun; Lu, Jiacheng; Wang, Tianzheng (2020). "CoroBase: Coroutine-Oriented Main-Memory Database Engine". Proceedings of the VLDB Endowment 14(3): 431-444.
  • [3] Jonathan, Christopher, et al. (2018). "Exploiting Coroutines to Attack the 'Killer Nanoseconds'". Proceedings of the VLDB Endowment 11(11): 1702-1714.
Wert answered 22/9, 2023 at 4:24 Comment(5)
SW prefetch isn't easily usable in a tree with actual pointers, since nodes can be scattered around. The address of the left grandchildren (for example) isn't reliably computable without fetching the left child and reading its pointer members. So pointer chasing in trees isn't a good example of where SW prefetch helps. That changes with an implicit tree, where all nodes are in an array and the children of node k are at 2*k + 0..1. Then you can simply compute node addresses, and SW prefetch both nodes that might be needed after loading+comparing the left child, for example.Antimonic
@PeterCordes I get your points; there was a problem with my wording. What I meant to say is that pointer chasing can lead to CPU stalls. Therefore, in the multi-get scenario, software prefetching can help hide these delays.Wert
Memory-level parallelism and miss-under-miss caches have been a thing for decades. Especially with out-of-order exec, code that iterates over two linked lists in parallel (p = p->next; q = q->next; inside a loop) has no need for software prefetching; both demand-loads can be outstanding at once.Antimonic
SW prefetch would only help if you didn't iterate the other tree or linked list until after significant processing on this one (or for a single tree where you do a lot of work on each node), and / or don't want to worry about it maybe being NULL if you haven't yet checked other stuff. (But branch prediction and speculative execution normally do a good job of letting later loads get started. And on some microarchitectures, like (older?) x86, SW prefetch of a null pointer is/was expensive, like a TLB miss + page walk, so if it even helps, there's extra cost to amortize over the list length.)Antimonic
To understand how to software prefetch optimally, one would need to understand how the hardware prefetchers work at a level of detail that is not currently available for Intel/AMD/NVIDIA processors. (1) sites.utexas.edu/jdm4372/2023/04/25/…, (2) ixpug.org/images/docs/ISC23/…Anagoge

GCC has a -fprefetch-loop-arrays option, but I wouldn't recommend it for general use, only as an experiment while microbenchmarking a specific loop. (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fprefetch-loop-arrays)

Some -mtune=whatever options may enable that, for CPUs where it's known that the HW prefetchers can use the help (and that front-end bandwidth is high enough to handle the extra throughput cost of running prefetch instructions without usually slowing things down much, especially if data is already hot in L2 or L1d cache.)

There are some --param tuning parameters, like --param prefetch-minimum-stride=n, that could limit it to only generating prefetch instructions when the pointer increment is multiple cache lines or something. (Hardware prefetchers in modern x86 CPUs can handle strided access patterns, although HW prefetchers don't typically work across 4K boundaries, since contiguous virtual pages might not map to contiguous physical pages. Out-of-order exec can generate demand loads into the next page, which is often good enough.)

See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - SW prefetch is often not worth it.

Antimonic answered 22/9, 2023 at 15:24 Comment(0)
