When should we use prefetch?

Some CPUs and compilers supply prefetch instructions, e.g. __builtin_prefetch in GCC. There is a note about it in GCC's documentation, but it's too short to be much help to me.

I want to know: in practice, when should we use prefetch? Are there any examples?

Rogelioroger answered 20/12, 2013 at 5:54 Comment(5)
It's hard to get a boost from manual prefetching due to the existence of hardware prefetch. But here's an example of where it works: https://mcmap.net/q/15370/-prefetching-examples Stratigraphy
The first comment of the first answer there is genuinely reassuring: "no performance difference". :)Consist
@Consist Oh yeah, I have tested the program on a 24-core Xeon CPU: with prefetch enabled it takes 1.31 seconds and with it disabled 1.32 seconds, no noticeable differenceRogelioroger
@oakad, KaiWen, that comment in the linked question was wrong, deducing from a single case that prefetches aren't useful in general is bad logic. I think my example below would have worked for him as well.Acetyl
For a full answer on when prefetching is useful and associated tradeoffs, see my survey paper on cache prefetching techniques.Doable

This question isn't really about compilers as they're just providing some hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats but you can just ignore all these and (carefully) add it directly in assembly code.
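
For illustration (a minimal sketch, not from the original answer; the function name is made up), here is the same prefetch expressed through the usual hooks: the GCC/Clang builtin, the SSE intrinsic, and raw inline assembly. All three should compile down to a prefetcht0 instruction on x86:

#include <xmmintrin.h>                  /* _mm_prefetch, _MM_HINT_T0 */

void touch_soon(const void *p) {
    /* GCC/Clang builtin: rw=0 (read), locality=3 (keep in all cache levels) */
    __builtin_prefetch(p, 0, 3);
    /* SSE intrinsic, also available in MSVC and ICC */
    _mm_prefetch((const char *)p, _MM_HINT_T0);
    /* GCC-style inline assembly */
    __asm__ volatile ("prefetcht0 %0" : : "m"(*(const char *)p));
}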

Now the real question seems to be "when are prefetches useful?", and the answer is: in any scenario where you're bound on memory latency and the access pattern isn't regular and distinguishable enough for the HW prefetcher to capture it (i.e., it isn't organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.
Most compilers only very seldom insert their own prefetches for you, so it's basically up to you to play with your code and benchmark how prefetches could be useful.
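
A classic case (my own sketch, with made-up names and an arbitrary prefetch distance) is an indexed gather: the index array is read sequentially, so the HW prefetcher handles it fine, but the data accesses jump around unpredictably from its point of view, while software knows the target addresses several iterations in advance:

#define PF_DIST 8   /* tuning knob: how many iterations ahead to prefetch */

long sum_indexed(const int *data, const int *idx, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&data[idx[i + PF_DIST]], 0, 1);  /* read, low temporal locality */
        sum += data[idx[i]];
    }
    return sum;
}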

The link by @Mysticial shows a nice example, but here's a more straightforward one that I think can't be caught by the HW prefetcher:

#include "stdio.h"
#include "sys/timeb.h"
#include "emmintrin.h"

#define N 4096
#define REP 200
#define ELEM int

int main() {
    int i,j, k, b;
    const int blksize = 64 / sizeof(ELEM);
    ELEM __attribute ((aligned(4096))) a[N][N];
    for (i = 0; i < N; ++i) {
        for (j = 0; j < N; ++j) {
            a[i][j] = 1;
        }
    }
    unsigned long long int sum = 0;
    struct timeb start, end;
    unsigned long long delta;

    ftime(&start);
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j ++) {
                sum += a[i][j];
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf ("Prefetching off: N=%d, sum=%lld, time=%lld\n", N, sum, delta); 

    ftime(&start);
    sum = 0;
    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; j += blksize) {
                for (b = 0; b < blksize; ++b) {
                    sum += a[i][j+b];
                }
                _mm_prefetch(&a[i+1][j], _MM_HINT_T2);
            }
        }
    }
    ftime(&end);
    delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
    printf ("Prefetching on:  N=%d, sum=%lld, time=%lld\n", N, sum, delta); 
}

What I do here is traverse each matrix line (enjoying the HW prefetcher's help with the consecutive accesses), but prefetch ahead the element with the same column index from the next line, which resides in a different page (and which the HW prefetcher should be hard pressed to catch). I sum the data just so that it's not optimized away; the important thing is that I basically just loop over a matrix, which should have been pretty straightforward and simple to detect, and yet I still get a speedup.

Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:

Prefetching off: N=4096, sum=3355443200, time=1839
Prefetching on:  N=4096, sum=3355443200, time=1502

Note that the speedup is obtained even though I made the control flow more complicated (an extra loop nesting level); the branch predictor should easily catch the pattern of that short block-size loop, and the blocking saves the execution of unneeded prefetches.

Note that Ivy Bridge and onward should have a "next-page prefetcher", so the HW may be able to mitigate that on those CPUs (if anyone has one available and cares to try, I'd be happy to know). In that case I'd modify the benchmark to sum every second line (and the prefetch would look ahead two lines every time); that should confuse the hell out of the HW prefetchers.
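
A minimal sketch of that variant (untested, reusing the variables from the program above): sum only every second row so that consecutive visited rows are two pages apart, and prefetch two rows ahead:

    for (k = 0; k < REP; ++k) {
        for (i = 0; i < N; i += 2) {
            for (j = 0; j < N; j += blksize) {
                for (b = 0; b < blksize; ++b) {
                    sum += a[i][j+b];
                }
                /* look two rows (two 4 KiB pages) ahead; the last iteration
                   prefetches past the array, which is harmless */
                _mm_prefetch((const char *)&a[i+2][j], _MM_HINT_T2);
            }
        }
    }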

Skylake results

Here are some results from a Skylake i7-6700HQ, running at 2.6 GHz (no turbo) with gcc:

Compile flags: -O3 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=896
Prefetching on:  N=4096, sum=28147495993344000, time=1222
Prefetching off: N=4096, sum=28147495993344000, time=886
Prefetching on:  N=4096, sum=28147495993344000, time=1291
Prefetching off: N=4096, sum=28147495993344000, time=890
Prefetching on:  N=4096, sum=28147495993344000, time=1234
Prefetching off: N=4096, sum=28147495993344000, time=848
Prefetching on:  N=4096, sum=28147495993344000, time=1220
Prefetching off: N=4096, sum=28147495993344000, time=852
Prefetching on:  N=4096, sum=28147495993344000, time=1253

Compile flags: -O2 -march=native

Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1813
Prefetching off: N=4096, sum=28147495993344000, time=1956
Prefetching on:  N=4096, sum=28147495993344000, time=1814
Prefetching off: N=4096, sum=28147495993344000, time=1955
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1961
Prefetching on:  N=4096, sum=28147495993344000, time=1811
Prefetching off: N=4096, sum=28147495993344000, time=1965
Prefetching on:  N=4096, sum=28147495993344000, time=1814

So using prefetch is either about 40% slower or 8% faster, depending on whether you use -O3 or -O2 respectively, for this particular example. The big slowdown with -O3 is actually due to a code-generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch-variant loop prevents vectorization on my version of gcc.

So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. Still, it's worth noting that you have to be careful not to change code generation in a way that produces a big slowdown.

This test probably isn't ideal in that going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (which is why vectorization helped so much).


Acetyl answered 24/12, 2013 at 9:57 Comment(7)
You use b while it is uninitialized in the line sum += a[i][j+b]; in the first loop (non-prefetch). Probably that line should just read sum += a[i][j]; since you aren't blocking anything in that loop.Chamfer
I added some Skylake results.Chamfer
Thanks. You're right about the int overheads; the benefit can be stretched further if we look only at one element per cache line (it's not a realistic scenario, but a good proxy for what vectorization might give).Acetyl
I tried it out (just touching each cache line) and found a weird interaction with powersaving that I put in an answer below.Chamfer
I don't really know C well, but your code gives me a segfault at line 15 when compiled with gcc 12.2.0Passacaglia
@jberryman, I'm not sure this alignment attribute is standard and I also saw it doesn't always work well (for example it's not guaranteed on mallocs). It's just meant as an example - there are other ways to ensure alignment when allocating an array - you can make it 4096 bytes bigger, and start it from ptr = ((ptr+4095) & ~0xfffULL)Acetyl
I am under the impression that very often HW prefetchers on their own (ie without hinting the HW to prefetch) do their job very poorly. Here is an example on arm cortex A53 I tested 5 minutes ago: pastebin.com/YXUs2raC . As you can see this is a purely linear traversal. The prefetcher should be able to predict all accesses, yet it fails consistently at (almost) every cache line end! Corresponding output with measurements: pastebin.com/3Rr5QJ4R compiled with -O0 so that the compiler does not interfere and we only rely on the HW. TLDR; add the hints or compile with O3.Caparison

On recent Intel chips one reason you apparently might want to use prefetching is to avoid CPU power-saving features artificially limiting your achieved memory bandwidth. In this scenario, simple prefetching can as much as double your performance versus the same code without prefetching, but it depends entirely on the selected power management plan.

I ran a simplified version (code here) of the test in Leeor's answer, which stresses the memory subsystem a bit more (since that's where prefetch will help, hurt or do nothing). The original test stressed the CPU in parallel with the memory subsystem, since it added together every int on each cache line. Since typical memory read bandwidth is in the region of 15 GB/s, that's 3.75 billion integers per second, which puts a pretty hard cap on the maximum speed (code that isn't vectorized will usually process 1 int or less per cycle, so a 3.75 GHz CPU will be about equally CPU and memory bound).
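
The actual code is behind the link above; its inner loops are roughly of this shape (my paraphrase in terms of the matrix from Leeor's answer), touching only the first int of every 64-byte cache line instead of summing all of them:

    /* "off" variant: one load per cache line */
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; j += blksize)
            sum += a[i][j];

    /* "on" variant: same, plus a prefetch of the matching line in the next row */
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; j += blksize) {
            sum += a[i][j];
            _mm_prefetch((const char *)&a[i+1][j], _MM_HINT_T2);
        }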

First, I got results that seemed to show prefetching kicking butt on my i7-6700HQ (Skylake):

Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=221, MiB/s=11583
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=204, MiB/s=12549
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=200, MiB/s=12800
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=160, MiB/s=16000
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=201, MiB/s=12736
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=197, MiB/s=12994
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305

Eyeballing the numbers, prefetch achieves something a bit above 16 GiB/s and without it only about 12.5 GiB/s, so prefetch is increasing speed by about 30%. Right?

Not so fast. Remembering that the powersaving mode has all sorts of wonderful interactions on modern chips, I changed my Linux CPU governor to performance from the default of powersave1. Now I get:

Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=155, MiB/s=16516
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=157, MiB/s=16305
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=144, MiB/s=17777
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=152, MiB/s=16842
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=153, MiB/s=16732
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=159, MiB/s=16100
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=163, MiB/s=15705
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=161, MiB/s=15900

It's a total toss-up. Both with and without prefetching seem to perform identically. So either hardware prefetching is less aggressive in the high powersaving modes, or there is some other interaction with power saving that behaves differently with the explicit software prefetches.

Investigation

In fact, the difference between prefetching and not is even more extreme if you change the benchmark. The existing benchmark alternates between runs with prefetching on and off, and it turns out that this helped the "off" variant because the speed increase which occurs in the "on" test partly carries over to the subsequent "off" test2. If you run only the "off" test you get results around 9 GiB/s:

Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=280, MiB/s=9142
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=277, MiB/s=9241
Prefetching off: SIZE=256 MiB, sum=1407374589952000, time=285, MiB/s=8982

... versus about 17 GiB/s for the prefetching version:

Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=149, MiB/s=17181
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297
Prefetching  on: SIZE=256 MiB, sum=1407374589952000, time=148, MiB/s=17297

So the prefetching version is almost twice as fast.

Let's take a look at what's going on with perf stat for both. First the off version:

Performance counter stats for './prefetch-test off':

   2907.485684      task-clock (msec)         #    1.000 CPUs utilized                                          
 3,197,503,204      cycles                    #    1.100 GHz                    
 2,158,244,139      instructions              #    0.67  insns per cycle        
   429,993,704      branches                  #  147.892 M/sec                  
        10,956      branch-misses             #    0.00% of all branches     

... and the on version:

   1502.321989      task-clock (msec)         #    1.000 CPUs utilized                          
 3,896,143,464      cycles                    #    2.593 GHz                    
 2,576,880,294      instructions              #    0.66  insns per cycle        
   429,853,720      branches                  #  286.126 M/sec                  
        11,444      branch-misses             #    0.00% of all branches

The difference is that the version with prefetching on consistently runs at the max non-turbo frequency of ~2.6 GHz (I have disabled turbo via an MSR). The version without prefetching, however, has decided to run at a much lower speed of 1.1 GHz. Such large CPU frequency differences often also reflect a large difference in uncore frequency, which can explain the worse bandwidth.

Now we've seen this before, and it is probably an outcome of the Energy Efficient Turbo feature on recent Intel chips, which tries to ramp down the CPU frequency when it determines that a process is mostly memory bound, presumably because increased CPU core speed doesn't provide much benefit in those cases. As we can see here, this assumption isn't always true, but it isn't clear to me whether the tradeoff is a bad one in general, or whether the heuristic only occasionally gets it wrong.


1 I'm running the intel_pstate driver, which is the default for Intel chips on recent kernels which implements "hardware p-states", also known as "HWP". Command used: sudo cpupower -c 0,1,2,3 frequency-set -g performance.

2 Conversely, the slowdown from the "off" test partly carries over into the "on" test, although the effect is less extreme, possibly because the powersaving "ramp up" behavior is faster than "ramp down".

Chamfer answered 19/7, 2017 at 21:52 Comment(10)
I've noticed similar but much less pronounced effects in microbenchmarks of real code on SKL with the Linux default balance_power EPP setting. It seems to ramp down the core clock speed on very "memory bound" code, but that can hurt bandwidth even more. Running a pause loop on another core is another way to keep the clocks up.Venavenable
Also, you can use sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done' to tweak the EPP setting directly (for all CPUs). (see patchwork.kernel.org/patch/9723429 for the details on Skylake's Energy-Performance Preference). I assume cpupower is probably fine, but I came across the sysfs stuff first while trying to figure out why my 4.0GHz SKL was only turboing to 3.9GHz.Venavenable
With HWP, I think EPP is all there is. Or does the performance governor disable HWP?Venavenable
I'm pretty sure both use HWP, but my understanding was that they are quite different beyond EPB and EPP. For example, the governor can set the min freq even for HWP and also adjust the C-state behavior. My current kernel only knows about EPB IIRC, but maybe I can compile the new x86_energy_perf_policy tool and see. Mobile now but will try tomorrow.Chamfer
What did you find out about 3.9 vs 4.0? I noticed also my SKL only turbo'd to 3.4 not 3.5 when running with acpi_cpufreq instead of intel_pstate. Like only pstate knows the trick to enable the highest p-state or perhaps the highest p-state simply isn't available without HWP.Chamfer
I found out that it turboed properly (to the bios-configured 4.4GHz) right after bootup, but after idling for a couple minutes it capped at 3.9. Changing the EPP to balance_performance fixed it: 4.4GHz. Going back to balance_power still had it working, but after a couple minutes idle in balance_power mode, it would be capped at 3.9 again. I assume that's a Linux bug. This was with Linux 4.10 and 4.11 (Arch Linux)Venavenable
@PeterCordes - I don't have /sys/.../cpufreq/policy*/energy_performance_preference. I'm on 4.4.0, so perhaps it hasn't been introduced yet. On my kernel, I think the way to set the perf preference is cpupower --perf-bias VALUE, but as far as I know this only sets the EPB, not the EPP. Usually Intel makes stuff backwards compatible, but the wording in this case makes it seem otherwise. Changing the governor between performance and powersave didn't change the reported "perf-bias" value (it seems to be at 15 at boot), and manually changing the value had no effect on performance.Chamfer
Another interesting fact is that when I do a run of 10 iterations of the test loop from idle on powersave, the performance generally decreases in a typical pattern starting from the first iteration, e.g., (in MiB/s): 11689, 10240, 10322, 10406, 10199, 10158, 9770, 9552, 9481, 9208. I interpret it has powersave mode initially ramping up to a higher CPU value, assuming that the burst of activity is "interactive" and you get a lot of value from finishing the job quickly in terms of UI latency, etc, but once it runs for a while it decides that it's batch and ramps down.Chamfer
... or that could be a feature outside of intel_pstate in the generic kernel itself, but which communicates the expectations to the pstate driverChamfer
@PeterCordes The p-state power stuff mentioned here is exactly the problem I ran into on Skylake X with Ubuntu 17.04. For some of my work-loads the cores would stay at 700 MHz - 1.2 GHz and refuse to clock up to the full 4.5. Initially I thought this was related to the phantom throttling BS since that was under investigation at the same time. Disabling the p-states solved everything. Ubuntu 17.10 seems to work fine out of the box.Stratigraphy

Here's a brief summary of cases that I'm aware of in which software prefetching may prove especially useful. Some may not apply to all hardware.

This list should be read from the point of view that the most obvious place software prefetches could be used is where the stream of accesses can be predicted in software; yet even this isn't necessarily such an obvious win for SW prefetch, because out-of-order processing often ends up having a similar effect, since it can execute behind existing misses in order to get more misses in flight.

So this list is more of a "in light of the fact that SW prefetch isn't as obviously useful as it might first seem, here are some places it might still be useful anyway", often compared to the alternative of either just letting out-of-order processing do its thing, or just using "plain loads" to load some values before they are needed.

Fitting more loads in the out-of-order window

Although out-of-order processing can potentially expose the same type of MLP (Memory-Level Parallelism) as software prefetches, there are limits inherent to the total possible lookahead distance after a cache miss. These include reorder-buffer capacity, load buffer capacity, scheduler capacity and so on. See this blog post for an example of where extra work seriously hinders MLP because the CPU can't run ahead far enough to get enough loads executing at once.

In this case, software prefetch allows you to effectively stuff more loads earlier in the instruction stream. As an example, imagine you have a loop which performs one load and then 20 instructions worth of work on the loaded data, that your CPU has an out-of-order buffer of 100 instructions, and that the loads are independent from each other (e.g., accessing an array with a known stride).

After the first miss, you can run ahead 99 more instructions, which will be composed of 95 non-load and 5 load instructions (including the first load). So your MLP is inherently limited to 5 by the size of the out-of-order buffer. If instead you paired every load with a software prefetch to a location, say, 6 or more iterations ahead, you'd end up instead with 90 non-load instructions, 5 loads and 5 software prefetches, and since all of those accesses miss independently, you've just about doubled your MLP to 10².
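
To make the shape of that loop concrete, here is a sketch (the names, the stride and the prefetch distance are all invented for the example):

extern long heavy_work(long);          /* placeholder for the ~20 instructions of work */
#define PF_AHEAD 6                     /* one prefetch per load, ~6 iterations ahead   */

long process(const long *arr, long n, long stride) {
    long acc = 0;
    for (long i = 0; i < n; ++i) {
        if (i + PF_AHEAD < n)
            __builtin_prefetch(&arr[(i + PF_AHEAD) * stride], 0, 1);
        acc += heavy_work(arr[i * stride]);
    }
    return acc;
}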

There is of course no limit of one additional prefetch per load: you could add more to hit higher numbers, but there is a point of diminishing and then negative returns as you hit the MLP limits of your machine and the prefetches take up resources you'd rather spend on other things.

This is similar to software pipelining, where you load data for a future iteration and then don't touch that register until after a significant amount of other work. This was mostly used on in-order machines to hide the latency of computation as well as memory. Even on a RISC with 32 architectural registers, software pipelining typically can't place the loads as far ahead of use as an optimal prefetch distance on a modern machine; the amount of work a CPU can do during one memory latency has grown a lot since the early days of in-order RISCs.

In-order machines

Not all machines are big out-of-order cores: in-order CPUs are still common in some places (especially outside x86), and you'll also find "weak" out-of-order cores that don't have the capability to run ahead very far and so partly act like in-order machines.

On these machines, software prefetches may help you gain MLP that you wouldn't otherwise be able to access (of course, an in-order machine probably doesn't support a lot of inherent MLP otherwise).

Working around hardware prefetch restrictions

Hardware prefetch may have restrictions which you could work around using software prefetch.

For example, Leeor's answer has an example of hardware prefetch stopping at page boundaries, while software prefetch doesn't have any such restriction.

Another example might be any time that hardware prefetch is too aggressive or too conservative (after all it has to guess at your intentions): you might use software prefetch instead since you know exactly how your application will behave.

Examples of the latter include prefetching discontiguous areas, such as the rows of a sub-matrix of a larger matrix: hardware prefetch won't understand the boundaries of the "rectangular" region, will constantly prefetch beyond the end of each row, and will then take a bit of time to pick up the new row pattern. Software prefetching can get this exactly right, never issuing any useless prefetches at all (but it often requires ugly splitting of loops).
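
A hedged sketch of what that might look like (the names are invented, and a real version would issue additional prefetches when the tile is wider than one cache line):

double sum_tile(const double *m, int ld, int rows, int cols) {
    /* m points at the top-left element of the tile; ld is the leading
       dimension (row stride) of the full matrix, in elements */
    double s = 0.0;
    for (int r = 0; r < rows; ++r) {
        if (r + 1 < rows)
            __builtin_prefetch(&m[(r + 1) * ld], 0, 1);   /* start of the next tile row */
        for (int c = 0; c < cols; ++c)
            s += m[r * ld + c];
    }
    return s;
}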

If you do enough software prefetches, the hardware prefetchers should in theory mostly shut down, because the activity of the memory subsystem is one heuristic they use to decide whether to activate.

Counterpoint

I should note here that software prefetching is not equivalent to hardware prefetching when it comes to possible speedups for cases the hardware prefetching can pick up: hardware prefetching can be considerably faster. That is because hardware prefetching can start working closer to memory (e.g., from the L2) where it has a lower latency to memory and also access to more buffers (in the so-called "superqueue" on Intel chips) and so more concurrency. So if you turn off hardware prefetching and try to implement a memcpy or some other streaming load with pure software prefetching, you'll find that it is likely slower.

Special load hints

Prefetching may give you access to special hints that you can't achieve with regular loads. For example x86 has the prefetchnta, prefetcht0, prefetcht1, and prefetchw instructions which hint to the processor how to treat the loaded data in the caching subsystem. You can't achieve the same effect with plain loads (at least on x86).
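
For example (illustrative only; the exact cache levels are implementation-defined, and on current Intel CPUs T1 and T2 behave much the same):

#include <xmmintrin.h>

void hint_examples(const char *p) {
    _mm_prefetch(p, _MM_HINT_T0);    /* temporal data: bring into all cache levels */
    _mm_prefetch(p, _MM_HINT_T1);    /* roughly: into L2/L3, bypassing L1          */
    _mm_prefetch(p, _MM_HINT_T2);    /* roughly: into the last-level cache         */
    _mm_prefetch(p, _MM_HINT_NTA);   /* non-temporal: minimize cache pollution     */
}

(With GCC, the write-intent hint prefetchw can be reached via __builtin_prefetch(p, 1, 3) on targets that support it.)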


2 It's not actually as simple as just adding a single prefetch to the loop, since after the first five iterations, the loads will start hitting already prefetched values, reducing your MLP back to 5 - but the idea still holds. A real implementation would also involve reorganizing the loop so that the MLP can be sustained (e.g., "jamming" the loads and prefetches together every few iterations).

Chamfer answered 19/8, 2018 at 22:13 Comment(0)

There are definitely situations where software prefetch provides significant performance improvements.

For example, if you are accessing a relatively slow memory device such as Optane DC Persistent Memory, which has access times of several hundred nanoseconds, prefetching can reduce effective latency by 50 percent or more if you can do it far enough in advance of the read or write.

This isn't a very common case at present but it will become a lot more common if and when such storage devices become mainstream.

Lactiferous answered 21/8, 2020 at 15:19 Comment(0)

The article 'What Every Programmer Should Know About Memory' by Ulrich Drepper discusses situations where prefetching is advantageous: http://www.akkadia.org/drepper/cpumemory.pdf. Warning: this is quite a long article that discusses things like memory architecture, how the CPU works, etc.

Prefetching gives you something if the data is aligned to cache lines and if you are loading data that is about to be accessed by the algorithm.
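
For instance (a sketch of mine, assuming 64-byte cache lines; aligned_alloc is C11):

#include <stdlib.h>

#define CACHE_LINE 64

/* Allocate a buffer whose start is cache-line aligned, so that a prefetch of
   an element's line pulls in whole records; aligned_alloc requires the size
   to be a multiple of the alignment, hence the rounding. */
void *make_line_aligned(size_t bytes) {
    size_t rounded = (bytes + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}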

In any event, one should only do this when trying to optimize heavily used code; benchmarking is a must, and things usually work out differently than one might expect.

Thiourea answered 25/12, 2013 at 11:27 Comment(4)
That paper was written when Pentium 4 was current. Hardware prefetch in more recent CPUs is much better, so SW prefetch is not usually a good idea anymore. That paper is still excellent, but just keep in mind that the SW prefetching advice is for P4 and mostly doesn't apply anymore.Venavenable
@PeterCordes i wouldn't be so sure about that.- for example the linux kernel still has a lot of prefetch calls , elixir.free-electrons.com/linux/latest/ident/prefetchThiourea
Right, prefetch can still help for non-sequential access. Where Drepper's article is really out of date for modern CPUs is the prefetch-thread suggestion. P4 didn't have big enough caches (especially trace cache) or other features to really benefit from hyperthreading with two different threads, but apparently a prefetch thread was worth it for looping over an array. With smarter HW prefetch, you generally shouldn't SW prefetch for sequential access (except maybe pfNTA). I hadn't realized Linux used so much prefetch; I see some in a btree function; trees are still a good use-case.Venavenable
It wasn't my downvote; I gave you an upvote to bring this answer back to 0. But your point about aligned data doesn't really make sense. Prefetching the cache line that contains your data is good whether or not it's aligned. If you meant that it's potentially less useful when a struct crosses a cache-line boundary or something, you should say that. Natural alignment (e.g. 16B for a 16B struct) is sufficient to guarantee that won't happen; you don't need your structs to be 64B aligned. (And even so, prefetching one line helps. The adjacent-line HW prefetcher may even get the other for you)Venavenable

It seems that the best policy is to never use __builtin_prefetch (and its friend, __builtin_expect) at all. On some platforms they may help (and even help a lot), but one must always do some benchmarking to confirm it. The real question is whether the short-term performance gains are worth the trouble in the longer run.

First, one may ask the following question: what do these statements actually do when fed to a higher-end modern CPU? The answer is: nobody really knows (except, maybe, a few people on the CPU's core architecture team, but they are not going to tell anybody). Modern CPUs are very complex machines, capable of instruction reordering, speculative execution of instructions across possibly-not-taken branches, etc. Moreover, the details of this complex behavior may (and will) differ considerably between CPU generations and vendors (Intel Core vs Intel i* vs AMD Opteron; with more fragmented platforms like ARM the situation is even worse).

One neat example (not prefetch related, but still) of CPU functionality which used to speed things up on older Intel CPUs but hurts badly on more modern ones is outlined here: http://lists-archives.com/git/744742-git-gc-speed-it-up-by-18-via-faster-hash-comparisons.html. In that particular case, it was possible to achieve an 18% performance increase by replacing the optimized gcc-supplied memcmp with an explicit ("naive", so to say) loop.
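
For reference, the Linux-kernel-style wrappers around __builtin_expect that the comments below mention look like this (my own minimal example):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse_header(const char *s) {
    if (unlikely(!s))      /* error path, expected to be rare */
        return -1;
    /* ... hot path ... */
    return 0;
}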

Consist answered 20/12, 2013 at 6:33 Comment(7)
Completely wrong position about __builtin_expect. Marking as expected those branches in your code that REALLY are expected helps the compiler a lot in high-level code optimizations, and it is not CPU-dependent at all. I agree on __builtin_prefetch: one must not use it unless one knows what one is doing.Fendley
Which compiler? Some compilers have __builtin_expect coded as nop. :)Consist
@Consist Nice post, though it's not about prefetch. I have read it.Rogelioroger
@KonstantinVladimirov Yeah, it seems likely and unlikely are useful; they are in the Linux kernel. Maybe we can get some benefit on some platforms.Rogelioroger
Critical points to consider: 1. the optimizer will still apply static branch-prediction heuristics whether the hint is present or not. 2. the CPU branch prediction unit will still be working, whether the hint is supplied or not. Thus, on a decent CPU with a decent optimizing compiler, not much can be expected from hints.Consist
@oakad, builtin_expect does not correspond to any code. It is only a hint to the compiler that some branch is expected to execute in most cases (say, the regular vs. error-handling branches of an if clause). It is never a nop, of course. And I am talking about GCC. The part of GCC that handles builtin_expect is architecture-independent and known as the basic block reordering pass (plus some other passes in the middle end).Fendley
This answer is basically wrong; some details are published (e.g. in Intel's optimization manual). But the basic message is correct: software prefetching has to be tuned for the uarch, and the same code might run worse than with no prefetching on some future uarch. Part of this is that it's not just whether to use them or not; it's that the optimal prefetch distance is a magic number (in bytes, between where you load and where you prefetch). See Linus Torvalds' comments on it.Venavenable
