I learned a lot from the excellent answers provided by @JamesScriven and @Mystical. However, their examples give only a modest boost - the objective of this answer is to present an (admittedly somewhat artificial) example where prefetching has a bigger impact (about a factor of 4 on my machine).
There are three possible bottlenecks on modern architectures: CPU speed, memory bandwidth and memory latency. Prefetching is all about reducing the latency of memory accesses.
In a perfect scenario, where the latency corresponds to X calculation steps, we would have an oracle that tells us which memory we will access in X calculation steps; the prefetching of this data could be launched and it would arrive just in time, X calculation steps later.
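On GCC and Clang, such a prefetch can be launched explicitly with __builtin_prefetch, which is also used later in this answer. Per the GCC documentation, the two optional arguments are compile-time constants:

//void __builtin_prefetch(const void *addr, rw, locality);
//rw:       0 = prefetch for a read (default), 1 = prefetch for a write
//locality: 0 = no temporal locality, ..., 3 = high temporal locality (default)
__builtin_prefetch(ptr);        //read, keep in all levels of cache
__builtin_prefetch(ptr, 0, 1);  //read, low temporal locality (the variant used below)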
For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks do a very good job here, concealing the latency almost completely.
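To make this concrete, a minimal sketch (the function is illustrative, not from the cited answers): in a sequential scan the address of every future load is a simple function of i, so the hardware prefetcher can run ahead on its own and explicit prefetching buys almost nothing.

#include <vector>
#include <cstddef>

//sequential access: the address of iteration i+X is already known at
//iteration i - the hardware prefetcher hides the latency by itself
double sum_sequential(const std::vector<int> &mem){
    double result=0.0;
    for(std::size_t i=0;i<mem.size();i++)
        result+=mem[i];
    return result;
}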
That is the reason why there is such a modest improvement for @Mystical's example: the prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much bandwidth is left and that could become the limiting factor. I could see at best around 8% improvement on my machine.
The crucial insight from @JamesScriven's example: neither we nor the CPU knows the next access address before the current data is fetched from memory. This dependency is pretty important: otherwise, out-of-order execution could look ahead and the hardware would be able to prefetch the data. However, because we can speculate only one step ahead, there is not that much potential. I was not able to get more than 40% on my machine.
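A minimal sketch of such a dependency chain (illustrative, not @JamesScriven's actual code): the address of the next load becomes known only after the current load has completed, so every iteration pays the full memory latency.

#include <vector>

//pointer-chasing: mem[index] must arrive from memory before the address
//of the next load is known - out-of-order execution cannot run ahead
unsigned int chase(const std::vector<unsigned int> &mem, int steps){
    unsigned int index=0;
    for(int i=0;i<steps;i++)
        index=mem[index];  //next address depends on the value just loaded
    return index;
}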
So let's rig the competition and prepare the data in such a way that we know which address is accessed in X steps, but make it impossible for the hardware to find out, due to dependencies on not-yet-accessed data (see the whole program at the end of the answer):
//making random accesses to memory:
unsigned int next(unsigned int current){
    return (current*10001+328)%SIZE;
}

//the actual work is happening here
void operator()(){
    //set up the oracle - let it see oracle_offset steps into the future:
    unsigned int prefetch_index=0;
    for(int i=0;i<oracle_offset;i++)
        prefetch_index=next(prefetch_index);

    unsigned int index=0;
    for(int i=0;i<STEP_CNT;i++){
        //use oracle and prefetch memory block used in a future iteration
        if(prefetch){
            __builtin_prefetch(mem.data()+prefetch_index,0,1);  //for read, low temporal locality
        }
        //actual work, the less the better
        result+=mem[index];
        //prepare next iteration
        prefetch_index=next(prefetch_index);  //update oracle
        index=next(mem[index]);  //dependency on `mem[index]` is VERY important, it prevents the hardware from predicting the future index
    }
}
Some remarks:
- the data is prepared in such a way that the oracle is always right: because keys[i] = i, we have mem[index] == index, so index=next(mem[index]) follows exactly the same sequence as the oracle's prefetch_index, just oracle_offset steps behind.
- maybe surprisingly, the less CPU-bound the task, the bigger the speed-up: we are able to hide the latency almost completely, thus the speed-up is (CPU-time + original-latency-time) / CPU-time (a small worked example follows this list).
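For illustration, with made-up numbers: if one iteration costs 1 ns of pure CPU work and the dependent memory access adds 3 ns of latency, a run without prefetching needs 4 ns per iteration while a run with perfect prefetching needs only 1 ns, hence a speed-up of (1 ns + 3 ns) / 1 ns = 4.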
Compiling and executing yields:
>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops time no prefetch time prefetch factor
...
7 1.0711102260000001 0.230566831 4.6455521002498408
8 1.0511602149999999 0.22651144600000001 4.6406494398521474
9 1.049024333 0.22841439299999999 4.5926367389641687
....
i.e. a speed-up between 4 and 5.
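As a rough sanity check against the formula above: with STEP_CNT ≈ 10.5 million iterations, the no-prefetch runs at ~1.05 s correspond to ~100 ns per iteration and the prefetch runs at ~0.23 s to ~22 ns, so roughly 80 ns of memory latency per iteration is being hidden.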
Listing of prefetch_demo.cpp:
//prefetch_demo.cpp
#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>

const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;

unsigned int next(unsigned int current){
    return (current*10001+328)%SIZE;
}

template<bool prefetch>
struct Worker{
    std::vector<int> mem;
    double result;
    int oracle_offset;

    void operator()(){
        //set up the oracle:
        unsigned int prefetch_index=0;
        for(int i=0;i<oracle_offset;i++)
            prefetch_index=next(prefetch_index);

        unsigned int index=0;
        for(int i=0;i<STEP_CNT;i++){
            //prefetch memory block used in a future iteration
            if(prefetch){
                __builtin_prefetch(mem.data()+prefetch_index,0,1);
            }
            //actual work:
            result+=mem[index];
            //prepare next iteration
            prefetch_index=next(prefetch_index);
            index=next(mem[index]);
        }
    }

    Worker(std::vector<int> &mem_):
        mem(mem_), result(0.0), oracle_offset(0)
    {}
};

template <typename Worker>
double timeit(Worker &worker){
    auto begin = std::chrono::high_resolution_clock::now();
    worker();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}

int main() {
    //set up the data in a special way, so the oracle is always right:
    std::vector<int> keys(SIZE);
    for (int i=0;i<SIZE;i++){
        keys[i] = i;
    }

    Worker<false> without_prefetch(keys);
    Worker<true> with_prefetch(keys);

    std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
    std::cout<<std::setprecision(17);

    for(int i=0;i<20;i++){
        //let oracle see i steps in the future:
        without_prefetch.oracle_offset=i;
        with_prefetch.oracle_offset=i;
        //calculate:
        double time_with_prefetch=timeit(with_prefetch);
        double time_no_prefetch=timeit(without_prefetch);
        std::cout<<i<<"\t"
                 <<time_no_prefetch<<"\t"
                 <<time_with_prefetch<<"\t"
                 <<(time_no_prefetch/time_with_prefetch)<<"\n";
    }
}