I am working on a project where we have to implement an algorithm that is proven in theory to be cache friendly. In simple terms, if N
is the input and B
is the number of elements that get transferred between the cache and the RAM every time we have a cache miss, the algorithm will require O(N/B)
accesses to the RAM.
I would like to show that this is indeed the behavior in practice. To better understand how one can measure various cache related hardware counters, I decided to use different tools. One is Perf and the other is the PAPI library. Unfortunately, the more I work with these tools, the less I understand what they do exactly.
I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB. The cache line size is 64 bytes. I guess that must be the size of the block B
.
Let's look at the following example:
#include <iostream>
using namespace std;
struct node{
int l, r;
};
int main(int argc, char* argv[]){
int n = 1000000;
node* A = new node[n];
int i;
for(i=0;i<n;i++){
A[i].l = 1;
A[i].r = 4;
}
return 0;
}
Each node requires 8 bytes, which means that a cache line can fit 8 nodes, so I should be expecting approximately 1000000/8 = 125000
L3 cache misses.
Without optimization (no -O3
), this is the output from perf:
perf stat -B -e cache-references,cache-misses ./cachetests
Performance counter stats for './cachetests':
162,813 cache-references
142,247 cache-misses # 87.368 % of all cache refs
0.007163021 seconds time elapsed
It is pretty close to what we are expecting. Now suppose that we use the PAPI library.
#include <iostream>
#include <papi.h>
using namespace std;
struct node{
int l, r;
};
void handle_error(int err){
std::cerr << "PAPI error: " << err << std::endl;
}
int main(int argc, char* argv[]){
int numEvents = 2;
long long values[2];
int events[2] = {PAPI_L3_TCA,PAPI_L3_TCM};
if (PAPI_start_counters(events, numEvents) != PAPI_OK)
handle_error(1);
int n = 1000000;
node* A = new node[n];
int i;
for(i=0;i<n;i++){
A[i].l = 1;
A[i].r = 4;
}
if ( PAPI_stop_counters(values, numEvents) != PAPI_OK)
handle_error(1);
cout<<"L3 accesses: "<<values[0]<<endl;
cout<<"L3 misses: "<<values[1]<<endl;
cout<<"L3 miss/access ratio: "<<(double)values[1]/values[0]<<endl;
return 0;
}
This is the output that I get:
L3 accesses: 3335
L3 misses: 848
L3 miss/access ratio: 0.254273
Why such a big difference between the two tools?