Inspired by SQLite, I'm looking at using Valgrind's "cachegrind" tool to do reproducible performance benchmarking. The counts it reports are much more stable than any other timing method I've found, but they're still not fully deterministic. As an example, here's a simple C program:
int main(void) {
    volatile int x = 0;  /* volatile forces a real load/store each iteration */
    while (x < 1000000) {
        x++;
    }
    return 0;
}
If I compile it and run it under cachegrind, I get the following results:
$ gcc -O2 x.c -o x
$ valgrind --tool=cachegrind ./x
==11949== Cachegrind, a cache and branch-prediction profiler
==11949== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11949== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==11949== Command: ./x
==11949==
--11949-- warning: L3 cache found, using its data for the LL simulation.
==11949==
==11949== I refs: 11,158,333
==11949== I1 misses: 3,565
==11949== LLi misses: 2,611
==11949== I1 miss rate: 0.03%
==11949== LLi miss rate: 0.02%
==11949==
==11949== D refs: 4,116,700 (3,552,970 rd + 563,730 wr)
==11949== D1 misses: 21,119 ( 19,041 rd + 2,078 wr)
==11949== LLd misses: 7,487 ( 6,148 rd + 1,339 wr)
==11949== D1 miss rate: 0.5% ( 0.5% + 0.4% )
==11949== LLd miss rate: 0.2% ( 0.2% + 0.2% )
==11949==
==11949== LL refs: 24,684 ( 22,606 rd + 2,078 wr)
==11949== LL misses: 10,098 ( 8,759 rd + 1,339 wr)
==11949== LL miss rate: 0.1% ( 0.1% + 0.2% )
$ valgrind --tool=cachegrind ./x
==11982== Cachegrind, a cache and branch-prediction profiler
==11982== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==11982== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==11982== Command: ./x
==11982==
--11982-- warning: L3 cache found, using its data for the LL simulation.
==11982==
==11982== I refs: 11,159,225
==11982== I1 misses: 3,611
==11982== LLi misses: 2,611
==11982== I1 miss rate: 0.03%
==11982== LLi miss rate: 0.02%
==11982==
==11982== D refs: 4,117,029 (3,553,176 rd + 563,853 wr)
==11982== D1 misses: 21,174 ( 19,090 rd + 2,084 wr)
==11982== LLd misses: 7,496 ( 6,154 rd + 1,342 wr)
==11982== D1 miss rate: 0.5% ( 0.5% + 0.4% )
==11982== LLd miss rate: 0.2% ( 0.2% + 0.2% )
==11982==
==11982== LL refs: 24,785 ( 22,701 rd + 2,084 wr)
==11982== LL misses: 10,107 ( 8,765 rd + 1,342 wr)
==11982== LL miss rate: 0.1% ( 0.1% + 0.2% )
$
In this case, the "I refs" counts differ by only 0.008% between the two runs, but I'd still like to understand why they differ at all. In more complex programs (tens of milliseconds of runtime), the variation can be larger. Is there any way to make the runs completely reproducible?
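For reference, here is what I've experimented with so far (a sketch of attempts, not a confirmed fix). Cachegrind's --I1/--D1/--LL options pin the simulated cache geometry instead of probing the host, and setarch -R disables address-space layout randomization in case varying load addresses perturb which cache lines get hit. The specific sizes below are just my machine's values, not anything universal:

```shell
# Pin the simulated cache geometry; each option takes size,associativity,line-size.
# These particular numbers are examples from my machine.
valgrind --tool=cachegrind \
    --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 ./x

# Disable ASLR for this one run, in case randomized load addresses
# change how code and data map onto the simulated caches:
setarch "$(uname -m)" -R valgrind --tool=cachegrind ./x
```

Neither change made the counts byte-for-byte identical for me, which is why I'm asking whether full reproducibility is achievable at all.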