I decide I want to benchmark a particular function, so I naïvely write code like this:
#include <ctime>
#include <iostream>
int SlowCalculation(int input) { ... }
int main() {
    std::cout << "Benchmark running..." << std::endl;
    std::clock_t start = std::clock();
    int answer = SlowCalculation(42);
    std::clock_t stop = std::clock();
    double delta = (stop - start) * 1.0 / CLOCKS_PER_SEC;
    std::cout << "Benchmark took " << delta << " seconds, and the answer was "
              << answer << '.' << std::endl;
    return 0;
}
A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering. He suggested that the optimizer could, for example, effectively reorder the code like this:
std::clock_t start = std::clock();
std::clock_t stop = std::clock();
int answer = SlowCalculation(42);
At first I was skeptical that such extreme reordering was allowed, but after some research and experimentation, I learned that it was.
But volatile didn't feel like the right solution; isn't volatile really just for memory mapped I/O?
Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms. The disassembly listings for the two versions showed virtually no difference other than a different selection of registers. This makes me wonder if there is another way to avoid the code reordering that doesn't have such overwhelming side effects.
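For reference, here is a minimal sketch of the volatile variant described above. The body of SlowCalculation and the TimeWithVolatile wrapper are hypothetical stand-ins (the real function is not shown), so the timings quoted above should not be expected to reproduce with it.

```cpp
#include <ctime>

// Hypothetical stand-in: the real SlowCalculation body is not shown in the question.
int SlowCalculation(int input) {
    unsigned acc = static_cast<unsigned>(input);
    for (unsigned i = 0; i < 50000000u; ++i) {
        acc = acc * 1664525u + 1013904223u;  // unsigned arithmetic wraps, so this is well defined
    }
    return static_cast<int>(acc >> 1);  // shifted so the value always fits in an int
}

// Hypothetical helper: times one call with both timestamps held in volatile objects.
double TimeWithVolatile(int input, int* answer_out) {
    volatile std::clock_t start = std::clock();
    *answer_out = SlowCalculation(input);
    volatile std::clock_t stop = std::clock();
    // Each volatile object is read back from memory once here.
    return (stop - start) * 1.0 / CLOCKS_PER_SEC;
}
```

Calling TimeWithVolatile(42, &answer) from main and printing the result corresponds to the benchmark at the top, with only the two declarations changed.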
My question is:
What is the best way to prevent code reordering in benchmarking code like this?
My question is similar to this one (which was about using volatile to avoid elision rather than reordering), this one (which didn't answer how to prevent reordering), and this one (which debated whether the issue was code reordering or dead code elimination). While all three are on this exact topic, none actually answer my question.
Update: The answer appears to be that my colleague was mistaken and that reordering like this isn't consistent with the standard. I've upvoted everyone who said so and am awarding the bounty to Maxim.
I've seen one case (based on the code in this question) where Visual Studio 2010 reordered the clock calls as I illustrated (only in 64-bit builds). I'm trying to make a minimal case to illustrate that so that I can file a bug on Microsoft Connect.
For those who said that volatile should be much slower because it forces reads and writes to memory, this isn't quite consistent with the code being emitted. In my answer on this question, I show the disassembly for the code with and without volatile. Inside the loop, everything is kept in registers. The only significant differences appear to be register selection. I do not understand x86 assembly well enough to know why the performance of the non-volatile version is consistently fast while the volatile version is inconsistently (and sometimes dramatically) slower.
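One alternative raised in the comments is a compiler barrier such as asm volatile ("":::"memory");. The following is a rough sketch of how such a barrier might be wrapped around the call, not a definitive fix. It assumes GCC or Clang, since the empty extended asm statement is a compiler extension rather than standard C++. The names Touch and TimeWithBarrier and the body of SlowCalculation are hypothetical, and std::chrono::steady_clock is swapped in for std::clock purely for illustration.

```cpp
#include <chrono>

// Hypothetical stand-in: the real SlowCalculation body is not shown in the question.
int SlowCalculation(int input) {
    unsigned acc = static_cast<unsigned>(input);
    for (unsigned i = 0; i < 50000000u; ++i) {
        acc = acc * 1664525u + 1013904223u;  // unsigned arithmetic wraps, so this is well defined
    }
    return static_cast<int>(acc >> 1);  // shifted so the value always fits in an int
}

// Hypothetical helper: tells the compiler that `value` is read and possibly
// modified here, and that any memory may have changed. Emits no instructions.
inline void Touch(int& value) {
#if defined(__GNUC__) || defined(__clang__)
    asm volatile("" : "+r"(value) : : "memory");
#else
    (void)value;  // assumption: no barrier is provided for other compilers in this sketch
#endif
}

// Hypothetical helper: times one call with barriers on both sides of it.
double TimeWithBarrier(int input, int* answer_out) {
    auto start = std::chrono::steady_clock::now();
    Touch(input);   // the calculation depends on `input` as it leaves this barrier
    int answer = SlowCalculation(input);
    Touch(answer);  // the barrier needs the finished `answer` before it can run
    auto stop = std::chrono::steady_clock::now();
    *answer_out = answer;
    return std::chrono::duration<double>(stop - start).count();
}
```

The idea is that the calculation is tied to the barriers through its input and output, so the compiler has less freedom to move it outside the timed region. This constrains only the compiler, not the CPU, and whether it is sufficient on a given toolchain is something to confirm in the disassembly.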
SlowCalculation has no side effects? – Viburnum
volatile merely means that the memory access may not be optimized away, and it may not be reordered with respect to other observable side-effects of your code (including other volatile accesses). If SlowCalculation has no side effects, then I'm not sure volatile makes this any "safer". – Viburnum
volatile makes a difference. – Cantilena
volatile are treated as CPU I/O operations and are never elided, reordered or speculated. – Glori
asm volatile ("":::"memory"); here? – Errolerroll
volatile is because the CPU is forced to actually read memory when something is tagged as volatile and what you're observing has nothing to do with reordering. – Fimbria