benchmarking, code reordering, volatile

I decide I want to benchmark a particular function, so I naïvely write code like this:

#include <ctime>
#include <iostream>

int SlowCalculation(int input) { ... }

int main() {
    std::cout << "Benchmark running..." << std::endl;
    std::clock_t start = std::clock();
    int answer = SlowCalculation(42);
    std::clock_t stop = std::clock();
    double delta = (stop - start) * 1.0 / CLOCKS_PER_SEC;
    std::cout << "Benchmark took " << delta << " seconds, and the answer was "
              << answer << '.' << std::endl;
    return 0;
}

A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering. He suggested that the optimizer could, for example, effectively reorder the code like this:

    std::clock_t start = std::clock();
    std::clock_t stop = std::clock();
    int answer = SlowCalculation(42);

At first I was skeptical that such extreme reordering was allowed, but after some research and experimentation, I learned that it was.

But volatile didn't feel like the right solution; isn't volatile really just for memory mapped I/O?

Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms. The disassembly listings for the two versions showed virtually no difference other than a different selection of registers. This makes me wonder if there is another way to avoid the code reordering that doesn't have such overwhelming side effects.

My question is:

What is the best way to prevent code reordering in benchmarking code like this?

My question is similar to this one (which was about using volatile to avoid elision rather than reordering), this one (which didn't answer how to prevent reordering), and this one (which debated whether the issue was code reordering or dead code elimination). While all three are on this exact topic, none actually answer my question.

Update: The answer appears to be that my colleague was mistaken and that reordering like this isn't consistent with the standard. I've upvoted everyone who said so and am awarding the bounty to Maxim.

I've seen one case (based on the code in this question) where Visual Studio 2010 reordered the clock calls as I illustrated (only in 64-bit builds). I'm trying to make a minimal case to illustrate that so that I can file a bug on Microsoft Connect.

For those who said that volatile should be much slower because it forces reads and writes to memory, this isn't quite consistent with the code being emitted. In my answer on this question, I show the disassembly for the code with and without volatile. Inside the loop, everything is kept in registers. The only significant differences appear to be register selection. I do not understand x86 assembly well enough to know why the performance of the non-volatile version is consistently fast while the volatile version is inconsistently (and sometimes dramatically) slower.

Electrician answered 23/2, 2013 at 14:21 Comment(17)
@juanchopanza: What if the compiler knows that SlowCalculation has no side effects?Viburnum
@OliCharlesworth good point. I have to do some thinking now.Equi
volatile merely means that the memory access may not be optimized away, and it may not be reordered with respect to other observable side-effects of your code (including other volatile accesses). If SlowCalculation has no side effects, then I'm not sure volatile makes this any "safer".Viburnum
You should be able to look at the assembly and see if the volatile makes a difference.Cantilena
volatile has nothing to do with reordering in ISO C++, but it is different in MSVC. In MSVC, without special flags, it does prevent reordering.Declivous
Memory operations with volatile are treated as CPU I/O operations and are never elided, reordered or speculated.Glori
Um, use a real profiler if possible? :)Wundt
Is there some reason not to just use the usual asm volatile ("":::"memory"); here?Errolerroll
Most likely the slowdown with volatile is because the CPU is forced to actually read memory when something is tagged as volatile, and what you're observing has nothing to do with reordering.Fimbria
I'm with @MichaelDorgan - why not use a real profiler?Ulbricht
a compiler that reorders such code is brokenAcapulco
@Kerrek SB: As I stated in the question, I did compare the disassembly with and without volatile. Since then, I've also tried a 64-bit build, and, with 64-bit, the compiler does in fact reorder the second clock call before the slow calculation. Several people have suggested that's a compiler bug.Electrician
@JackAidley: I would expect it to be slower with volatile, but I wouldn't expect it to be that much slower (8x), nor would I expect it to be as wildly inconsistent as it is.Electrician
@AdrianMcCarthy: I'm not remotely surprised by slowdowns of that magnitude. You're forcing it to effectively have cache misses each time so the processor becomes strictly bound to memory speed.Fimbria
answer should also be volatile.Lcm
Related: Avoid optimizing away variable with inline asm for use inside tight loops.Landholder
@JackAidley "You're forcing it to effectively have cache misses" How do you do that?Pedersen

A colleague pointed out that I should declare the start and stop variables as volatile to avoid code reordering.

Sorry, but your colleague is wrong.

The compiler does not reorder calls to functions whose definitions are not available at compile time. Simply imagine the hilarity that would ensue if the compiler reordered calls such as fork and exec, or moved code around them.

In other words, any function with no definition is a compile time memory barrier, that is, the compiler does not move subsequent statements before the call or prior statements after the call.

In your code calls to std::clock end up calling a function whose definition is not available.
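
To make the principle concrete, here is a minimal sketch (my illustration; OpaqueBarrier is a hypothetical name, and as a comment below notes, link-time optimization can see through this):

// barrier.cpp -- a separate translation unit
void OpaqueBarrier() {}

// benchmark.cpp
void OpaqueBarrier();  // declaration only; the definition is not visible here

std::clock_t start = std::clock();
OpaqueBarrier();                   // compiler cannot move statements across this call
int answer = SlowCalculation(42);
OpaqueBarrier();
std::clock_t stop = std::clock();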

I cannot recommend enough watching atomic<> Weapons: The C++ Memory Model and Modern Hardware, because it discusses misconceptions about (compile-time) memory barriers and volatile, among many other useful things.

Nevertheless, I added volatile and found that not only did the benchmark take significantly longer, it also was wildly inconsistent from run to run. Without volatile (and getting lucky to ensure the code wasn't reordered), the benchmark consistently took 600-700 ms. With volatile, it often took 1200 ms and sometimes more than 5000 ms

Not sure if volatile is to blame here.

The reported run-time depends on how the benchmark is run. Make sure you disable CPU frequency scaling so that it does not turn on turbo mode or switch frequencies in the middle of the run. Also, micro-benchmarks should be run as real-time priority processes to avoid scheduling noise. It could be that during another run some background file indexer starts competing with your benchmark for CPU time. See this for more details.

A good practice is to measure the time it takes to execute the function a number of times and report min/avg/median/max/stdev/total time numbers. A high standard deviation may indicate that the above preparations were not performed. The first run is often the longest because the CPU cache may be cold; it may incur many cache misses and page faults, and also resolve dynamic symbols from shared libraries on the first call (lazy symbol resolution is the default run-time linking mode on Linux, for example), while subsequent calls are going to execute with much less overhead.
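
For illustration, a minimal sketch of such a harness (my code, not part of the original answer), reusing the question's SlowCalculation and reporting a few of those statistics:

#include <algorithm>
#include <cstdio>
#include <ctime>
#include <vector>

int SlowCalculation(int input);  // the function under test, defined elsewhere

int main() {
    const int runs = 100;
    std::vector<double> ms;
    for (int i = 0; i < runs; ++i) {
        std::clock_t start = std::clock();
        volatile int answer = SlowCalculation(42);  // keep the result observable
        std::clock_t stop = std::clock();
        (void)answer;
        ms.push_back((stop - start) * 1000.0 / CLOCKS_PER_SEC);
    }
    std::sort(ms.begin(), ms.end());
    double total = 0;
    for (double m : ms) total += m;
    std::printf("min %.1f  median %.1f  avg %.1f  max %.1f  (ms over %d runs)\n",
                ms.front(), ms[runs / 2], total / runs, ms.back(), runs);
}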

Glori answered 25/2, 2013 at 16:35 Comment(15)
If you are correct, then my compiler (MSVC++ 2010 in 64-bit mode) is broken because I found a case where it reordered the clock calls exactly as I showed. I guess I'll file a bug. As for the inconsistent run times with volatile, I'm aware of the external factors, and I have minimized them. The odd thing is that the times are very consistently inconsistent with volatile, and consistently consistent without volatile, so I don't think it's anything as random as a file indexer kicking in. Thanks for the video link, it was already on my "to watch" list.Electrician
You may like to run your code on Linux under Valgrind to see line-by-line execution time and cache effects. They must have something similar for Windows though. Still, I would like to see the code where it re-orders the code the way you describe.Glori
It does not reorder calls to std::clock() but it may inline and move the call to SlowCalculation() wherever it pleases (and often does). Why else would people use barriers?Lcm
I did read it. What was there to read? When you have 3 writes to volatile variables in a row then the compiler cannot reorder those. Even if all 3 calculations can be inlined.Lcm
It is dangerous to assume the compiler does not know something that it actually can know. For instance, std::clock is a function defined in the standard library, which the compiler is providing. It is not legal for the user to define anything in namespace std, so the compiler knows you are calling its version of std::clock, so this is not the reason why this isn't allowed. Even if SlowCalculation is defined in some other translation unit, that also does not turn off optimization, because Visual Studio, clang, and gcc all support link-time optimization.Jabalpur
@DavidStone I do not think the compiler "knows" anything about functions, unless it is an intrinsic function.Glori
@MaximEgorushkin: Nothing stops a vendor from giving their compiler special knowledge about any function in namespace std if the user cannot specialize or overload it.Jabalpur
The reasoning is problematic. std::clock is not required to be a library I/O function, so it itself is not free of reordering. Even if it is true that ISO C++ mandates std::clock be implemented as a function (rather than a macro) and no sane implementation should reorder calls to it, it is not guaranteed to behave as you said with every conforming implementation. And treating inlining as a compile-time barrier is plain wrong in general. Implementations can eliminate subsequent calls entirely when they can prove the function has no side effects, e.g. when declared with __attribute__((__const__)) in G++.Hinds
@Hinds Right, from the standard's point of view any function not marked __attribute__((__const__)) is an I/O function.Glori
This is the view of the implementation, which is a conservative but practical strategy. However, I don't find that the standard requires it to be one of the I/O functions whose calls produce side effects "which are changes in the state of the execution environment". And at least currently, the standard has nothing equivalent to __attribute__((__const__)), etc.Hinds
@Hinds One could conceive another implementation of the entire standard library or std::clock alone where std::clock does I/O and that would still be standard-compliant.Glori
This is the status quo. However, if I want my code itself to conform to the standard (that is, avoiding reliance on any concrete implementation details), I have to guarantee by myself that the compiler will never reorder the code. Meanwhile (perhaps) the only choice is to code very carefully: to add volatile to both the counter variable and the intermediate result that stores the output of the code being measured. That might be what the OP's colleague thought, and unfortunately, it was somehow correct.Hinds
gcc and others can/will reorder calls to clock functions (not relative to each other but relative to the code under test) like this and make the benchmark results invalid. volatile so far is the only way to prevent that.Tophus
@Tophus Show an example.Glori
@MaximEgorushkin A compiler can know something about a user written function when the programmer uses (compiler specific) attributes: "For example, you can use attributes to specify that a function never returns (noreturn), returns a value depending only on the values of its arguments (const), or has printf-style arguments (format)."6.32 Declaring Attributes of FunctionsPedersen

The usual way to prevent reordering is a compiler barrier, i.e. asm volatile ("":::"memory"); (with gcc). This is an asm statement which does nothing, but we tell the compiler it will clobber memory, so it is not permitted to reorder code across it. The cost of this is only the actual cost of preventing the reorder, which is obviously not the case for changing the optimisation level etc. as suggested elsewhere.
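
Applied to the question's benchmark, the placement might look like this (my sketch, not part of the original answer):

std::clock_t start = std::clock();
asm volatile ("":::"memory");      // compiler may not move code across this point
int answer = SlowCalculation(42);
asm volatile ("":::"memory");      // keeps the second clock call from being hoisted
std::clock_t stop = std::clock();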

I believe _ReadWriteBarrier is equivalent for Microsoft stuff.

Per Maxim Yegorushkin's answer, reordering is unlikely to be the cause of your issues though.

Errolerroll answered 25/2, 2013 at 17:10 Comment(1)
"it will clobber memory" What memory exactly? Do you mean externally accessible objects?Pedersen

Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".
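
For example (my illustration):

volatile int v = 0;
int a = v + v;  // two separate loads of v must be emitted; they cannot be folded
v = 1;
v = 1;          // both stores must be emitted; neither can be dropped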

In order to prevent compiler reordering you may use so-called compiler fences. MSVC includes 3 compiler fences:

_ReadWriteBarrier() - full fence

_ReadBarrier() - two-sided fence for loads

_WriteBarrier() - two-sided fence for stores

ICC includes __memory_barrier() full fence.

Full fences are usually the best choice because there is no need for finer granularity at this level (compiler fences are basically costless at run-time).
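
A sketch of how such a fence might be placed around the timed region (my placement, not from the original answer; _ReadWriteBarrier is declared in <intrin.h>, and newer MSVC versions deprecate it in favour of C++11 atomics):

#include <intrin.h>

std::clock_t start = std::clock();
_ReadWriteBarrier();               // full compiler fence: no reordering across it
int answer = SlowCalculation(42);
_ReadWriteBarrier();
std::clock_t stop = std::clock();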

Statement reordering (which most compilers do when optimization is enabled) is also the main reason certain programs fail to operate correctly when compiled with compiler optimization.

I suggest reading http://preshing.com/20120625/memory-ordering-at-compile-time to see the potential issues we can face with compiler reordering, etc.

Alterant answered 26/2, 2013 at 11:6 Comment(1)
volatile also guarantees that the value is written the way the ABI defines the value representation of that object; that any valid ABI value representation can be read back; and that the compiler doesn't assume anything about the value obtained from such a read, even if there was a read of a write immediately before.Pedersen

Related problem: how to stop the compiler from hoisting a tiny repeated calculation out of a loop

I couldn't find this anywhere - so adding my own answer 11 years after the question was asked ;).

Using volatile on variables is not what you want for that. It will cause the compiler to load and store those variables from and to RAM every single time (as if there were a side effect that must be preserved: which is good for I/O registers). When you are benchmarking, you are not interested in measuring how long it takes to get something from memory or write it there. Often you just want your variable to be in CPU registers.

volatile is usable if you assign to it once outside a loop, so that the loop (like one summing an array) doesn't get optimized away, as an alternative to printing the result. (Like the long-running function in the question.) But not inside a tiny loop; that will introduce store/reload instructions and store-forwarding latency.
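
A minimal sketch of that pattern (my example, with a hypothetical sum-an-array workload):

#include <cstdint>

int main() {
    uint64_t data[1024] = {};      // hypothetical input
    uint64_t sum = 0;
    for (uint64_t x : data)        // the loop being benchmarked
        sum += x;
    volatile uint64_t sink = sum;  // a single volatile store outside the loop:
    (void)sink;                    // the result counts as used, so the loop survives
}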


I think that the ONLY way to force your compiler into not optimizing your benchmark code to hell is by using asm. This allows you to fool the compiler into thinking it doesn't know anything about your variables' content or usage, so it has to do everything every single time, as often as your loop asks it to.

For example, if I wanted to benchmark m & -m where m is some uint64_t, I could try:

uint64_t const m = 0x0000080e70100000UL;
for (int i = 0; i < loopsize; ++i)
{
  uint64_t result = m & -m;
}

The compiler would obviously say: I'm not even going to calculate that, since you're not using the result. In other words, it would actually do:

for (int i = 0; i < loopsize; ++i)
{
}

Then you can try:

uint64_t const m = 0x0000080e70100000UL;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
  result = m & -m;
}

and the compiler says: ok, so you want me to write to result every time, and does this:

uint64_t const m = 0x0000080e70100000UL;
uint64_t tmp = m & -m;
static uint64_t volatile result;
for (int i = 0; i < loopsize; ++i)
{
  result = tmp;
}

Spending a lot of time writing to the memory address of result loopsize times, just as you asked.

Finally you could also make m volatile, but the result would look like this in assembly:

507b:   ba e8 03 00 00          mov    $0x3e8,%edx
  # top of loop
5080:   48 8b 05 89 ef 20 00    mov    0x20ef89(%rip),%rax        # 214010 <m_test>
5087:   48 8b 0d 82 ef 20 00    mov    0x20ef82(%rip),%rcx        # 214010 <m_test>
508e:   48 f7 d8                neg    %rax
5091:   48 21 c8                and    %rcx,%rax
5094:   48 89 44 24 28          mov    %rax,0x28(%rsp)
5099:   83 ea 01                sub    $0x1,%edx
509c:   75 e2                   jne    5080 <main+0x120>

Reading from memory twice and writing to it once, besides the requested calculation with registers.

The correct way to do this is therefore:

for (int i = 0; i < loopsize; ++i)
{
  uint64_t result = m & -m;
  asm volatile ("" : "+r" (m) : "r" (result));
}

which results in the assembly code (from gcc8.2 on the Godbolt compiler explorer):

 # gcc8.2 -O3 -fverbose-asm
    movabsq $8858102661120, %rax      #, m
    movl    $1000, %ecx     #, ivtmp_9     # induction variable tmp_9
.L2:
    mov     %rax, %rdx      # m, tmp91
    neg     %rdx            # tmp91
    and     %rax, %rdx      # m, result
       # asm statement here,  m=%rax   result=%rdx
    subl    $1, %ecx        #, ivtmp_9
    jne     .L2
    ret     

Doing exactly the three requested assembly instructions inside the loop, plus a sub and jne for the loop overhead.

The trick here is that by using asm volatile1 we tell the compiler:

  1. "r" input operand: it uses the value of result as input so the compiler has to materialize it in a register.
  2. "+r" input/output operand: m stays in the same register but is (potentially) modified.
  3. volatile: it has some mysterious side effect and/or is not a pure function of the inputs; the compiler must execute it as many times as the source does. This forces the compiler to leave your test snippet alone and inside the loop. See the gcc manual's Extended Asm#Volatile section.

footnote 1: The volatile is required here or the compiler will turn this into an empty loop. Non-volatile asm (with any output operands) is considered a pure function of its inputs that can be optimized away if the result is unused. Or CSEd to only run once if used multiple times with the same input.


Everything below is not mine-- and I do not necessarily agree with it. --Carlo Wood

If you had used asm volatile ("" : "=r" (m) : "r" (result)); (with an "=r" write-only output), the compiler might choose the same register for m and result, creating a loop-carried dependency chain that tests the latency, not throughput, of the calculation.

From that, you'd get this asm:

5077:   ba e8 03 00 00          mov    $0x3e8,%edx
507c:   0f 1f 40 00             nopl   0x0(%rax)    # alignment padding
  # top of loop
5080:   48 89 e8                mov    %rbp,%rax    # copy m
5083:   48 f7 d8                neg    %rax         # -m
5086:   48 21 c5                and    %rax,%rbp    # m &= -m   instead of using the tmp as the destination.
5089:   83 ea 01                sub    $0x1,%edx
508c:   75 f2                   jne    5080 <main+0x120>

This will run at 1 iteration per 2 or 3 cycles (depending on whether your CPU has mov-elimination or not.) The version without a loop-carried dependency can run at 1 per clock cycle on Haswell and later, and Ryzen. Those CPUs have the ALU throughput to run at least 4 uops per clock cycle.

This asm corresponds to C++ that looks like this:

for (int i = 0; i < loopsize; ++i)
{
  m = m & -m;
}

By misleading the compiler with a write-only output constraint, we've created asm that doesn't look like the source (which looked like it was computing a new result from a constant every iteration, not using result as an input to the next iteration).

You might want to microbenchmark latency, so you can more easily detect the benefit of compiling with -mbmi or -march=haswell to let the compiler use blsi %rax, %rax and calculate m &= -m; in one instruction. But it's easier to keep track of what you're doing if the C++ source has the same dependency as the asm, instead of fooling the compiler into introducing a new dependency.

Adkinson answered 17/1, 2019 at 22:14 Comment(13)
The OP is talking about assigning the final result of the whole slow calculation to volatile int answer, not about using volatile inside a hot loop. You're right that you should never do that because it introduces store-forwarding latency. But assigning a final result to volatile, like printing it or returning it from main is a good way to use a result so the compiler doesn't optimize away a whole sum-an-array loop or something.Landholder
In your case, you could just hide the compile-time constant value of m from the compiler outside the loop, instead of using asm() to force the compiler to materialize each step of result exactly the way you wrote it. (i.e. you've defeated the possibility of it optimizing the whole loop to popcnt if you were doing result += m & -m.) Repeating a tiny expression in a loop that compiles to a couple instructions is of limited value. You're only measuring throughput, not latency, and with no chance to optimize into surrounding code.Landholder
And most importantly, your asm statement tells the compiler the wrong thing: "=r" tells it that m is a write-only output. Use "+r" (m) for a read-write input/output operand. You happened to get lucky here that the compiler picked the same output register it already had m in, so the resulting asm still made sense. But with any unrolling it might not have.Landholder
I cannot follow your argument about using "+r", all I want is that the compiler thinks that the C++ variable m might have a different value, so it will re-do the calculation every loop iteration. I agree that theoretically it could use a different register for the 'new' m but that only works with loop unrolling. When there is no loop unrolling then the compiler is forced to use the same register anyway (or it did an extremely bad job at optimization because then it will have to move that register afterwards into the register used for m at the top of the loop).Adkinson
As for your second comment, "hiding" the value of m doesn't work; when m doesn't change (and it doesn't) nor do I fake changing it with an asm(), then the calculation of 'm & -m' will be moved outside of the loop, while that is exactly the piece of code that I want to benchmark.Adkinson
Your first remark is entirely correct :/. I placed my "answer" with the wrong question. What I was wrestling with is how to stop the compiler from moving benchmarked code outside a loop (without adding more overhead). I Googled a lot and couldn't find the answer; once I figured it out I picked this SO question based on the headline when Googling for my subject - and I still think it is likely to be found by people who have the same problem as me - but the actual question is different :(. Perhaps I should have created my own question first and then answered it.Adkinson
Using the wrong constraint and having it happen to work is always a bad plan, and is a bad example for future readers on SO. It could break with a more complex loop (more surrounding code), as well as with loop unrolling (which clang does by default, unlike gcc). Anyway, the possible danger is creating a false dependency by for example picking the same register as the "r"(result) input, or not accurately reflecting the extra mov cost required to compute something from m without destroying the original value.Landholder
Inside a tiny loop, you're mostly going to see loop overhead and front-end effects like Is performance reduced when executing loops whose uop count is not a multiple of processor width?, e.g. one extra instruction can cut performance in half on Sandybridge/IvyBridge, if it takes the loop from 4 to 5 uops. So this tiny loop benchmark gives you a very narrow and distorted view of the cost of a C expression as part of a larger block of code. e.g. on Haswell you can't detect the speedup from BMI1 blsr %rax, %rbpLandholder
Oh wait, actually you can, your loop here is destroying the original m because of your bogus constraint, so your asm loop is bottlenecked on latency of mov+neg+and (2 or 3 cycles), not throughput (1 iteration per cycle on Haswell, if not for that loop-carried dependency chain). Exactly because you used a constraint that lied to the compiler about what you wanted, or to put it another way, wrote an asm statement that had an output dependency. (Leaving a register unmodified of course creates a dependency on the old value, which gcc wasn't expecting).Landholder
I fixed your answer to explain the dependency gcc introduces for "=r"(m), and show asm from the right way. I also added a header to explain how this answer differs from the question being asked. Probably a separate Q&A would be better, but Avoid optimizing away variable with inline asm already exists, and there's probably another question even closer to what you're trying to do.Landholder
@PeterCordes Well, thanks for the hard work. But you're either over my head or you're wrong - either way, I have no time to look further into it, so I just added a disclaimer that the part that you added is not mine and left everything in for the rest (no edit wars).Adkinson
That's fine, looks like a good way to separate my probably-too-big edit :P. If you have a Haswell / Broadwell / Skylake CPU, or Ryzen, you'd be able to benchmark the difference if you have time. My version should run at 1 iteration per 1 clock, bottlenecked on throughput (like I think you were trying to), while your version should run at 1 iteration per 2 or 3 clocks, bottlenecked on latency of m &= -m; unless you compile with -mbmi or -march=haswell. Or just look at the asm from actually writing m &= -m; and note it's the same as your version.Landholder
"load and store those variable from and to RAM" You mean from the addressable memory accessible from the current processus unit and not the physical RAM, correct?Pedersen

You could make two source files: SlowCalculation compiled with g++ -O3 (a high level of optimization), and the benchmark compiled with g++ -O1 (a lower level, still optimized; that may be sufficient for the benchmarking part).

According to the man page, reordering of code happens at the -O2 and -O3 optimization levels.

Since optimization happens during compilation, not linkage, the benchmark side should not be affected by code reordering.

This assumes you are using g++, but there should be something equivalent in other compilers.
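
A sketch of the layout (the file names and the placeholder body are mine):

// slow.cpp -- compile with: g++ -O3 -c slow.cpp
int SlowCalculation(int input) { return input * input; }  // placeholder body

// bench.cpp -- compile with: g++ -O1 -c bench.cpp, then link: g++ bench.o slow.o
int SlowCalculation(int input);  // declaration only; the definition stays opaque here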

Maidy answered 23/2, 2013 at 16:29 Comment(2)
That's an interesting idea. It seems likely to keep SlowCalculation from being inlined directly into the benchmark, and that would greatly reduce the chance of the code being reordered. But I'm not sure it's foolproof.Electrician
"Since optimization happens during compilation, not linkage" (1) there is such thing as global optimization (2) if there is no possible late optimization, as the linking is done on pure executable code with no semantic information, or done too late to optimize anything (run time linking), the (1) point is moot. But then so is your suggestion that reordering might happen at some optimization level in the separately compiled benchmark code: the benchmark code that calls separately compiled code cannot assume anything about that code, so it cannot reorder calls to it.Pedersen

The correct way to do this in C++ is to use a class, e.g. something like

class Timer
{
    std::clock_t startTime;
    std::clock_t* targetTime;

public:
    Timer(std::clock_t* target) : targetTime(target) { startTime = std::clock(); }
    ~Timer() { *targetTime = std::clock() - startTime; }
};

and use it like this:

std::clock_t slowTime;
{
    Timer timer(&slowTime);
    int answer = SlowCalculation(42);
}

Mind you, I don't actually believe your compiler will ever re-order like this.

Fimbria answered 25/2, 2013 at 17:22 Comment(0)

There are a couple of ways that I can think of. The idea is to create compile-time barriers so that the compiler does not reorder a set of instructions.

One possible way to avoid reordering would be to enforce a dependency among instructions that cannot be resolved by the compiler (e.g. passing a pointer to the function and using that pointer in a later instruction). I'm not sure how that would affect the performance of the actual code that you are interested in benchmarking.
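
As a hypothetical illustration of such a dependency (the helper's name and signature are mine, not the answerer's):

// Defined in another translation unit; the compiler must assume the call
// may read or write through the pointers, so neither assignment from
// std::clock() can be moved across it.
int SlowCalculationDep(int input, std::clock_t* start, std::clock_t* stop);

std::clock_t start, stop;
start = std::clock();
int answer = SlowCalculationDep(42, &start, &stop);
stop = std::clock();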

Another possibility is to make the function SlowCalculation(42); an extern function (define this function in a separate .c/.cpp file and link the file to your main program) and declare start and stop as global variables. I do not know what optimizations the link-time/inter-procedural optimizer of your compiler offers.

Also, if you compile at -O1 or -O0, the compiler most probably would not bother reordering instructions. Your question is somewhat related to (Compile time barriers - compiler code reordering - gcc and pthreads).

Sathrum answered 28/2, 2013 at 2:9 Comment(0)

The reordering described by your colleague simply breaks 1.9/13:

Sequenced before is an asymmetric, transitive, pair-wise relation between evaluations executed by a single thread (1.10), which induces a partial order among those evaluations. Given any two evaluations A and B, if A is sequenced before B, then the execution of A shall precede the execution of B. If A is not sequenced before B and B is not sequenced before A, then A and B are unsequenced. [ Note: The execution of unsequenced evaluations can overlap. —end note ] Evaluations A and B are indeterminately sequenced when either A is sequenced before B or B is sequenced before A, but it is unspecified which. [ Note: Indeterminately sequenced evaluations cannot overlap, but either could be executed first. —end note ]

So basically you should not worry about reordering as long as you don't use threads.

Declivous answered 25/2, 2013 at 17:18 Comment(7)
Even more, any C++ program is guaranteed to be sequentially consistent as long as there are no data races. A data race is when more than one thread accesses the same object and at least one of them is a writer.Glori
This answer was a close runner-up for the bounty.Electrician
I should have noted this answer is wrong. The rule here is one of so-called abstract machine semantics rules, which can be bypassed by actual implementation due to the "as-if" rule. However, volatile is one of the exceptions.Hinds
Your assertion "you should not think about reordering while you don't use threads" is wrong. Reordering is still possibly significant in single-threaded programs and it may be not expected.Hinds
@FrankHB, since you are guaranteed to have sequential behavior ("as is" or "as if" — doesn't matter) you don't need to care about it.Declivous
Ideally your claim should be the case. However, the question discovers a dark side of C++ standard: it is actually not guaranteed to work as your imagination. This may be a defect. See here for further discussion.Hinds
@MaximEgorushkin "any C++ program is guaranteed to be sequentially consistent as long as there are no data races" 1) no it isn't, and 2) it's irrelevant here "A data race is when there are more than one thread accessing the same object and at least one thread is a writer" 3) That isn't the definition of a data race and 4) a data race causes UB so 5) you are basically saying that all programs that have a semantic that is restricted in any way have SC executions, which is incorrectPedersen
