Enforcing statement order in C++

Asked 13/6, 2016 at 9:47 Answered 3/7, 2019 at 8:14

146

Suppose I have a number of statements that I want to execute in a fixed order. I want to use g++ with optimization level 2, so some statements could be reordered. What tools does one have to enforce a certain ordering of statements?

Consider the following example.

using Clock = std::chrono::high_resolution_clock;

auto t1 = Clock::now(); // Statement 1
foo();                  // Statement 2
auto t2 = Clock::now(); // Statement 3

auto elapsedTime = t2 - t1;

In this example it is important that the statements 1-3 are executed in the given order. However, can't the compiler think statement 2 is independent of 1 and 3 and execute the code as follows?

using Clock=std::chrono::high_resolution_clock;

foo();                  // Statement 2
auto t1 = Clock::now(); // Statement 1
auto t2 = Clock::now(); // Statement 3

auto elapsedTime = t2 - t1;

Pretonic answered 13/6, 2016 at 9:47 Comment(14)

If the compiler thinks they're independent when they're not, the compiler is broken and you should use a better compiler. – Melodramatic 13/6, 2016 at 11:43

open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0342r0.html – Rader 13/6, 2016 at 14:29

could __sync_synchronize() be of any help? – Littoral 13/6, 2016 at 16:30

Try putting foo in a different compilation unit to the function above. This may stop the compiler from being able to analyse it and thus force it to keep the same ordering. – Elfin 13/6, 2016 at 18:35

@HowardHinnant: The semantic power of standard C would be improved tremendously if such a directive were defined, and if the aliasing rules were adjusted to exempt reads performed after a barrier of data which was written before it. – Campeche 13/6, 2016 at 18:40

@LokiAstari Jeremy's answer already showed that effect, but also showed that (predictably) LTO defeats it too. – Erdrich 13/6, 2016 at 18:40

@DavidSchwartz In this case it's about measuring the time foo takes to run, which the compiler is allowed to ignore when reordering, just like it's allowed to ignore observation from a different thread. – Pinole 14/6, 2016 at 7:17

If I understand your answers correctly, reordering of statements can be avoided by putting the statements in functions in different compilation units. However, if LTO is on, this does not work either. Also, the existence of the proposal pointed at by Howard Hinnant – Pretonic 14/6, 2016 at 7:41

indicates there is no perfect way to avoid this issue. I am going to accept the answer of Jeremy. Thank you all for your help. :) – Pretonic 14/6, 2016 at 7:49

@Pinole There is not required to be any such thing as "the time foo takes to run". The compiler is free to rearrange the code such that foo's work is done wherever and whenever it thinks best, interspersing other work that might need to done however it pleases. You will need compiler-specific knowledge to ensure that "the time foo takes to run" even is a thing that could be measured. – Melodramatic 14/6, 2016 at 16:40

@Pretonic read the question and accepted answer in this thread for further insights – Chewning 14/6, 2016 at 18:46

Volatile function pointers are your friends. Also, if you're profiling, use a tool like Valgrind. – Vachon 28/6, 2016 at 7:26

foo() could take t1 as a parameter, then return 0.0 that's added to t2. – Vachon 30/6, 2016 at 13:59

I guess memory barriers are what you are looking for ? en.cppreference.com/w/cpp/atomic/memory_order https://mcmap.net/q/16847/-efficient-memory-barriers – Condescending 1/7, 2016 at 7:0

142

I'd like to try to provide a somewhat more comprehensive answer after this was discussed with the C++ standards committee. In addition to being a member of the C++ committee, I'm also a developer on the LLVM and Clang compilers.

Fundamentally, there is no way to use a barrier or some operation in the sequence to achieve these transformations. The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation. It can simulate them, it knows they cannot be observed by correct programs, and is always free to move them around.

We could try to prevent this, but it would have extremely negative results and would ultimately fail.

First, the only way to prevent this in the compiler is to tell it that all of these basic operations are observable. The problem is that this then would preclude the overwhelming majority of compiler optimizations. Inside the compiler, we have essentially no good mechanisms to model that the timing is observable but nothing else. We don't even have a good model of what operations take time. As an example, does converting a 32-bit unsigned integer to a 64-bit unsigned integer take time? It takes zero time on x86-64, but on other architectures it takes non-zero time. There is no generically correct answer here.

But even if we succeed through some heroics at preventing the compiler from reordering these operations, there is no guarantee this will be enough. Consider a valid and conforming way to execute your C++ program on an x86 machine: DynamoRIO. This is a system that dynamically evaluates the machine code of the program. One thing it can do is online optimizations, and it is even capable of speculatively executing the entire range of basic arithmetic instructions outside of the timing. And this behavior isn't unique to dynamic evaluators, the actual x86 CPU will also speculate (a much smaller number of) instructions and reorder them dynamically.

The essential realization is that the fact that arithmetic isn't observable (even at the timing level) is something that permeates the layers of the computer. It is true for the compiler, the runtime, and often even the hardware. Forcing it to be observable would both dramatically constrain the compiler, but it would also dramatically constrain the hardware.

But all of this should not cause you to lose hope. When you want to time the execution of basic mathematical operations, we have well studied techniques that work reliably. Typically these are used when doing micro-benchmarking. I gave a talk about this at CppCon2015: https://youtu.be/nXaxk27zwlk

The techniques shown there are also provided by various micro-benchmark libraries such as Google's: https://github.com/google/benchmark#preventing-optimization

The key to these techniques is to focus on the data. You make the input to the computation opaque to the optimizer and the result of the computation opaque to the optimizer. Once you've done that, you can time it reliably. Let's look at a realistic version of the example in the original question, but with the definition of foo fully visible to the implementation. I've also extracted a (non-portable) version of DoNotOptimize from the Google Benchmark library which you can find here: https://github.com/google/benchmark/blob/v1.0.0/include/benchmark/benchmark_api.h#L208

#include <chrono>

template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
  asm volatile("" : "+m"(const_cast<T &>(value)));
}

// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }

auto time_foo() {
  using Clock = std::chrono::high_resolution_clock;

  auto input = 42;

  auto t1 = Clock::now();         // Statement 1
  DoNotOptimize(input);
  auto output = foo(input);       // Statement 2
  DoNotOptimize(output);
  auto t2 = Clock::now();         // Statement 3

  return t2 - t1;
}

Here we ensure that the input data and the output data are marked as un-optimizable around the computation foo, and only around those markers are the timings computed. Because you are using data to pincer the computation, it is guaranteed to stay between the two timings and yet the computation itself is allowed to be optimized. The resulting x86-64 assembly generated by a recent build of Clang/LLVM is:

% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
        .text
        .file   "so.cpp"
        .globl  _Z8time_foov
        .p2align        4, 0x90
        .type   _Z8time_foov,@function
_Z8time_foov:                           # @_Z8time_foov
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbx
.Ltmp0:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp1:
        .cfi_def_cfa_offset 32
.Ltmp2:
        .cfi_offset %rbx, -16
        movl    $42, 8(%rsp)
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, %rbx
        #APP
        #NO_APP
        movl    8(%rsp), %eax
        addl    %eax, %eax              # This is "foo"!
        movl    %eax, 12(%rsp)
        #APP
        #NO_APP
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        subq    %rbx, %rax
        addq    $16, %rsp
        popq    %rbx
        retq
.Lfunc_end0:
        .size   _Z8time_foov, .Lfunc_end0-_Z8time_foov
        .cfi_endproc


        .ident  "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
        .section        ".note.GNU-stack","",@progbits

Here you can see the compiler optimizing the call to foo(input) down to a single instruction, addl %eax, %eax, but without moving it outside of the timing or eliminating it entirely despite the constant input.

Hope this helps, and the C++ standards committee is looking at the possibility of standardizing APIs similar to DoNotOptimize here.

Gimpel answered 25/6, 2016 at 6:48 Comment(21)

Thank you for your answer. I have marked it as the new best answer. I could have done this earlier, but I have not read this stackoverflow page for many months. I am very interested in using the Clang compiler to make C++ programs. Among other things, I like that one can use Unicode characters in variable names in Clang. I think I will ask more questions about Clang on Stackoverflow. – Pretonic 10/12, 2016 at 9:38

While I understand how this prevents foo being optimized away completely, can you elaborate a bit why this prevents the calls to Clock::now() being reordered relative to foo()? Does the optimzer have to assume that DoNotOptimize and Clock::now() have access to and might modify some common global state which in turn would tie them to the in- and output? Or are you relying on some current limitations of the optimizer's implementation? – Arteritis 22/2, 2017 at 11:43

DoNotOptimize in this example is a synthetically "observable" event. It is as if it notionally printed visible output to some terminal with the input's representation. Since reading the clock is also observable (you're observing time passing) they cannot be re-ordered without changing the observable behavior of the program. – Gimpel 26/3, 2017 at 2:38

I'm still not quite clear with the concept "observable", if the foo function is doing some operations like reading from a socket which may be blocked for a while, does this count a observable operation? And since the read is not a "totally known" operation (right?), will the code keep in order? – Potent 18/9, 2018 at 8:39

"The fundamental problem is that the operational semantics of something like an integer addition are totally known to the implementation." But it seems to me that the issue isn't the semantics of integer addition, it's the semantics of calling function foo(). Unless foo() is in the same compilation unit, how does it know that foo() and clock() don't interact? – Dennard 20/2, 2019 at 15:10

@Arteritis until someone comes with a precise quote from the C++ standard, I would assert that the first reordering cannot happen by passing t1 as a new input parameter to DoNotOptimize. But I'm not sure how to prevent the reordering of the second one since Clock::now() takes no inputs! I think I would move it into a __attribute__ ((noinline)) function that takes an input. Or maybe call ::now from assembly, but that is hard: stackoverflow.com/questions/37502841/… – Flake 26/6, 2019 at 15:59

OK, I went ahead and created a paranoid example with full data dependencies which I'm pretty sure cannot be reordered: stackoverflow.com/questions/37786547/… – Flake 3/7, 2019 at 8:54

The only problem that's left with DoNotOptimize() is, that the input needs to be reloaded from memory and the output needs to be written to memory within the critical region. For a simple one-cycle-latency instruction like addl, that's significant overhead. Just like the calling of the timer function is significant overhead. I think, the only way to eliminate these overheads from your measurement is to time a loop containing the operation of interest (with the input and output properly DoNotOptimize()ed). – Furore 8/11, 2019 at 10:0

DoNotOptimize works cause volatile access enforces strict sequentiality. – Verbalize 5/3, 2020 at 7:40

@cmaster-reinstatemonica: Even a call via volatile-qualified function pointer need not always enforce ordering. A compiler could eagerly perform some computations and store results in a private place before reading the volatile-qualified pointer, check whether the pointer identifies a particular function the compiler has optimized for this particular call site, and either call the ordinary version of function identified by the pointer or invoke a version which is optimized for the particular call site, and which exploits the eagerly-performed computations. – Campeche 23/3, 2021 at 22:7

@Campeche The point about DoNotOptimize() is that it declares the memory behind its argument as being changed by the assembler code. As such, the compiler is forced to a) reload the input value from memory after the call, b) to not let any computation within the critical section rely on a stale version of the input, and c) to write the result to memory before the second DoNotOptimize() call. All computations that rely on input and are relevant to output must be sequenced between these two calls. If you fail to include I/O variables in the DoNotOptimize() calls, that's your mistake. – Furore 23/3, 2021 at 23:51

@cmaster-reinstatemonica: Would anything forbid a "clever" compiler from transforming int x = operationThatProduces5(); DoNotOptimize(x); doSomething(x); into int x = operationThatProduces5(); DoNotOptimize(x); if (x == 5) doSomething(5); else doSomething(x);? – Campeche 4/8, 2021 at 21:21

@Campeche As long as the x in if(x == 5) is reloaded from memory after the DoNotOptimize(x) call, the compiler is free to do any kind of insane stuff to perform the doSomething(x) call. So, yes, it would be possible for the CPU to speculatively execute doSomething(5) before x has been reloaded from memory. However, the compiler has no reason to believe that x == 5 would remain true after the DoNotOptimize(x) call, and will not introduce the danger of a pipeline flush due to misprediction of if(x == 5) if it is any good. – Furore 5/8, 2021 at 5:6

@cmaster-reinstatemonica: In many use cases, after a sequence like the above, the value of x would be 5 more than 25% of the time, if not more than 50%. Thus, having a compiler evaluate the cost of a hard-coded x==5 variant and perform the comparison/substitution if it would be 80% or more cheaper than the general-case version could be a very sensible optimization if the compiler didn't use the text of the Standard as an process code in a way contrary to the intention of the Standard's authors as documented in the Rationale. Fundamentally, the Standard was never designed to draw... – Campeche 5/8, 2021 at 14:52

...a clear line between constructs that all implementations must support and those that will never be used by non-broken programs, because no single partitioning could be suitable for more than a small fraction of the purposes to which C is put. While some compiler maintainers seem to need bright-line rules, such rules could only be written sensibly if the Standard weakened a couple of long-standing principles, most notably: 1. Replace the idea that all meaningfully-conforming programs should be expected to run meaningfully on all implementations with a requirement that... – Campeche 5/8, 2021 at 14:58

...conforming implementations must reject any meaningfully-conforming programs that they cannot otherwise process meaningfully, so constructs that are needed for some tasks but not others could be used in meaningfully-conforming programs to accomplish those tasks, but implementations not intended for those tasks need not support them. 2. Use an abstraction model in which certain optimizations may have observable effects; a program whose actions make such effects observable would be a correct program if all allowable combinations of effects would yield behavior meeting requirements. – Campeche 5/8, 2021 at 15:3

@Campeche What does the standard have to do with that? Afaik, the contents of an asm statement is not standardized in any way. However, the compilers do describe its syntax and semantics, especially of the "+m" directive: It means that the compiler will consider the memory behind the pointer to be both input and output of the (empty) assembler code, and as such, the compiler must assume that the value of x does change. To the compiler, the chance of x remaining the same is only 2^-64, assuming a 64 bit value. It should definitely no try to optimize that. – Furore 5/8, 2021 at 15:16

If e.g. code uses a float* to store a value, and then uses a compiler-defined barrier, and then uses a uint32_t* to read the bits of the storage, the Standard would not recognize any effect the barrier might have in preventing the read from invoking Undefined Behavior. If, however, the Standard were to replace N1570 6.5p7 with a rule that instead said that a read of a uint32_t* may generally be regarded as unsequenced relative to a preceding or following write of type float*, then the effects of the barrier would be clear even if the Standard knew nothing about it. – Campeche 5/8, 2021 at 15:25

@cmaster-reinstatemonica: Further, although a compiler wouldn't be entitled to ignore the possibility of x changing in your barrier scenario, a compiler that processes x = 5; barrier(); doSomething(x); as x = 5; barrier(); if (x==5) doSomething(5); else doSomething(x); would not be making a correctness-threatening assumption that x won't change, but rather a performance-affecting assumption that x won't always change, possibly generating code that will perform worse than straightforward code if x changes more than 75% of the time, but better if it doesn't. – Campeche 5/8, 2021 at 15:48

@Campeche Ah, now I see why you see the standard relevant in this. As to the "optimization" that the compiler is allowed to do: Yes, the compiler has every right to emit the if(x == 5) code you gave. Nevertheless, it would be a pretty daft thing to do: The whole point point of the asm statement part after the : is to allow the savy programmer to specify what the compiler can assume, and what it cannot assume. This syntax is not used all over the place (= low optimization necessity), and is used by pretty hard-core low-level programmers who can read and use specs (= best treated as gods). – Furore 5/8, 2021 at 19:58

@cmaster-reinstatemonica: Many of the "optimizations" that programmers complain about today would have been universally recognized as downright obtuse when the Standard is written, and thus there was no need to have the Standard forbid them. Compiler writers, however, often seem to care much more about optimizations than their users do, and regard the fact that a construct would block an optimization the they want to perform as a being problem with the construct, rather than being the reason it was used in the first place. – Campeche 5/8, 2021 at 20:7

Summary:

There seems to be no guaranteed way to prevent reordering, but as long as link-time/full-program optimisation is not enabled, locating the called function in a separate compilation unit seems a fairly good bet. (At least with GCC, although logic would suggest that this is likely with other compilers too.) This comes at the cost of the function call - inlined code is by definition in the same compilation unit and open to reordering.

Original answer:

GCC reorders the calls under -O2 optimisation:

#include <chrono>
static int foo(int x)    // 'static' or not here doesn't affect ordering.
{
    return x*2;
}
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

GCC 5.3.0:

g++ -S --std=c++11 -O0 fred.cpp :

_ZL3fooi:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %ecx, 16(%rbp)
        movl    16(%rbp), %eax
        addl    %eax, %eax
        popq    %rbp
        ret
_Z4fredi:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        movl    %ecx, 16(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -16(%rbp)
        movl    16(%rbp), %ecx
        call    _ZL3fooi
        movl    %eax, -4(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -32(%rbp)
        movl    -4(%rbp), %eax
        addq    $64, %rsp
        popq    %rbp
        ret

But:

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        call    _ZNSt6chrono3_V212system_clock3nowEv
        leal    (%rbx,%rbx), %eax
        addq    $32, %rsp
        popq    %rbx
        ret

Now, with foo() as an extern function:

#include <chrono>
int foo(int x);
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %ecx
        call    _Z3fooi
        movl    %eax, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %eax
        addq    $32, %rsp
        popq    %rbx
        ret

BUT, if this is linked with -flto (link-time optimisation):

0000000100401710 <main>:
   100401710:   53                      push   %rbx
   100401711:   48 83 ec 20             sub    $0x20,%rsp
   100401715:   89 cb                   mov    %ecx,%ebx
   100401717:   e8 e4 ff ff ff          callq  100401700 <__main>
   10040171c:   e8 bf f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401721:   e8 ba f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401726:   8d 04 1b                lea    (%rbx,%rbx,1),%eax
   100401729:   48 83 c4 20             add    $0x20,%rsp
   10040172d:   5b                      pop    %rbx
   10040172e:   c3                      retq

Splanchnic answered 13/6, 2016 at 12:24 Comment(19)

So does MSVC and ICC. Clang is the only one that seems to preserve the original sequence. – Celt 13/6, 2016 at 12:37

With asm volatile("" ::: "memory"); does it prevent the ordering? – Phototype 13/6, 2016 at 13:13

+1 for a good example of reordering. But I wish you added another sentence about how to prevent that from happening. Can we chop this up into several small functions, and can we assume one function finishes before the next one starts even if it's an inline function, or do we need "volatile" or "std::atomic_thread_fence" to really be sure statements occur in a particular order? – Fluoroscopy 13/6, 2016 at 13:24

With the code more as presented in the OP, gcc seems to do the right thing; godbolt.org/g/C93rRK – Faa 13/6, 2016 at 13:24

@Faa - As long as foo() is in a separate compilation unit this seems to be the case. If foo() is defined in the same compilation unit the reordering still happens - presumably because in this latter case the compiler can determine that foo() is not affected by any potential side effects of now(), whereas in the former case it can't. – Splanchnic 13/6, 2016 at 13:38

I agree with David Cary that this is a good example and I would also like to see information about ways to avoid reordering. – Pretonic 13/6, 2016 at 13:46

you don't use t1 and t2 anywhere so it may think the result can be discarded and reorder the code – Staffan 13/6, 2016 at 13:49

Even if the time results are used, the reordering happens in the presence of a definition of the function between the two time calls... godbolt.org/g/EmvZMy – Faa 13/6, 2016 at 13:51

@Faa - I can't offer anything more concrete, but I think my comment alludes to the underlying reason: The compiler knows that foo() cannot affect now(), nor vice versa, and so does the reordering. Various experiments involving extern scope functions and data seem to confirm this. This includes having static foo() depend on a file-scope variable N - if N is declared as static, reordering occurs, whereas if it's declared non-static (i.e. it's visible to other compilation units, and hence potentially subject to side effects of extern functions such as now()) reordering does not occur. – Splanchnic 13/6, 2016 at 13:54

@ Lưu Vĩnh Phúc: Except that the calls themselves are not elided. Once again, I suspect this is because the compiler doesn't know what their side effects might be - but it does know that those side effects cannot influence the behaviour of foo(). – Splanchnic 13/6, 2016 at 13:57

And a final note: specifying -flto (link-time optimisation) causes reordering even in otherwise non-reordered cases. – Splanchnic 13/6, 2016 at 14:1

A declaration of an integer, and a call to some foo(), are not in any way the same thing. – Anselmi 13/6, 2016 at 14:35

@Lightness - Agreed, hence my subsequent comments. – Splanchnic 13/6, 2016 at 14:45

Modified my example to match the question. Results and conclusions (most of which were based on this modified code anyway) remain valid. – Splanchnic 13/6, 2016 at 14:58

auto fred(volatile int x) {     auto t1 = std::chrono::high_resolution_clock::now();     volatile int y = foo(x);     auto t2 = std::chrono::high_resolution_clock::now();     return t2-t1; }

seems not to become reordered. – Ketone 13/6, 2016 at 20:35

Sorry, in C++11 we need decltype(std::chrono::high_resolution_clock::now()-std::chrono::high_resolution_clock::now()) fred(volatile int x) instead of auto fred(volatile int x). – Ketone 13/6, 2016 at 21:7

Reordering of assembly often occurs with the higher optimization settings, and is often done by analyzing memory/register access independant of the original source code. As well, some sets of less-optimal instructions can be replaced by a more optimal set of instructions. These are sometimes called 'peep-hole optimizations', because the optimizer is only looking at that small section of instructions in an effort to find patterns that can be optimized. – Catling 13/6, 2016 at 23:32

How does this answer the question? – Makedamakefast 14/6, 2016 at 8:55

Should be noted though, that no matter what the compiler does or does not, there is also a processor reordering which can also have surprising effects. That is less visible with Intel processors than with ARM processors. – Basically 25/6, 2016 at 6:57

Reordering may be done by the compiler, or by the processor.

Most compilers offer a platform-specific method to prevent reordering of read-write instructions. On gcc, this is

asm volatile("" ::: "memory");

(More information here)

Note that this only indirectly prevents reordering operations, as long as they depend on the reads / writes.

In practice I haven't yet seen a system where the system call in Clock::now() does have the same effect as such a barrier. You could inspect the resulting assembly to be sure.

It is not uncommon, however, that the function under test gets evaluated during compile time. To enforce "realistic" execution, you may need to derive input for foo() from I/O or a volatile read.

Another option would be to disable inlining for foo() - again, this is compiler specific and usually not portable, but would have the same effect.

On gcc, this would be __attribute__ ((noinline))

@Ruslan brings up a fundamental issue: How realistic is this measurement?

Execution time is affected by many factors: one is the actual hardware we are running on, the other is concurrent access to shared resources like cache, memory, disk and CPU cores.

So what we usually do to get comparable timings: make sure they are reproducible with a low error margin. This makes them somewhat artificial.

"hot cache" vs. "cold cache" execution performance can easily differ by an order of magnitude - but in reality, it will be something inbetween ("lukewarm"?)

Norword answered 13/6, 2016 at 10:17 Comment(3)

Your hack with asm affects execution time of the statements between timer calls: the code after memory clobber has to reload all variables from memory. – Tanyatanzania 13/6, 2016 at 11:7

@Ruslan: Their hack, not mine. There are different levels of purging, and doing something like that is unavoidable for reproducible results. – Norword 13/6, 2016 at 13:3

Note that the hack with 'asm' only helps as a barrier for operations which touch memory, and the OP is interested in more than that. See my answer for more details. – Gimpel 27/6, 2016 at 20:1

The C++ language defines what is observable in a number of ways.

If foo() does nothing observable, then it can be eliminated completely. If foo() only does a computation that stores values in "local" state (be it on the stack or in an object somewhere), and the compiler can prove that no safely-derived pointer can get into the Clock::now() code, then there are no observable consequences to moving the Clock::now() calls.

If foo() interacted with a file or the display, and the compiler cannot prove that Clock::now() does not interact with the file or the display, then reordering cannot be done, because interaction with a file or display is observable behavior.

While you can use compiler-specific hacks to force code not to move around (like inline assembly), another approach is to attempt to outsmart your compiler.

Create a dynamically loaded library. Load it prior to the code in question.

That library exposes one thing:

namespace details {
  void execute( void(*)(void*), void *);
}

and wraps it like this:

template<class F>
void execute( F f ) {
  struct bundle_t {
    F f;
  } bundle = {std::forward<F>(f)};

  auto tmp_f = [](void* ptr)->void {
    auto* pb = static_cast<bundle_t*>(ptr);
    (pb->f)();
  };
  details::execute( tmp_f, &bundle );
}

which packs up a nullary lambda and uses the dynamic library to run it in a context that the compiler cannot understand.

Inside the dynamic library, we do:

void details::execute( void(*f)(void*), void *p) {
  f(p);
}

which is pretty simple.

Now to reorder the calls to execute, it must understand the dynamic library, which it cannot while compiling your test code.

It can still eliminate foo()s with zero side effects, but you win some, you lose some.

Tremendous answered 13/6, 2016 at 14:45 Comment(3)

"another approach is to attempt to outsmart your compiler" If that phrase isn't a sign of having gone down the rabbit hole, I don't know what is. :-) – Celt 13/6, 2016 at 14:54

I think it might be helpful to note that the time required for a block of code to execute is not considered an "observable" behavior which compilers are required to maintain. If the time to execute a block of code were "observable", then no forms of performance optimization would be permissible. While it would be helpful for C and C++ to define a "causality barrier" which would require a compiler to hold off on executing any code after the barrier until all side-effects from before the barrier had been handled by the generated code [code which wants to ensure that data has fully... – Campeche 13/6, 2016 at 18:12

...propagated through hardware caches would need to use hardware-specific means to do that, but a hardware-specific means of waiting until all posted writes were complete would be useless without a barrier directive to ensure that all pending writes tracked by the compiler must get posted to the hardware before the hardware is asked to ensure that all posted writes are complete.] I know of no way of doing that in either language without using a dummy volatile access or call to outside code. – Campeche 13/6, 2016 at 18:15

No it can't. According to the C++ standard [intro.execution]:

14 Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.

A full-expression is basically a statement terminated by a semicolon. As you can see the above rule stipulates statements must be executed in order. It is within statements that the compiler is allowed more free rein (i.e. it is under some circumstance allowed to evaluate expressions that make up a statement in orders other than left-to-right or anything else specific).

Note the conditions for the as-if rule to apply are not met here. It is unreasonable to think that any compiler would be able to prove that reordering calls to get the system time would not affect observable program behaviour. If there was a circumstance in which two calls to get the time could be reordered without changing observed behaviour, it would be extremely inefficient to actually produce a compiler that analyses a program with enough understanding to be able to infer this with certainty.

Clairvoyance answered 13/6, 2016 at 9:56 Comment(27)

There's still the as-if rule though – Napkin 13/6, 2016 at 9:58

By as-if rule compiler can do anything to code as long as it does not change observable behavior. Time of execution is not observable. So it can reorder arbutrary lines of code as long as result would be same (most compiler do sensible thing and not reorder time calls, but it is not required) – Hickey 13/6, 2016 at 10:0

Time of execution is not observable. This is quite strange. From a practical, non-technical point of view, time of execution (a.k.a. "performance") is very observable. – Oscaroscillate 13/6, 2016 at 10:6

@FrédéricHamidi: But it's not a metric by which C++ defines program behaviour. – Anselmi 13/6, 2016 at 10:6

Depends on how you measure time. It is not possible to measure the number of clock cycles taken to execute some body of code in standard C++. – Seeress 13/6, 2016 at 10:7

Problem is that the "as if" rule means that this answer doesn't always address the question. The compiler is allowed to reorder statements in order to optimise execution time, as long as the observable output of the program (e.g. to files) doesn't change. – Seeress 13/6, 2016 at 10:8

I don't buy the reasoning for the "as if' rule -- the same logic would imply that the compiler can't invoke it to do any optimization between two calls to the clock either, or really, to do any optimizations at all. – Annapolis 13/6, 2016 at 10:24

@Hurkyl No, I would put it differently - any optimisation between clock calls stays between clock calls. – Clairvoyance 13/6, 2016 at 10:29

Execution time is not observable so "observable behaviour clearly would be affected" is wrong. In addition I would expect, say Microsoft compiler compiling for Win system to have an extensive knowledge about system calls and be able to optimise them, so previous paragraph is questionable too. – Hickey 13/6, 2016 at 10:37

@Hickey - fine, I have to agree that I haven't justified the statement "observable behaviour would be affected" enough, so I removed it. However, the rest of the point still holds. I would be very surprised if any compiler allowed itself this optimisation. – Clairvoyance 13/6, 2016 at 10:46

Note I'm not claiming it is theoretically impossible, I'm talking about a practical point here. – Clairvoyance 13/6, 2016 at 10:46

I agree about impracticality of such behavior. I was talking from language lawyer position. – Hickey 13/6, 2016 at 10:49

@revolver It does not. Microsoft's compiler assumes nothing about calls to Win32 API functions. It is a bit smarter about library functions, providing intrinsics and inlining code, etc., but GCC and Clang both do exactly the same thing there. It could not possibly do so, otherwise a compiler would be tied to a particular version of the operating system. – Celt 13/6, 2016 at 10:50

@Cody Microsoft compilers are tied to a particular version of operating system. For example, you cannot generate a Win16 application with the latest VS. Worse yet, you probably will have a hard time making the latest VS generate something that would run on NT 4. Hell, I think they already dropped XP, and that is NT 5. – Eurythmic 13/6, 2016 at 12:58

@dba You're mixing a few things together. The linker can no longer generate Win16 applications, that's true enough, but that's because they have removed support for generating that type of binary. WIn16 apps don't use the PE format. That doesn't imply that either the compiler or linker has special knowledge about the API functions. The other issue is related to the runtime library. There is absolutely no problem getting the latest version of MSVC to generate a binary that runs on NT 4. I have done it. The problem comes as soon as you try to link in the CRT, which calls functions not available. – Celt 13/6, 2016 at 13:1

@Cody doesn't the CRT get linked dynamically before execution on the target platform? Anyway, I was not claiming what I just said necessarily implies MSVC assumes something about the calls to Win32, I am just showing the contrary to the consequent is not implied by what you said, by counterexample. – Eurythmic 13/6, 2016 at 13:12

@dba The CRT can be linked either statically or dynamically, but that is not relevant. The problem is that there are different versions of the CRT, and newer versions of the CRT call Windows API functions that are not available on older versions of the OS. Like the version of the CRT bundled with VS 2010 calls EncodePointer, which doesn't exist on Win 2K. If you linked the CRT statically, then it would fail because EncodePointer doesn't exist. If you linked it dynamically, then it would fail because the CRT DLL is missing. What I'm saying is that you have not provided a counterexample. – Celt 13/6, 2016 at 13:14

If you use a different version of the CRT, one that supports Win NT 4, then it will work fine, no matter which version of the compiler that you compile it with. Or if you stub out EncodePointer, providing your own implementation of it. From the compiler's perspective, the CRT, the Windows API, and your own functions are all black boxes. It has no special knowledge about what they do or what they mean. They're just functions that conform to a particular ABI. – Celt 13/6, 2016 at 13:16

@Cody you have now just produced an elaboration on my counterexample to the belief that (especially Microsoft's) compilers are not tied to operating system versions (like, come on, MSVC is limited to one flavour of NT (well, not true anymore since they have Windows Phone and Surface and so on, which are ARM)). You cannot prove MSVC assumes nothing about calls to Win32 API function from a belief for which a valid counterexample exists. – Eurythmic 13/6, 2016 at 13:21

@dba I guess I don't understand what you are saying. I still feel like you are mixing together multiple concepts. The linker is designed to generate a single type of binary, a binary format supported only by Windows NT. Same for the x86 instruction set. But that has nothing to do with having special knowledge about API functions and optimizing accordingly. It does not do that. The runtime library uses API functions, but so can your code. The compiler doesn't care; it makes no distinction between your code and the CRT's code. The statement of mine that you quoted is still absolutely valid. – Celt 13/6, 2016 at 13:26

@Cody it's sad you don't understand what am I saying and keep digging into the semantics. I'm not saying what I quoted is wrong, I'm saying its premise is wrong. Additionally I think @revolver's original expectation of compilers made exclusively for a platform being able to figure stuff about that platform out is reasonable and sound, at least theoretically, if not even practically, as previously shown. – Eurythmic 13/6, 2016 at 13:47

@Eurythmic I'm not sure either what you're trying to argue. You clearly stated "Microsoft compilers are tied to a particular version of operating system". This is provably wrong, since you can easily generate binaries with VS2015 that will run on five different operating systems (and VS2008 can generate perfectly fine binaries for Win10 - clearly without a time machine that'd be rather complicated to pull off). – Agueda 13/6, 2016 at 17:29

If your claim is that MSVC has hardcoded information about API calls in it.. extraordinary claims require extraordinary proof. I mean think about the maintenance hassle that would imply and the little advantage it would bring (the kernel and the compiler are written by two completely different teams at MS, they don't even know each other for the most part). Different versions of MSVC have different intrinsics and behavior and the compiler is generally intimately linked to the runtime (you can get around that though) but that's about it. – Agueda 13/6, 2016 at 17:32

+1 for being the only answer so far to refer to the standard (whether or not the interpretation is correct!). – Naranjo 16/6, 2016 at 8:22

@Agueda please re-read my comments starting with the first one which precisely addresses your issue. – Eurythmic 17/6, 2016 at 23:50

On your second comment: no, that is not my claim, please, again, read carefully. I am not stating anything I have not stated before, but here it still goes: @Cody's claim was X. He tried to show X by showing it follows from Y. But that does not work, because Y is not true. I'm not claiming anything about X, I'm only claiming you cannot show X using Y, because Y itself doesn't hold. – Eurythmic 17/6, 2016 at 23:56

Not sure why I needed to get pinged back into this discussion, but if Y is the initial claim that you linked to, I'm pretty sure I have already conclusively disproven that. None of Microsoft's compilers are tied to a specific OS version. The linker is tied to a specific executable format/specification, which happens to only be supported by (32-bit and later) Windows NT, but that has nothing to do with the compiler and requires absolutely no internal knowledge of the OS's workings, since this is an open standard. And the runtime library makes OS-specific calls, but so too does user code. – Celt 19/6, 2016 at 8:33

noinline function + inline assembly black box + full data dependencies

This is based on https://mcmap.net/q/16704/-enforcing-statement-order-in-c but because I didn't see any clear justification of why the ::now() cannot be reordered there, I would rather be paranoid and put it inside a noinline function together with the asm.

This way I'm pretty sure the reordering cannot happen, since the noinline "ties" the the ::now and the data dependency.

main.cpp

#include <chrono>
#include <iostream>
#include <string>

// noinline ensures that the ::now() cannot be split from the __asm__
template <class T>
__attribute__((noinline)) auto get_clock(T& value) {
    // Make the compiler think we actually use / modify the value.
    // It can't "see" what is going on inside the assembly string.
    __asm__ __volatile__ ("" : "+g" (value));
    return std::chrono::high_resolution_clock::now();
}

template <class T>
static T foo(T niters) {
    T result = 42;
    for (T i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    return result;
}

int main(int argc, char **argv) {
    unsigned long long input;
    if (argc > 1) {
        input = std::stoull(argv[1], NULL, 0);
    } else {
        input = 1;
    }

    // Must come before because it could modify input
    // which is passed as a reference.
    auto t1 = get_clock(input);
    auto output = foo(input);
    // Must come after as it could use the output.
    auto t2 = get_clock(output);
    std::cout << "output " << output << std::endl;
    std::cout << "time (ns) "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
              << std::endl;
}

GitHub upstream.

Compile and run:

g++ -ggdb3 -O3 -std=c++14 -Wall -Wextra -pedantic -o main.out main.cpp
./main.out 1000
./main.out 10000
./main.out 100000

The only minor downside of this method is that we add one extra callq instruction over an inline method. objdump -CD shows that main contains:

    11b5:       e8 26 03 00 00          callq  14e0 <auto get_clock<unsigned long long>(unsigned long long&)>
    11ba:       48 8b 34 24             mov    (%rsp),%rsi
    11be:       48 89 c5                mov    %rax,%rbp
    11c1:       b8 2a 00 00 00          mov    $0x2a,%eax
    11c6:       48 85 f6                test   %rsi,%rsi
    11c9:       74 1a                   je     11e5 <main+0x65>
    11cb:       31 d2                   xor    %edx,%edx
    11cd:       0f 1f 00                nopl   (%rax)
    11d0:       48 8d 48 fd             lea    -0x3(%rax),%rcx
    11d4:       48 83 c2 01             add    $0x1,%rdx
    11d8:       48 0f af c1             imul   %rcx,%rax
    11dc:       48 83 c0 01             add    $0x1,%rax
    11e0:       48 39 d6                cmp    %rdx,%rsi
    11e3:       75 eb                   jne    11d0 <main+0x50>
    11e5:       48 89 df                mov    %rbx,%rdi
    11e8:       48 89 44 24 08          mov    %rax,0x8(%rsp)
    11ed:       e8 ee 02 00 00          callq  14e0 <auto get_clock<unsigned long long>(unsigned long long&)>

so we see that foo was inlined, but get_clock were not and surround it.

get_clock itself however is extremely efficient, consisting of a single leaf call optimized instruction that doesn't even touch the stack:

00000000000014e0 <auto get_clock<unsigned long long>(unsigned long long&)>:
    14e0:       e9 5b fb ff ff          jmpq   1040 <std::chrono::_V2::system_clock::now()@plt>

Since the clock precision is itself limited, I think that is unlikely that you will be able to notice the timing effects of one extra jmpq. Note that one call is required regardless since ::now() is in a shared library.

Call ::now() from inline assembly with a data dependency

This would be the most efficient solution possible, overcoming even the extra jmpq mentioned above.

This is unfortunately extremely hard to do correctly as shown at: Calling printf in extended inline ASM

If your time measurement can be done directly in inline assembly without a call however, then this technique can be used. This is the case for example for gem5 magic instrumentation instructions, x86 RDTSC (not sure if this is representative anymore) and possibly other performance counters.

Related threads:

Is it legal for a C++ optimizer to reorder calls to clock()?

Tested with GCC 8.3.0, Ubuntu 19.04.

Flake answered 3/7, 2019 at 8:14 Comment(4)

You normally don't need to force a spill/reload with "+m", using "+r" is a much more efficient way to make the compiler materialize a value and then assume the variable has changed. – Toxoplasmosis 24/10, 2019 at 11:59

@PeterCordes And __volatile__ may be unnecessary. Independently discovered that asm("":"+v"(value)); is enough to prevent double value computation from being reordered around this statement. No non-inline function is necessary either. – Karleen 10/3 at 17:8

@MaximEgorushkin: There's no data dependency from value to chrono::...::now() so a non-inline function or at least an asm with a "memory"-clobber is necessary. (now() itself being non-inline means the compiler has to respect its ordering wrt. an asm statement that reads+writes any/all globally-reachable memory. As long as that asm statement isn't optimized away, which it could be if it's not volatile and value isn't used later.). If you do use value later, you can indeed omit volatile, and yes a "memory" clobber is more efficient than wrapping with a non-inline func. – Toxoplasmosis 10/3 at 17:59

@PeterCordes You are quite right. My comment was about a tangential issue when computing and loading value was reordered to happen much earlier which resulted in sub-optimal instruction scheduling in my case, solved by declaring a sole read-write register constraint on value and nothing else. – Karleen 10/3 at 20:53

-1

No.

Sometimes, by the "as-if" rule, statements may be re-ordered. This is not because they are logically independent of each other, but because that independence allows such a re-ordering to occur without changing the semantics of the program.

Moving a system call that obtains the current time obviously does not satisfy that condition. A compiler that knowingly or unknowingly does so is non-compliant and really silly.

In general, I wouldn't expect any expression that results in a system call to be "second-guessed" by even an aggressively optimizing compiler. It just doesn't know enough about what that system call does.

Anselmi answered 13/6, 2016 at 10:5 Comment(8)

I agree that it would be silly, but I wold not call it non-conformant. Compiler can have knowledge what system call on concrete system exactly does and if it has side effects. I would expect compilers to not reorder such call just to cover common use case, allowing for better user experience, not because standard prohibits it. – Hickey 13/6, 2016 at 10:12

Might want to note that it is possible to reduce the odds of the compiler reordering statements. For example, calling a function that is defined in another compilation reduces the odds of statements before the function call being shifted so they are executed after - or vice versa. Similarly with inline assembler (although I've seen mention that gcc is smart enough to understand inline assembler in some cases). The only way to be sure, however, is to examine code emitted by the compiler. – Seeress 13/6, 2016 at 10:15

@Seeress gcc never analyzes inline assembly code. It can only do some optimizations based on data in extended asm syntax — clobber list, inputs, outputs etc. – Tanyatanzania 13/6, 2016 at 11:5

@Revolver_Ocelot: Optimisations that change the semantics of the program (okay, save for copy elision) are non-compliant to the standard, whether you agree or not. – Anselmi 13/6, 2016 at 11:31

In the trivial case of int x = 0; clock(); x = y*2; clock(); there are no defined ways for the clock() code to interact with the state of x. Under the C++ standard, it doesn't have to know what clock() does -- it could examine the stack (and notice when the computation occurs), but that isn't C++'s problem. – Tremendous 13/6, 2016 at 14:33

To take Yakk's point further: it's true that re-ordering the system calls, so that the result of the first is assigned to t2 and the second to t1, would be non-conforming and silly if those values are used, what this answer misses is that a conforming compiler can sometimes re-order other code across a system call. In this case, provided it knows what foo() does (for example because it has inlined it) and hence that (loosely speaking) it's a pure function then it can move it around. – Geraldina 13/6, 2016 at 22:38

.. again loosely speaking, this is because there's no guarantee that the actual implementation (albeit not the abstract machine) won't speculatively calculate y*y before the system call, just for fun. There is also no guarantee that the actual implementation won't use the result of this speculative calculation later at whatever point x is used, therefore doing nothing between the calls to clock(). The same goes for whatever an inlined function foo does, provided it has no side-effects and cannot depend on state that might be altered by clock(). – Geraldina 13/6, 2016 at 22:45

Can you back up Moving a system call that obtains the current time obviously does not satisfy that condition (i.e. without changing semantics) . A compiler that … does so is non-compliant … . with a quote? It all depends what the standard means by “semantics” (!) (if it uses that term); I agree that time should be considered observable but the problem is, I think, that foo() means “make the results defined according to the standard by the implementation of foo available to ‘subsequent’ code” — and results defined according to the standard tend not to include elapsing of time. – Naranjo 16/6, 2016 at 8:31

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags