Google Benchmark Framework's DoNotOptimize

I am a bit confused about the implementation of the function void DoNotOptimize of the Google Benchmark Framework (definition from here):

template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
#if defined(__clang__)
  asm volatile("" : "+r,m"(value) : : "memory");
#else
  asm volatile("" : "+m,r"(value) : : "memory");
#endif
}

So it materializes the variable, and if non-constant, also tells the compiler to forget anything about its previous value. ("+r" is an RMW operand).

And it also always uses a "memory" clobber, which is a compiler barrier against reordering loads/stores: it makes sure all globally-reachable objects have their memory in sync with the C++ abstract machine, and it assumes they might also have been modified.
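
For instance (a minimal sketch of my own, with a made-up global counter and an empty asm statement standing in for the clobber):

int counter;  // globally reachable

void tick() {
    ++counter;                      // this store must actually happen...
    asm volatile("" ::: "memory");  // ...before the barrier, and counter must be
    ++counter;                      // reloaded afterwards instead of folding into += 2
}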


I am far from being an expert in low-level code, but as far as I understand the implementation, the function serves as a read/write barrier. So, basically, it ensures that the value passed in is either in a register or in memory.

While this seems entirely reasonable if I want to preserve the result of a function (the one being benchmarked) in general, I am a bit surprised by the degree of freedom left to the compiler.

My understanding of the given code is that the compiler may insert a materialization point whenever DoNotOptimize is called, which would imply a notable amount of overhead when it is executed repeatedly (e.g. in a loop). When the value that should not be optimized out is just a single scalar, it seems sufficient for the compiler to ensure that the value resides in a register.
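
For example, in a hypothetical benchmark loop (BM_Accumulate and acc are just placeholder names):

#include <benchmark/benchmark.h>

static void BM_Accumulate(benchmark::State& state) {
    int acc = 0;
    for (auto _ : state) {
        acc += 42;
        benchmark::DoNotOptimize(acc);  // materializes acc every iteration and
                                        // acts as a compiler barrier via "memory"
    }
}
BENCHMARK(BM_Accumulate);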

Wouldn't it be a good idea to distinguish between pointers and non-pointers for instance:

#include <type_traits>

template< class T >
inline __attribute__((always_inline))
void do_not_optimize( T&& value ) noexcept {
    // strip the reference that forwarding-reference deduction adds for lvalues
    if constexpr( std::is_pointer_v< std::remove_reference_t< T > > ) {
        asm volatile("" : "+m"(value) : : "memory");
    } else {
        asm volatile("" : "+r"(value));
    }
}
Feretory answered 25/3, 2021 at 8:4 Comment(2)
Please provide an example where it currently doesn't work as expected. – Stringer
It is more or less a general question: *whether* it could happen that the register is spilled (which would lead to unnecessary overhead that is not related to the benchmark). – Feretory

You're wondering about the "memory" clobber? Yeah, that can cause other things to be spilled, but sometimes that's what you want between iterations of something you're trying to wrap a repeat loop around.

Note that a "memory" clobber doesn't affect objects that aren't possibly reachable from global variables. (Escape analysis). So it won't cause stuff like the loop counter in for(int i = ...) to be spilled/reloaded.
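
A sketch of that distinction (my own example; g, local and demo are placeholder names):

int g;  // globally reachable, so the "memory" clobber applies to it

void demo(int n) {
    int local = 0;                      // address never escapes
    for (int i = 0; i < n; ++i) {
        local += g;
        asm volatile("" ::: "memory");  // g must be reloaded every iteration,
                                        // but i and local can stay in registers
    }
    g = local;
}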

Materializing the value of the specified variable in a register (and forgetting about its value for constant-propagation or CSE purposes) is exactly the point of this function, and is cheap. Unless stuff really is optimizing away, the value will already be in a register.

(Unless it's a case like tmp1 = a+b; / tmp2 = tmp1+c; where the compiler would rather do b+c first. In that case, forcing tmp1 to be materialized would force it to actually do a+b. Normally this isn't an issue because people generally don't use DoNotOptimize on temporaries that are part of a larger calculation.)
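
For instance (a contrived sketch of that aside; sum3 and the variable names are made up):

#include <benchmark/benchmark.h>

int sum3(int a, int b, int c) {
    int tmp1 = a + b;                // without the next line the compiler could
    benchmark::DoNotOptimize(tmp1);  // reassociate to a + (b + c) and never form a+b
    int tmp2 = tmp1 + c;
    return tmp2;
}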


I think it's intentional that this errs on the side of blocking more things, like hoisting of loop-invariant loads and other CSE or strength-reduction across iterations of a benchmark's repeat loop. It's pretty common to see people use benchmark::DoNotOptimize() on just the final result of a computation or something; if it didn't have a "memory" clobber, it would be even less likely to stop the compiler from preparing the value (or some invariant parts of it) once and then just moving it into a register every iteration to materialize it.
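
For example (a hypothetical loop; scale, x and BM_Scale are placeholder names):

#include <benchmark/benchmark.h>

double scale = 1.5;  // globally reachable

static void BM_Scale(benchmark::State& state) {
    double x = 2.0;
    for (auto _ : state) {
        // With the "memory" clobber, scale has to be reloaded and the multiply
        // redone every iteration; with only a register constraint the whole
        // loop-invariant computation could be hoisted out of the loop.
        benchmark::DoNotOptimize(scale * x);
    }
}
BENCHMARK(BM_Scale);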

People that understand exactly what they're trying to benchmark well enough to be checking on the compiler-generated asm certainly might want to use asm("" : "+g"(var)); to make the compiler materialize it and forget what it knows about the value, without triggering any spilling of other globals.
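
Something along these lines (a sketch; compute() is just a stand-in for the code being timed):

long compute() { return 42; }   // stand-in for the real work being measured

void bench_once() {
    long result = compute();
    // Materialize result and forget what is known about it, but with no
    // "memory" clobber, so unrelated globals aren't spilled/reloaded.
    // volatile keeps the asm from being removed if result is otherwise unused.
    asm volatile("" : "+g"(result));
}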

(The "+r,m" is a workaround for clang which tends to invent a memory temporary for "+rm" or "+g". GCC picks register when it can.)


"+m" for pointers

Nope, that would force the compiler to spill the pointer value itself, which you don't want. You only want to make sure the pointed-to memory is also in sync, in case that's what a user expects, so a "memory" clobber makes sense there.

Or the other way without a "memory" clobber:

asm volatile("" : "+r"(ptr), "+m"(*ptr));

Or for a whole array of pointed-to objects (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?):

// deref pointer-to-array of unspecified size
asm volatile("" : "+r"(ptr), "+m"( *(T (*)[]) ptr  );

But if ptr is NULL, either of these may break, so it's not safe for a generic definition to do this for all pointers.

Using these manually, you might leave out the + on either the pointer itself (in a register) or on the pointed-to memory, to just force the value to be materialized without making the compiler forget about it afterwards.
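
For instance (my sketch of the input-only variant; materialize_only is a made-up name):

void materialize_only(int *ptr) {
    // "r"/"m" instead of "+r"/"+m": ptr is forced into a register and *ptr is
    // forced to be up to date in memory, but the compiler may keep using what
    // it already knows about both values afterwards.
    asm volatile("" : : "r"(ptr), "m"(*ptr));
}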

You might also omit the "+r"(ptr) operand and just make sure the pointed-to memory is in sync without forcing the exact pointer to exist in a register. The compiler still has to be able to generate an addressing mode referencing the memory, and you can see what it picked by having the asm template expand the operand:

asm( "nop  # mem operand picked %0" : "+m" (*ptr) );

You don't need a nop; it can be a pure asm comment line like # hi mom, operand at %0. But the Godbolt compiler explorer (https://godbolt.org/z/doPGsse9c for this example) filters comments by default, so it's convenient to use an instruction. It doesn't even have to be a valid instruction, though, if you just want to look at GCC's asm output: e.g. nop # mem operand picked 40(%rdi) for int *ptr = func_arg+10;.
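
Put together, such an example might look like this (a sketch; peek is a made-up name and it may not match the linked Godbolt exactly):

void peek(int *func_arg) {
    int *ptr = func_arg + 10;
    // GCC folds the offset into the addressing mode, so the comment in the
    // output shows e.g. "nop  # mem operand picked 40(%rdi)" on x86-64.
    asm("nop  # mem operand picked %0" : "+m"(*ptr));
}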

GCC's asm templates are purely a text substitution, like printf, putting text into the output file at the position where GCC chooses to expand the asm statement. Clang is different, though; it has a built-in assembler that operates on inline asm.

Herzegovina answered 25/3, 2021 at 19:1 Comment(1)
Update: How does Google's `DoNotOptimize()` function enforce statement ordering – the "memory" clobber stops DoNotOptimize from reordering with time(), for example. Perhaps that's part of why they included it? – Herzegovina
