GCC seems to prefer small immediate values in comparisons. Is there a way to avoid that?

Asked 30/11, 2019 at 15:30 Answered 1/12, 2019 at 6:34

Firstly a trivial mathematical fact: given integers n and m, we have n < m if, and only if, n <= m - 1.

GCC seems to prefer immediate values of smaller absolute value. Hence, when m is known and other conditions are met, the compiler chooses among equivalent comparison expressions the one minimizing absolute values. For instance, it prefers n <= 1000 over n < 1001 and GCC 9.2 translates this

bool f(uint32_t n) {
  return n < 1001;
}

into this x86 assembly code

f(unsigned int):
  cmpl $1000, %edi
  setbe %al
  ret

There might be good performance reasons for that but that's not my question. What I'd like to know is this: Is there a way to force GCC to keep the original comparison? More specifically, I'm not worried about portability and thus, GCC specifics (options, pragmas, attributes, ...) are OK for me. However, I'm looking for a constexpr friendly solution which seems to exclude inline asm. Finally, I'm targeting C++17 which excludes things like std::is_constant_evaluated. (Having said that, please, fell free to provide answers regardless of my constraints because it might still be useful for others.)

You might ask why I want to do such thing. Here we go. To my understanding (please, correct me if I'm wrong) this behavior might be a "pessimization" for x86_64 in the following example:

bool g(uint64_t n) {
  n *= 5000000001;
  return n < 5000000001;
}

which is translated by GCC 6.2 into

g(unsigned long):
  movabsq $5000000001, %rax
  imulq %rax, %rdi
  movabsq $5000000000, %rax
  cmpq %rax, %rdi
  setbe %al
  ret

In x86_64, computations with 64-bits immediate values have some restrictions that might imply these values to be loaded into registers. In the above example, this happens twice: constants 5000000001 and 5000000000 are stored in rax for the multiplication and the comparison. Had GCC kept the original comparison as it appears in the C++ code (i.e., against 5000000001) there would be no need for the second movabs.

This also implies a code size penalty which, I guess, was considered an issue and more recent versions of GCC (e.g. 9.2) produce this:

g(unsigned long):
  movabsq $5000000001, %rax
  imulq %rax, %rdi
  subq $1, %rax
  cmpq %rax, %rdi
  setbe %al
  ret

Hence the 10-bytes long movabs was replaced by a 4-bytes long subq instruction. In any case, subq also seems unnecessary.

Derrik answered 30/11, 2019 at 15:30 Comment(5)

Report a gcc missed-optimization bug on gcc.gnu.org/bugzilla. Hand-holding the compiler with source tricks is potentially useful in the short term, but the real fix is for GCC to consider both options for CSE when it doesn't fit in an immediate (or always; a reg operand takes less space if it's already in a reg for another reason). This optimization helps gcc use an imm8 instead of imm32 whenever possible, or a sign_extended_imm32 instead of movabs. But when it's not on the cusp there, it's not useful. – Exteriorize 30/11, 2019 at 15:47

@Peter I couldn't agree more and that's the reason I, temporarily, want to use a source trick. In the missed-optimization report I want to include measurements showing that the code generated using the source trick is indeed faster than the one not using it. I hope to convince compiler writers to turn "the trick" into a real fix. For the time being, I'm using inline asm to prove my point. (Actually, the missed-optimization problem that I have at hand is much broader and this comparison is only a piece of it.) – Derrik 30/11, 2019 at 16:8

Hand-edited asm is an easier way to prove your point if you want to micro benchmark. GCC devs would see this as an obvious problem worth fixing even without that. But sure, a real benchmark demonstrating a significant slowdown might motivate someone to spend time on it sooner. – Exteriorize 30/11, 2019 at 19:1

fwiw, gcc 7.1 and later replace the second mov in the multiplication example with sub immediate 1. clang reuses the value as you wish. – Selfinsurance 1/12, 2019 at 7:28

Re: why GCC prefers smaller immediates: Understanding gcc output for if (a>=3) as in my earlier comment. – Exteriorize 29/10, 2021 at 8:43

Here is a C++20 solution which, sadly, I cannot use

#include <cstdint>
#include <type_traits>

template <class U>
bool less_asm(U n, U m) noexcept {
  bool r;
  asm("cmp%z[m]\t{%[m], %[n]|%[n], %[m]}"
    : "=@ccb"(r) : [n]"r"(n), [m]"re"(m) : "cc");
  return r;
}

template <class U>
constexpr bool less(U n, U m) noexcept {
  if (std::is_constant_evaluated())
    return n < m;
  return less_asm(n, m);
}

static_assert(less(uint64_t(0), uint64_t(1)));

bool g(uint64_t n) {
  n *= 5000000001;
  return less<uint64_t>(n, 5000000001);
}

GCC 9.2 (with -O2 -std=c++2a) generates this:

g(unsigned long):
  movabsq $5000000001, %rax
  imulq %rax, %rdi
  cmpq %rax, %rdi
  setc %al
  ret

Update: The snippet below shows two improvements:

It works with C++17 but it requires GCC-9.1 or later. (Thanks to Peter Cordes' suggestion to use __builtin_constant_p() instead of std::is_constant_evaluated(). Prior versions of GCC complain about asm appearing in a constexpr function.)
It introduces fewer names in the scope. (By using a lambda instead of less_asm.)

template <class U>
constexpr bool less(U n, U m) noexcept {
  if (__builtin_constant_p(n < m))
    return n < m;
  return [&]{
    bool r;
    asm("cmp\t{%[m], %[n]|%[n], %[m]}"
      : "=@ccb"(r) : [n]"r"(n), [m]"re"(m) : "cc");
    return r;
  }();
}

Derrik answered 30/11, 2019 at 15:30 Comment(4)

Instead of std::is_constant_evaluated() you can use GNU C __builtin_constant_p(n < m) to ask if an expression has a compile-time-constant value (after constant-propagation optimization), regardless of C++ constexpr. – Exteriorize 30/11, 2019 at 18:59

@PeterCordes I've tried that before and it didn't work. It has even confused me on my understanding of std::is_constant_evaluated and __builtin_constant_p. Now it does work (since GCC 9.1). Consider writing this as an answer. – Derrik 30/11, 2019 at 19:35

Did you forget to enable optimization the previous time? Without optimization, __builtin_constant_p(var) will always be false (except for const) because GCC doesn't optimize across statements at -O0. I don't really know what std::is_constant_evaluated() is for. I don't think this really needs a separate answer; it would just be a minor tweak to yours that already uses inline asm, which I consider the main point of the answer if you just want to demo in a benchmark. gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html – Exteriorize 30/11, 2019 at 19:43

@PeterCordes: No, I haven't forget. The problem before was the compiler support. It changed from version 8.3 to 9.1 as one can see here: godbolt.org/z/Q9-E-S (I've first considered the question when the latest version of GCC was 8.2.) – Derrik 1/12, 2019 at 11:46

In general the only way to force GCC to output a particular instruction sequence that you might want is to use assembly. If GCC isn't generating optimal code, the best thing to do is to submit a bug report so hopefully it'll be improved in future versions.

For your specific case there's at least one work around that can generate the code you want, at least for current compilers. While it doesn't require writing the entire code sequence in assembly, it does use inline assembly to load the constant into a register on the assumption that GCC will prefer using this register rather than generating a new constant value for the comparison.

#include <stdint.h>

static uint64_t 
load_const_asm(uint64_t c) {
    if (__builtin_constant_p(c) && (int32_t) c != c) {
           asm("" : "+r" (c));
    }
    return c;
}

bool
g(uint64_t n) {
    uint64_t c = load_const_asm(5000000001);
    n *= c;
    return n < c;
}

which generates the following code when compiled with -O on GCC 9.2:

_Z1gm:
  movabs rax, 5000000001
  imul rdi, rax
  cmp rdi, rax
  setb al
  ret

This asm statement tricks the compiler in not generating a separate load of the constant minus one because the compiler doesn't know what asm statement does, even though its just an empty string. It forces the constant into a register, but the compiler has no idea what the register will contain after the asm statement "executes".

The advantage of doing it this way, over using an asm statement with a CMP instruction, is that it gives the compiler the most freedom optimize your code. This is a big disadvantage of inline assembly generally, it's easy for it to end up pessimizing your code. For example, using inline assembly prevents the compiler from calculating the result of g() at compile time if n is a constant. However with my code example above, it's still able to figure out that g() must return true if n is 1.

Finally, make sure that you're not trying to prematurely optimize your code. You don't want to be obfuscating your code like this if it's not going to make any noticeable different to performance your application. If you do end up using this code, or some other hack, then as Peter Cordes said in a comment, you should clearly comment your code to explain why the hack is necessary so it can be removed when it's no longer needed.

Billingsgate answered 1/12, 2019 at 6:34 Comment(9)

IIRC, mov $imm, %r64 will use movabs or movq according to the size of the immediate. So it's less of a pessimization if used with small values. You might also use if (__builtin_constant_p(n <= UINT32_MAX) && n <= UINT32_MAX) to see if a value can be an immediate. Err actually that needs to check for being representable as 32-bit sign_extended, not unsigned. – Exteriorize 1/12, 2019 at 7:1

@PeterCordes I realized minutes after I shutdown my computer for the night that the two asm statements were essentially the same, either way the compiler doesn't know what's in the register after it executes. So, there was no point in using anything other than the first function which leaves it completely up to the compiler how to get the constant into register. – Billingsgate 1/12, 2019 at 17:31

@PeterCordes Ah, I was mostly thinking of this from the perspective of it being only used where the compiler isn't using an immediate. – Billingsgate 1/12, 2019 at 20:33

Cassio's answer handles the case where both operands are compile-time-constants so constant-propagation through the multiply and compare are possible. (But not the case of a small constant against a runtime-variable other operand). But yes, using an empty asm statement to just "launder" the constant for the optimizer makes more sense. Maybe only needed for the operands to < ; you can let the compiler use the number directly for the multiply. – Exteriorize 1/12, 2019 at 20:45

But this is probably fine as a hack to work around a compiler bug as a temporary fix. If anyone uses this, be sure to comment that it's just a workaround for a GCC missed optimization and can be removed once that's fixed. – Exteriorize 1/12, 2019 at 20:47

@PeterCordes If you don't use the same return value of load_const_result in both the multiplication and the comparison then there's an extra MOV instruction inserted because the compiler thinks that the register operand of the asm statement was modified. I'd updated my example to only launder the constant if it's not representable as a 32-bit signed integer. – Billingsgate 1/12, 2019 at 21:32

Oh yeah, c != (int32_t)c, I was overcomplicating that :P. The extra mov is a separate GCC missed optimization bug that I wasn't anticipating. godbolt.org/z/Dr97p8 shows that clang just lets the known constant value get "destroyed" by the result of the asm statement without using mov. I also included a version of g() that checks __builtin_constant_p(n) before "asm laundering" c so there's no need to bother trying to only launder for cmp. I think I was only picturing that to allow possible constprop through multiply. – Exteriorize 1/12, 2019 at 22:10

(I don't feel like writing that up as an answer; if you want to include any/all of that in yours that would be great.) – Exteriorize 1/12, 2019 at 22:12

Imaginative solution! One might think it's useless to force the constant to go to a register since this is exactly what the compiler does without intervention. However, by doing so you change the compiler behavior! Another example where intuition can be misleading. Funnily enough, it feels like using the register keyword as it was firstly imagined when C was designed and before it was made useless by all compilers. (It's now removed from C++17.) – Derrik 2/12, 2019 at 10:49

-3

gcc 9.2.1 on Linux with -O2 gives:

movabsq $5000000001, %rax
imulq   %rax, %rdi
subq    $1, %rax
cmpq    %rax, %rdi
setbe   %al
ret

So there doesn't seem to be anything special you need to do unless you're really worried about that subtraction.

Polyhistor answered 30/11, 2019 at 15:41 Comment(2)

I did show the code above in my original post and as I said, the subtraction could be avoided. – Derrik 30/11, 2019 at 15:47

It's still an extra uop, only saving some of the code-size. (Although movabs is pretty bad, and can be slow to fetch from the uop cache.) – Exteriorize 30/11, 2019 at 15:52

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags