Test of 8 subsequent bytes isn't translated into a single compare instruction

C

3

4

Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):

bool f1(const char *ptr)
{    
  for (int i = 0; i < 8; i++)
    if (ptr[i])
      return false;
  return true;
}

bool f2(const char *ptr)
{  
  bool res = true;
  for (int i = 0; i < 8; i++)
    res &= (ptr[i] == 0);
  return res;
}

bool f3(const char *ptr)
{  
  static const char tmp[8]{};
  return !std::memcmp(ptr, tmp, 8);
}

Though I would expect the same assembly outcome with enabled optimizations, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a winded or unwinded loop. Moreover, this holds for all GCC, Clang, and Intel compilers with -O3.

Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seem to be a pretty straightforward optimization to me.

Live demo: https://godbolt.org/z/j48366

Charades answered 13/8, 2020 at 9:55 Comment(3)

Clang trunk actually does optimise f2 to the same code as the memcmp, but weirdly no other version does – Phenylketonuria 13/8, 2020 at 10:18

There's also return ! std::any_of(ptr, ptr+8). – Northeasterly 13/8, 2020 at 10:40

@Northeasterly Same outcome: godbolt.org/z/Gn3xf4. – Charades 13/8, 2020 at 14:7

P

2

First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.

You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)

GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)

Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks @Tharwen for spotting that.

Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)

The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.

In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.

Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.

Or in GNU C++, use a typedef to express unaligned may-alias loads:

bool f4(const char *ptr) {
   typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val != 0;
}

Compiles on Godbolt with GCC10 -O3:

f4(char const*):
        cmp     QWORD PTR [rdi], 0
        setne   al
        ret

Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.

And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.

https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk that contains at least 1 valid string byte doesn't cross a page boundary.) Also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

Pander answered 13/8, 2020 at 10:6 Comment(0)

X

3

Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with additional unaligned load)? It seem to be a pretty straightforward optimization to me.

In f1 the loop stops when ptr[i] is true, so it is not always equivalent of to consider 8 elements as it is the case with the two other functions or directly comparing a 8 bytes word if the size of the array is less than 8 (the compiler does not know the size of the array) :

f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array

For f2 I agree that can be replaced by a 8 bytes comparison under the condition the CPU allows to read a word of 8 bytes from any address alignment which is the case of the x64 but that can introduce unusual situation as explained in Unusual situations where this wouldn't be safe in x86 asm

Xe answered 13/8, 2020 at 10:6 Comment(11)

I don't think a compiler should worry about UB when doing optimizations. It's a responsibility of a function user to avoid UB. And, what about f2? – Charades 13/8, 2020 at 10:10

I disagree with you, a compiler cannot introduce UB even to optimize. For f2 I agree that can be replaced by a 8 byte comparison under the condition there is no alignment problem, I mean it is 'legal' to read a word of 8 bytes from any address – Xe 13/8, 2020 at 10:11

You're right, I see your point now about UB. That's absolutely true. – Charades 13/8, 2020 at 10:12

@DanielLangr I edited my answer to speak about f2 – Xe 13/8, 2020 at 10:14

@DanielLangr: "introduce UB" is not really the right way to describe it. UB is a C++ abstract machine concept. The relevant concept here is inventing reads, which is only allowed if it's known they won't fault. Unless you check for being within 8 bytes of a page boundary first, you don't know that without knowing alignment. (Since x86-64 memory protection is based on 4k pages as its finest granularity.) Is it safe to read past the end of a buffer within the same page on x86 and x64? - UB in C, well defined in asm and done in practice in strlen – Pander 13/8, 2020 at 10:18

@PeterCordes thank for your remark and the link to your very interesting answer, I edited my answer – Xe 13/8, 2020 at 10:44

f2 can always be compiled to a qword cmp; those caveats apply to f1. In any case where an 8-byte load would fault, the C++ abstract machine would encounter UB because it also touches all 8 bytes. x86-64 never requires alignment for an 8-byte load. (Unless you set the AC alignment-checking flag, but C++ implementations can and do assume you haven't done that, especially hand-written memcpy in libc.) – Pander 13/8, 2020 at 10:49

In fact clang trunk nightly builds do optimize f2 into the same asm as f3. – Pander 13/8, 2020 at 10:56

@PeterCordes not surprising because f2 and f3 have the same (may be bad) behavior in general (I mean independently of the CPU / OS ...) – Xe 13/8, 2020 at 11:1

It's not bad if this function is intended to be called on 8-byte buffers, but you neglected to tell the compiler that (by writing f1(char*) instead of f1(char [static 8]). Err that might only be a C feature, not C++. Too bad, sucks for C++ I guess if you can't tell your compiler useful information). – Pander 13/8, 2020 at 11:10

@PeterCordes: In C++ you can just write ptr[7]; in f1. It's UB when ptr is less than 8 elements. Hence the optimizer can subsequently assume that ptr has at least 8 elements. – Northeasterly 14/8, 2020 at 7:50