Test of 8 subsequent bytes isn't translated into a single compare instruction
Asked Answered
C

3

4

Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):

bool f1(const char *ptr)
{    
  for (int i = 0; i < 8; i++)
    if (ptr[i])
      return false;
  return true;
}

bool f2(const char *ptr)
{  
  bool res = true;
  for (int i = 0; i < 8; i++)
    res &= (ptr[i] == 0);
  return res;
}

bool f3(const char *ptr)
{  
  static const char tmp[8]{};
  return !std::memcmp(ptr, tmp, 8);
}

Though I would expect the same assembly outcome with enabled optimizations, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a winded or unwinded loop. Moreover, this holds for all GCC, Clang, and Intel compilers with -O3.

Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seem to be a pretty straightforward optimization to me.

Live demo: https://godbolt.org/z/j48366

Charades answered 13/8, 2020 at 9:55 Comment(3)
Clang trunk actually does optimise f2 to the same code as the memcmp, but weirdly no other version doesPhenylketonuria
There's also return ! std::any_of(ptr, ptr+8).Northeasterly
@Northeasterly Same outcome: godbolt.org/z/Gn3xf4.Charades
P
2

First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.

You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)


GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)

Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks @Tharwen for spotting that.

Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)

The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.


In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.

Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.

Or in GNU C++, use a typedef to express unaligned may-alias loads:

bool f4(const char *ptr) {
   typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val != 0;
}

Compiles on Godbolt with GCC10 -O3:

f4(char const*):
        cmp     QWORD PTR [rdi], 0
        setne   al
        ret

Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.

And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.

Pander answered 13/8, 2020 at 10:6 Comment(0)
X
3

Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with additional unaligned load)? It seem to be a pretty straightforward optimization to me.

In f1 the loop stops when ptr[i] is true, so it is not always equivalent of to consider 8 elements as it is the case with the two other functions or directly comparing a 8 bytes word if the size of the array is less than 8 (the compiler does not know the size of the array) :

f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array

For f2 I agree that can be replaced by a 8 bytes comparison under the condition the CPU allows to read a word of 8 bytes from any address alignment which is the case of the x64 but that can introduce unusual situation as explained in Unusual situations where this wouldn't be safe in x86 asm

Xe answered 13/8, 2020 at 10:6 Comment(11)
I don't think a compiler should worry about UB when doing optimizations. It's a responsibility of a function user to avoid UB. And, what about f2?Charades
I disagree with you, a compiler cannot introduce UB even to optimize. For f2 I agree that can be replaced by a 8 byte comparison under the condition there is no alignment problem, I mean it is 'legal' to read a word of 8 bytes from any addressXe
You're right, I see your point now about UB. That's absolutely true.Charades
@DanielLangr I edited my answer to speak about f2Xe
@DanielLangr: "introduce UB" is not really the right way to describe it. UB is a C++ abstract machine concept. The relevant concept here is inventing reads, which is only allowed if it's known they won't fault. Unless you check for being within 8 bytes of a page boundary first, you don't know that without knowing alignment. (Since x86-64 memory protection is based on 4k pages as its finest granularity.) Is it safe to read past the end of a buffer within the same page on x86 and x64? - UB in C, well defined in asm and done in practice in strlenPander
@PeterCordes thank for your remark and the link to your very interesting answer, I edited my answerXe
f2 can always be compiled to a qword cmp; those caveats apply to f1. In any case where an 8-byte load would fault, the C++ abstract machine would encounter UB because it also touches all 8 bytes. x86-64 never requires alignment for an 8-byte load. (Unless you set the AC alignment-checking flag, but C++ implementations can and do assume you haven't done that, especially hand-written memcpy in libc.)Pander
In fact clang trunk nightly builds do optimize f2 into the same asm as f3.Pander
@PeterCordes not surprising because f2 and f3 have the same (may be bad) behavior in general (I mean independently of the CPU / OS ...)Xe
It's not bad if this function is intended to be called on 8-byte buffers, but you neglected to tell the compiler that (by writing f1(char*) instead of f1(char [static 8]). Err that might only be a C feature, not C++. Too bad, sucks for C++ I guess if you can't tell your compiler useful information).Pander
@PeterCordes: In C++ you can just write ptr[7]; in f1. It's UB when ptr is less than 8 elements. Hence the optimizer can subsequently assume that ptr has at least 8 elements.Northeasterly
P
2

First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.

You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)


GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)

Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks @Tharwen for spotting that.

Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)

The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.


In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.

Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.

Or in GNU C++, use a typedef to express unaligned may-alias loads:

bool f4(const char *ptr) {
   typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val != 0;
}

Compiles on Godbolt with GCC10 -O3:

f4(char const*):
        cmp     QWORD PTR [rdi], 0
        setne   al
        ret

Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.

And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.

Pander answered 13/8, 2020 at 10:6 Comment(0)
M
-1

You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.

If your compiler does not support uint64_t, you can use unsigned long long* instead:

#include <cstdint>

inline bool EightBytesNull(const char *ptr)
{  
    return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}

Note that this will work on x86, but will not on ARM, which requires strict integer memory alignment.

Mim answered 13/8, 2020 at 10:5 Comment(5)
That violates strict-aliasing and is thus undefined behaviour. Only MSVC defines the behaviour of that pointer-cast, out of mainstream x86 compilers. Do not use this, just use memcmp and let it compile to a single cmp instruction.Pander
I am not sure this is safe, since the memory pointed to by ptr may not be aligned to alignof(uint64_t).Charades
@DanielLangr: Oh yes, it's unsafe for that reason as well. You'd have to use something like GNU C typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias)). See Why does glibc's strlen need to be so complicated to run quickly? for an example of such a typedef.Pander
@DanielLangr and Michael: Misalignment isn't safe on x86 either; auto-vectorization can create code that assumes the ABI-guaranteed alignment. It's much rarer for it to break, but it's not safe. See Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?, and trust-in-soft.com/blog/2020/04/06/…. Since it also violates strict-aliasing, it's very wrong to say "this will work" anywhere except maybe MSVC. You can memcpy 8 bytes into a uint64_t and then compare that.Pander
Oops, meant to link my answer on Why does glibc's strlen need to be so complicated to run quickly?. See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?Pander

© 2022 - 2024 — McMap. All rights reserved.