First of all, f1
stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1
doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8]
(but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]};
would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1
, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2
could have been optimized the same as f3
, to a qword cmp
, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2
, thanks @Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp
if that's what you want; apparently most compilers don't spend time looking for patterns like f2
. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy
to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
Or in GNU C++, use a typedef to express unaligned may-alias loads:
bool f4(const char *ptr) {
typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
auto val = *(const aliasing_unaligned_u64*)ptr;
return val != 0;
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
cmp QWORD PTR [rdi], 0
setne al
ret
Casting to uint64_t*
would potentially violate alignof(uint64_t)
, and probably violate the strict-aliasing rule unless the actual object pointed to by the char*
was compatible with uint64_t
.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps
or other problems can happen with real compilers in corner cases.
return ! std::any_of(ptr, ptr+8)
. – Northeasterly