I have this C:
#include <stddef.h>
size_t findChar(unsigned int length, char* __attribute__((aligned(16))) restrict string) {
for (size_t i = 0; i < length; i += 2) {
if (string[i] == '[' || string[i] == ' ') {
return i;
}
}
return -1;
}
It checks every other character of a string and returns the first index of the string that is [
or
. With x86-64 GCC 10.2 -O3 -march=skylake -mtune=skylake
, this is the assembly output:
findChar:
mov edi, edi
test rdi, rdi
je .L4
xor eax, eax
.L3:
movzx edx, BYTE PTR [rsi+rax]
cmp dl, 91
je .L1
cmp dl, 32
je .L1
add rax, 2
cmp rax, rdi
jb .L3
.L4:
mov rax, -1
.L1:
ret
It seems like it could be optimized significantly, because I see multiple branches. How can I write my C so that the compiler optimizes it with SIMD, string instructions, and/or vectorization?
How do I write my code to signal to the compiler that this code can be optimized?
Interactive assembly output on Godbolt: https://godbolt.org/z/W19Gz8x73
Changing it to a VLA with an explicitly declared length doesn't help much: https://godbolt.org/z/bb5fzbdM1
This is the version of the code modified so that the function would only return every 100 characters: https://godbolt.org/z/h8MjbP1cf
i < length
,string[i] == '['
, andstring[i] == ' '
. Are any of them optional? – Kattiestrlen
orstrchr
in libc. Is it safe to read past the end of a buffer within the same page on x86 and x64?. This is actually just a missed optimization in GCC/clang. ICC does know how to auto-vectorize loops whose trip-count can't be calculated ahead of time (e.g. search loops) – Brebnerconst char str[length]
as a function arg still doesn't promise the compiler it can touch memory other than what the abstract machine does. It's still exactly equivalent toconst char *str
. C99 does have syntax likeconst char str[static 100]
which might even work with a variable length, but IIRC GCC doesn't usually take advantage anyway. (What is the purpose of static keyword in array parameter of function like "char s[static 10]"?) – Brebner_mm_cmpeq_epi8
/_mm_movemask_epi8
, or are you still looking to keep it portable? – Brebneruint64_t
-size chunks, using portable C code? – Reprovableuint64_t
safely; you need eithermemcpy
or a typedef with GNU C__attribute__((may_alias))
, like shown in Why does glibc's strlen need to be so complicated to run quickly? where my answer shows how to fix the strict-aliasing bugs in glibc's portable-C fallback version. So you might want to link or reference that if you're planning to write a version based on graphics.stanford.edu/~seander/bithacks.html#ZeroInWord – Brebner(void *)
anduintptr_t
should work, I would think? I will look at your first link. – Reprovable__attribute((aligned(1), may_alias))
. Accessing a char object through auint64_t*
dereference is UB. (That could happen if passed a pointer to achar array[64]
array object for example. Dynamically allocated memory is anonymous, no known type, so accessing viauint64_t*
here andchar*
everywhere else would be fine because char* can alias anything. But named variables have types. Still maybe unlikely to cause a problem in practice if array access works like char*, not sure) – Brebnerchar
is gcc, strict-aliasing, and horror stories. BTW, Intel intrinsics are defined to avoid these problems: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? – Brebnerchar *
to begin with. Casting viavoid *
used to work just fine to tell the compiler to forget the type a pointer used to point to (leaving issues about making sure accesses are naturally aligned), and is standard compliant by my reading of the C standard. But I didn't go to language-lawyer school. [Later:] Read your link. I am with Linus on this one. – Reprovable[
and ` ` is an optimized case. However, I could also split it into two single character searches of-
and[
looking at every char, with ~100 between. For finding both[
and `, 100 characters looking at every other char, but every 4-5 chars, it will find a
` and need to execute an additional check. I am willing to rewrite. – Violoncellostrchr()
, which presumably is highly optimized? Assuming your strings are not overly long, the second call would benefit from the first call pulling the data into caches. The above comments by Peter Cordes have convinced me that it would be a royal pain to try and write fast string functions in C instead of assembly language, at least when using gcc. – Reprovablechar*
function arg has to be pointing to something, and it's UB if that's an object that's definitely notuint64_t
, unless you usememcpy
intouint64_t
instead of deref ofuint64_t*
, or a typedef. (In practice this matters after function inlining; e.g. violating strict aliasing can mean it's not safe to compile with LTO). Just for example, say the original data was an array ofunsigned short
and that's why only every 2nd char matters. After inlining, the compiler can assume thatuint64_t*
derefs aren't reading the same data thatmy_u16[i] = '['
wrote. – Brebnerchar*
orvoid*
function arg doesn't "launder" it in terms of removing strict-aliasing UB; it doesn't make it safe to deref it as types other than the original. That's true even if the original data was achar buf[100]
. It would be fine if you hadchar *buf = malloc(100);
though, because then the only accesses to it would be viachar*
or in fast-strings functions, as long as all your fast-strings stuff uses the same type. (char*
is allowed to alias anything, like__m128i*
, ormy_aliasing_u64*
with a GNU C typedef.) – Brebneruint64_t aliasing_u64_load(void *p) { uint64_t tmp; memcpy(tmp, p, sizeof(tmp)); return tmp; }
. (That also makes unaligned loads safe, so GCC won't always inline it as a single load instruction if it can't prove alignment, on ISAs where unaligned word loads aren't safe.) – Brebner