How can I get the compiler to output faster code for a string search loop, using SIMD vectorization and/or parallelization?

I have this C code:

#include <stddef.h>
size_t findChar(unsigned int length, char* __attribute__((aligned(16))) restrict string) {
    for (size_t i = 0; i < length; i += 2) {  // check every other character
        if (string[i] == '[' || string[i] == ' ') {
            return i;
        }
    }
    return -1;  // wraps to SIZE_MAX to signal "not found"
}

It checks every other character of the string and returns the first index at which the character is '[' or ' ' (space). With x86-64 GCC 10.2 at -O3 -march=skylake -mtune=skylake, this is the assembly output:

findChar:
        mov     edi, edi
        test    rdi, rdi
        je      .L4
        xor     eax, eax
.L3:
        movzx   edx, BYTE PTR [rsi+rax]
        cmp     dl, 91
        je      .L1
        cmp     dl, 32
        je      .L1
        add     rax, 2
        cmp     rax, rdi
        jb      .L3
.L4:
        mov     rax, -1
.L1:
        ret

It seems like it could be optimized significantly, because I see multiple branches. How can I write my C so that the compiler optimizes it with SIMD, string instructions, and/or vectorization?

How do I write my code to signal to the compiler that this code can be optimized?

Interactive assembly output on Godbolt: https://godbolt.org/z/W19Gz8x73

Changing it to a VLA with an explicitly declared length doesn't help much: https://godbolt.org/z/bb5fzbdM1

This is a version of the code modified so that the function returns only about every 100 characters: https://godbolt.org/z/h8MjbP1cf

Cuprite asked 5/4, 2021 at 20:15. Comments (26):
I see three branches: i < length, string[i] == '[', and string[i] == ' '. Are any of them optional? – Kattie
@RobertHarvey No. Is there a way to implement these without cmp/jmp at the assembly level? – Violoncello
I don't see how. You still need to make the checks, and cmp/jmp is how assembly does that. – Kattie
@RobertHarvey Would it be possible to e.g. compare 8 bytes at a time with a bitwise comparison and use a branch at the end of the 8 bytes? I've seen GCC emit code like that before. – Violoncello
Wouldn't that result in more branches, not fewer? – Kattie
compare == branch. – Kattie
The compiler cannot vectorise the code because the code does not access the string beyond a matching character. That character might be the last character on the last mapped page, so the compiler cannot safely generate code that fetches multiple characters and checks them in parallel. You can e.g. change the code so that it simply sets a variable on match and proceeds through the rest of the string. This way, the compiler can make more assumptions about which memory accesses it may perform. – Thomasson
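A minimal sketch of that rewrite (hypothetical name findCharNoEarlyExit, untested): scan the whole buffer unconditionally and remember only the first match, so every iteration performs its load regardless of what was found earlier.

#include <stddef.h>

size_t findCharNoEarlyExit(size_t length, const char *string) {
    size_t found = (size_t)-1;
    for (size_t i = 0; i < length; i += 2) {
        // Unconditional load every iteration; record only the first hit.
        int hit = (string[i] == '[') | (string[i] == ' ');
        if (hit && found == (size_t)-1)
            found = i;
    }
    return found;
}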
@Thomasson Adding a length parameter doesn't change the code: godbolt.org/z/zsv3eGGxb Does using a VLA signal that the entire string is a valid array? – Violoncello
@fuz: Not true; a compiler targeting a specific mainstream OS knows that memory protection has page granularity, not segmentation with some arbitrary byte limit, so it can use code that works the same way as the hand-written asm for strlen or strchr in libc. See: Is it safe to read past the end of a buffer within the same page on x86 and x64? This is actually just a missed optimization in GCC/clang. ICC does know how to auto-vectorize loops whose trip count can't be calculated ahead of time (e.g. search loops). – Brebner
@Thomasson Doing exactly what you suggested results in different assembly, but it doesn't seem to be doing any special optimizations. godbolt.org/z/Po6rd1WWY – Violoncello
@Cuprite Interesting. Not sure how to coax the compiler into optimising it then. – Thomasson
@Thomasson and OP: const char str[length] as a function arg still doesn't promise the compiler it can touch memory other than what the abstract machine does. It's still exactly equivalent to const char *str. C99 does have syntax like const char str[static 100], which might even work with a variable length, but IIRC GCC doesn't usually take advantage anyway. (What is the purpose of static keyword in array parameter of function like "char s[static 10]"?) – Brebner
Of course, even with a static array where the size is definitely known, GCC still won't do this optimization; only ICC's auto-vectorizer can handle loops whose trip count can't be calculated before the first iteration runs. Are you interested in how to optimize this for x86 specifically, with SSE or AVX intrinsics like _mm_cmpeq_epi8 / _mm_movemask_epi8, or are you still looking to keep it portable? – Brebner
@Cuprite On average, how many characters do you expect the code to search before finding a match? Are you willing to rewrite the source code to allow examining the string in uint64_t-sized chunks, using portable C code? – Reprovable
@njuffa: note that it's non-trivial to use uint64_t safely; you need either memcpy or a typedef with GNU C __attribute__((may_alias)), as shown in Why does glibc's strlen need to be so complicated to run quickly?, where my answer shows how to fix the strict-aliasing bugs in glibc's portable-C fallback version. So you might want to link or reference that if you're planning to write a version based on graphics.stanford.edu/~seander/bithacks.html#ZeroInWord – Brebner
@PeterCordes FWIW, my plan would be to write it using the same ideas I used to implement various string functions in Solaris twenty years ago, using naturally aligned 64-bit loads for the bulk of the processing. Those were admittedly written in SPARC assembly language (when I last checked ten years ago, my handiwork was still visible in the OpenSolaris source base). In this case the source pointer is already guaranteed to be 16-byte aligned if I read that correctly, so casting the pointer via (void *) and uintptr_t should work, I would think? I will look at your first link. – Reprovable
@Cuprite - your 2nd Godbolt link (godbolt.org/z/Po6rd1WWY) does actually always traverse the whole array, so in theory it could auto-vectorize. But recording the match position is inconvenient / difficult to turn into asm that actually does that, and it would likely be sub-optimal except possibly for short fixed-length buffers (like 2 or 4 vectors' worth) if the function is called very frequently with the same size, so branch prediction can "learn" how many iterations the loop runs. – Brebner
@njuffa: alignment isn't the problem, the strict-aliasing rule is. Otherwise I would have included __attribute__((aligned(1), may_alias)). Accessing a char object through a uint64_t* dereference is UB. (That could happen if you were passed a pointer to a char array[64] object, for example. Dynamically allocated memory is anonymous, with no known type, so accessing it via uint64_t* here and char* everywhere else would be fine, because char* can alias anything. But named variables have types. Still, maybe unlikely to cause a problem in practice if array access works like char*; not sure.) – Brebner
@njuffa: An example of breakage with a type other than char is gcc, strict-aliasing, and horror stories. BTW, Intel intrinsics are defined to avoid these problems: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? – Brebner
@PeterCordes I am afraid you lost me there. Here we have a char * to begin with. Casting via void * used to work just fine to tell the compiler to forget the type a pointer used to point to (leaving aside issues of making sure accesses are naturally aligned), and is standard compliant by my reading of the C standard. But I didn't go to language-lawyer school. [Later:] Read your link. I am with Linus on this one. – Reprovable
@Reprovable The '[' and ' ' pair is an optimized case. However, I could also split it into two single-character searches, for '-' and '[', looking at every char, with ~100 characters between matches. When searching for both '[' and ' ' while looking at every other char, a match is ~100 characters away, but every 4-5 chars it will find a ' ' and need to execute an additional check. I am willing to rewrite. – Violoncello
@Reprovable This is the function modified to have ~100 characters between returns: godbolt.org/z/h8MjbP1cf (It skips over 2 characters at a time, hence the '-' check.) – Violoncello
@Cuprite Have you tried simply making two calls to the system-provided strchr(), which presumably is highly optimized? Assuming your strings are not overly long, the second call would benefit from the first call having pulled the data into caches. The above comments by Peter Cordes have convinced me that it would be a royal pain to try to write fast string functions in C instead of assembly language, at least when using gcc. – Reprovable
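A quick sketch of the two-strchr() idea (hypothetical name findCharStrchr, untested). Note the assumptions: the buffer must be NUL-terminated, and strchr checks every character rather than every other one, so this is not an exact drop-in for the original function.

#include <stddef.h>
#include <string.h>

size_t findCharStrchr(const char *string) {
    const char *a = strchr(string, '[');
    const char *b = strchr(string, ' ');
    if (!a && !b) return (size_t)-1;           // neither character present
    if (!a) return (size_t)(b - string);
    if (!b) return (size_t)(a - string);
    return (size_t)((a < b ? a : b) - string); // earlier of the two matches
}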
@njuffa: The char* function arg has to be pointing to something, and it's UB if that's an object that's definitely not uint64_t, unless you use memcpy into a uint64_t instead of dereferencing a uint64_t*, or use a typedef. (In practice this matters after function inlining; e.g. violating strict aliasing can mean it's not safe to compile with LTO.) Just for example, say the original data was an array of unsigned short and that's why only every 2nd char matters. After inlining, the compiler can assume that uint64_t* derefs aren't reading the same data that my_u16[i] = '[' wrote. – Brebner
@njuffa: Passing a pointer via a char* or void* function arg doesn't "launder" it in terms of removing strict-aliasing UB; it doesn't make it safe to deref it as types other than the original. That's true even if the original data was a char buf[100]. It would be fine if you had char *buf = malloc(100);, though, because then the only accesses to it would be via char* or in fast-strings functions, as long as all your fast-strings stuff uses the same type. (char* is allowed to alias anything, like __m128i*, or my_aliasing_u64* with a GNU C typedef.) – Brebner
@njuffa: for portable C, I'd suggest writing uint64_t aliasing_u64_load(const void *p) { uint64_t tmp; memcpy(&tmp, p, sizeof(tmp)); return tmp; }. (That also makes unaligned loads safe, so GCC won't always inline it as a single load instruction if it can't prove alignment, on ISAs where unaligned word loads aren't safe.) – Brebner
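Combining that memcpy load helper with the bithacks ZeroInWord test gives a portable SWAR sketch of what @Reprovable describes (hypothetical names, untested). A match anywhere in an 8-byte chunk ends the fast loop; the scalar loop then re-applies the every-other-character rule and handles the tail, so a hit at an ignored odd index merely degrades to scalar scanning for the rest of the string.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Aliasing-safe 8-byte load via memcpy, per the comment above. */
static uint64_t aliasing_u64_load(const void *p) {
    uint64_t tmp;
    memcpy(&tmp, p, sizeof(tmp));
    return tmp;
}

/* Classic bithack: non-zero iff some byte of x is zero. */
static uint64_t haszero(uint64_t x) {
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

size_t findCharSwar(size_t length, const char *string) {
    const uint64_t B = 0x5B5B5B5B5B5B5B5BULL; /* '[' in every byte */
    const uint64_t S = 0x2020202020202020ULL; /* ' ' in every byte */
    size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        uint64_t v = aliasing_u64_load(string + i);
        /* XOR turns a matching byte into zero; break on any match. */
        if (haszero(v ^ B) | haszero(v ^ S))
            break;
    }
    for (; i < length; i += 2)          /* pinpoint the match / handle tail */
        if (string[i] == '[' || string[i] == ' ')
            return i;
    return (size_t)-1;
}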

I don’t know how to convince the compiler to emit good auto-vectorized code for that, but I know how to vectorize manually. Since you’re compiling for Skylake, here’s an AVX2 version of your function. Untested.

#include <stddef.h>
#include <immintrin.h>

ptrdiff_t findCharAvx2( size_t length, const char* str )
{
    const __m256i andMask = _mm256_set1_epi16( 0xFF );
    const __m256i search1 = _mm256_set1_epi16( '[' );
    const __m256i search2 = _mm256_set1_epi16( ' ' );

    const char* const ptrStart = str;
    const char* const ptrEnd = str + length;
    const char* const ptrEndAligned = str + ( length / 32 ) * 32;
    for( ; str < ptrEndAligned; str += 32 )
    {
        // Load 32 bytes; the AND zeroes the odd-offset bytes, since only even indices are candidates
        __m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
        vec = _mm256_and_si256( andMask, vec );

        // Compare 16-bit lanes for equality, combine with OR
        const __m256i cmp1 = _mm256_cmpeq_epi16( vec, search1 );
        const __m256i cmp2 = _mm256_cmpeq_epi16( vec, search2 );
        const __m256i any = _mm256_or_si256( cmp1, cmp2 );
        const int mask = _mm256_movemask_epi8( any );

        // If neither character is found, mask will be 0.
        // Otherwise, the least significant set bit = index of the first matching byte in `any` vector
#ifdef _MSC_VER
        unsigned long bitIndex;
        // That's how actual instruction works, it returns 2 things at once, flag and index
        if( 0 == _BitScanForward( &bitIndex, (unsigned long)mask ) )
            continue;
#else
        if( 0 == mask )
            continue;
        const int bitIndex = __builtin_ctz( mask );
#endif
        return ( str - ptrStart ) + bitIndex;
    }

    // Handle the remainder
    for( ; str < ptrEnd; str += 2 )
    {
        const char c = *str;
        if( c == '[' || c == ' ' )
            return str - ptrStart;
    }
    return -1;
}
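For reference, a hypothetical caller might look like this (the buffer contents are made up for illustration; %td prints a ptrdiff_t):

#include <stdio.h>
#include <string.h>

int main( void )
{
    const char text[] = "abcdefgh[ijklmnop qrstuvwx";
    // The '[' at (even) index 8 is the first match the function reports.
    printf( "first match at index %td\n", findCharAvx2( strlen( text ), text ) );
    return 0;
}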
Olivine answered 6/4, 2021 at 15:55. Comments (12):
Why __builtin_ffs( mask ) - 1 instead of __builtin_ctz( mask )? (count trailing zeros = BSF or TZCNT) – Brebner
You could avoid the SIMD AND mask and instead mask the movemask result. Since you branch on it being non-zero anyway, hopefully the compiler can just use and reg,0x55555555 / jnz found instead of test reg,reg / jnz found. That can still macro-fuse on Intel but not AMD. Loading with a memory-source VPAND instead of a separate VMOVDQU is cheap (especially if it avoids an indexed addressing mode on Intel, so it stays a single micro-fused uop), but it does need another SIMD ALU uop in the back-end. – Brebner
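A sketch of that variant of the 32-byte loop body (untested, GCC/clang builtins): compare the raw bytes directly and mask the movemask result to the even bit positions, so the SIMD AND and the epi16 compares go away.

// Untested drop-in for the loop body of findCharAvx2 above.
__m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
const __m256i cmp1 = _mm256_cmpeq_epi8( vec, _mm256_set1_epi8( '[' ) );
const __m256i cmp2 = _mm256_cmpeq_epi8( vec, _mm256_set1_epi8( ' ' ) );
int mask = _mm256_movemask_epi8( _mm256_or_si256( cmp1, cmp2 ) );
mask &= 0x55555555;  // keep only matches at even byte offsets
if( mask )
    return ( str - ptrStart ) + __builtin_ctz( mask );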
vpshufb could duplicate each byte to the containing word instead of the AND, setting up for a single vpcmpeqb with set1_epi16('[' << 8 | ' '), then I guess ctz(mask) >> 1. Alternatively, 2x VPAND / VPACKUSWB sets up for 2x compare + OR of 2 vectors at once (and then you have to sort out the data position from the in-lane shuffling if you find a hit). But I think if you're going to shuffle to increase data density, VPSHUFB within one vector is best. – Brebner
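My reading of that vpshufb idea as a loop-body sketch (untested): copy each even-offset byte into the odd slot above it, then one vpcmpeqb tests both characters at once. Note the index recovery here uses ctz(mask) & ~1 to get the byte offset (ctz(mask) >> 1 would give the character-pair index instead).

// Untested sketch; vpshufb shuffles within each 128-bit lane, which is
// fine because the control only references bytes 0-14 of the same lane.
const __m256i dupEven = _mm256_setr_epi8(
    0, 0, 2, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 14, 14,
    0, 0, 2, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 14, 14 );
const __m256i pattern = _mm256_set1_epi16( ( '[' << 8 ) | ' ' );  // bytes: ' ', '['

__m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
vec = _mm256_shuffle_epi8( vec, dupEven );  // both bytes of each word = even-offset char
const int mask = _mm256_movemask_epi8( _mm256_cmpeq_epi8( vec, pattern ) );
if( mask )  // bit 2k set for a ' ' match, bit 2k+1 for a '[' match
    return ( str - ptrStart ) + ( __builtin_ctz( mask ) & ~1 );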
@PeterCordes OK, changed the gcc builtin. Bitwise SIMD instructions are very cheap; the reciprocal throughput is 1/3 cycle on Intel, 1/4 cycle on AMD. And gcc does fuse the load: godbolt.org/z/78Gqbva8W – Olivine
If length >= 32 (or 16 or 8), it should be possible to do cleanup with a final unaligned vector that ends at the end of the array, overlapping and re-checking some number of elements depending on len % 32. You only need scalar cleanup if the array is too small for even a single vector. (Even then, you could implement alignment checking / masking.) – Brebner
@PeterCordes Yeah, I thought about that trick, but decided I don't like the complexity overhead. One still needs a scalar loop for small input arrays, and my version is already way more complicated than OP's original code. – Olivine
I know they're cheap, but you have 5 SIMD ALU operations per loop and only 3 SIMD ALU ports on Intel. (Although it's a 9-uop loop, so even Ice Lake's 5-wide front-end will have a hard time saturating the ALU ports.) Also, you have three 32-byte constants instead of two, so that touches at least 2 cache lines (because current GCC and clang are dumb and don't load them with vpbroadcastd, about 1 extra code byte each to shrink 32 bytes of constant to 4 bytes). Anyway, on Intel I'm pretty sure mask &= 0x55555555; is strictly better, except for possible code-size / alignment effects. And it's not bad on AMD. – Brebner
Yeah, avoiding scalar cleanup takes more work to code, although a helper function can reduce the repeated work. Still, for small strings like 30 bytes it's all scalar, and for a 60-byte string it's one vector plus 28 bytes of scalar cleanup. Might be better to just use 128-bit vectors if you expect short strings to be common. (So yeah, tuning strongly depends on your expected use case. If short strings are important, some kind of alignment check to handle buffers smaller than 1 full vector can let you drop the scalar cleanup, e.g. load and check mask <= 1U<<len if you can avoid a page split.) – Brebner
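A hedged sketch of that overlapping-final-vector cleanup (untested, GCC/clang builtins): when length >= 32, one unaligned load that ends at (or just before) the buffer's end can re-check the tail. The tail offset is rounded down to an even value so the 0x00FF lane masking still tests the same every-other-character positions, and for odd lengths the final even index length-1 still needs one scalar check.

// Untested sketch: replaces the scalar remainder loop of findCharAvx2
// when length >= 32, reusing andMask / search1 / search2 / ptrStart.
if( length >= 32 )
{
    const size_t tailOff = ( length - 32 ) & ~(size_t)1;  // even; load stays in bounds
    __m256i vec = _mm256_loadu_si256( ( const __m256i * )( ptrStart + tailOff ) );
    vec = _mm256_and_si256( andMask, vec );
    const __m256i any = _mm256_or_si256( _mm256_cmpeq_epi16( vec, search1 ),
                                         _mm256_cmpeq_epi16( vec, search2 ) );
    const int mask = _mm256_movemask_epi8( any );
    // Re-checked positions before ptrEndAligned are known non-matches,
    // so the lowest set bit is still the first match overall.
    if( mask )
        return tailOff + __builtin_ctz( mask );
    // Odd length: even index length-1 lies one past the tail vector's coverage
    if( ( length & 1 ) && ( ptrStart[length - 1] == '[' || ptrStart[length - 1] == ' ' ) )
        return (ptrdiff_t)length - 1;
    return -1;
}
// length < 32: keep the scalar loop as before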
How many characters does it need to scan at a time to be worth using? 8? 100? – Violoncello
Is it possible to vectorize this function? godbolt.org/z/h8MjbP1cf It returns roughly every 100 characters. – Violoncello
@Cuprite I think even 8 will be faster than scalar code. SIMD is implemented inside the CPU cores; the latency overhead for passing data between vector and general-purpose registers is just a couple of CPU cycles. – Olivine
@Cuprite Of course it is possible: godbolt.org/z/bPe9n6Kvv – Olivine
