Why does GCC generate code that conditionally executes a SIMD implementation?

The following code produces assembly that conditionally executes SIMD in GCC 12.3 when compiled with -O3. For completeness, the code always executes SIMD in GCC 13.2 and never executes SIMD in clang 17.0.1.

#include <array>
#include <cstddef>

__attribute__((noinline)) void fn(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}

Here is the link in godbolt.

Here is the actual assembly from GCC 12.3 (with -O3):

fn(std::array<int, 4ul>&, std::array<int, 4ul> const&):
        lea     rdx, [rsi+4]
        mov     rax, rdi
        sub     rax, rdx
        cmp     rax, 8
        jbe     .L2
        movdqu  xmm0, XMMWORD PTR [rsi]
        movdqu  xmm1, XMMWORD PTR [rdi]
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi], xmm0
        ret
.L2:
        mov     eax, DWORD PTR [rsi]
        add     DWORD PTR [rdi], eax
        mov     eax, DWORD PTR [rsi+4]
        add     DWORD PTR [rdi+4], eax
        mov     eax, DWORD PTR [rsi+8]
        add     DWORD PTR [rdi+8], eax
        mov     eax, DWORD PTR [rsi+12]
        add     DWORD PTR [rdi+12], eax
        ret

I am very interested to know a) the purpose of the first 5 assembly instructions and b) if there is anything that can be done to cause GCC 12.3 to emit the code of GCC 13.2 (ideally, without manually writing SSE).

Calyptra answered 16/2, 2024 at 22:17 Comment(2)
It's checking if the arrays partially overlap, which would cause parallel computation to be incorrect. I think this is unnecessary because std::arrays can't partially overlap, but I'm not sure. - Transported
@Transported Maybe that's why it was fixed in a later version of GCC: the check is unnecessary. - Harbison

It seems GCC12 is treating the class reference like it would a simple int *, in terms of whether lhs and rhs could partially overlap.
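Read as C++, the first five instructions of the GCC 12.3 output amount to roughly the test below. This is only a sketch of the emitted check, not anything GCC exposes, and the helper name `takes_scalar_path` is made up for illustration.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the GCC 12.3 prologue:
//   lea rdx, [rsi+4] ; mov rax, rdi ; sub rax, rdx ; cmp rax, 8 ; jbe .L2
// rdi holds the address of lhs and rsi the address of rhs.
// The unsigned compare is true only when lhs starts 4 to 12 bytes after rhs,
// which for aligned int data means 1 to 3 elements after it.
bool takes_scalar_path(const void* lhs, const void* rhs)
{
    const std::uintptr_t l = reinterpret_cast<std::uintptr_t>(lhs);
    const std::uintptr_t r = reinterpret_cast<std::uintptr_t>(rhs);
    // When lhs is at or before rhs, the subtraction wraps to a huge value,
    // so the comparison fails and the SIMD path is taken.
    return l - (r + 4) <= 8;
}
```

Exact overlap and the case where rhs starts after lhs both fall through to the SIMD path, since a store to lhs can then never change an rhs element that has not been read yet.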

Exact overlap would be fine: if lhs[idx] is the same int as rhs[idx], we read it twice before writing it. But with partial overlap, rhs[3] for example could have been updated by one of the lhs[0..2] additions in the scalar loop, which wouldn't happen with SIMD, since that does all the loads first before any of the stores.
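To make the difference concrete, here is a small sketch using plain int pointers, where partial overlap is well-defined. The function names are invented for this example. `add_loads_first` imitates what the SIMD version does by reading everything before writing anything.

```cpp
#include <array>
#include <cstddef>

// Scalar order, as the source loop is written:
// each store lands before the next iteration's loads.
void add_in_order(int* lhs, const int* rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}

// SIMD-like order: load all eight ints first, then do all four stores.
void add_loads_first(int* lhs, const int* rhs)
{
    int l[4];
    int r[4];
    for (std::size_t idx = 0; idx != 4; ++idx) {
        l[idx] = lhs[idx];
        r[idx] = rhs[idx];
    }
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = l[idx] + r[idx];
    }
}

// In both helpers, lhs starts one int after rhs inside the same 5-int buffer.
std::array<int, 5> run_in_order()
{
    std::array<int, 5> buf{1, 1, 1, 1, 1};
    add_in_order(buf.data() + 1, buf.data());
    return buf;   // {1, 2, 3, 4, 5}: each sum feeds the next rhs read
}

std::array<int, 5> run_loads_first()
{
    std::array<int, 5> buf{1, 1, 1, 1, 1};
    add_loads_first(buf.data() + 1, buf.data());
    return buf;   // {1, 2, 2, 2, 2}: every rhs read saw the original value
}
```

The two orders disagree only in this partial-overlap layout, which is exactly the case the run-time check routes to the scalar fallback.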

GCC13 knows that class objects aren't allowed to partially overlap (except for common initial sequence stuff for different struct/class types, which I think doesn't apply here). That would be UB so it can assume it doesn't happen. GCC12's code-gen is a missed optimization.


So how do we help GCC12? The usual go-to is __restrict for removing overlap checks or enabling auto-vectorization at all when the compiler doesn't want to invent checks + a fallback. In C, restrict is part of the language, but in C++ it's only an extension. (Supported by the major mainstream compilers, and you can use the preprocessor to #define it to the empty string on others.) You can use __restrict with references as well as pointers. (At least GCC and Clang accept it with no warnings at -Wall; I didn't double-check the docs to be sure this is standard.)

// downside: fn_restrict(same, same) would be UB
void fn_restrict(std::array<int, 4>&__restrict lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}
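The preprocessor fallback mentioned above could look something like this. `FN_RESTRICT` and `fn_portable` are made-up names for illustration, and the list of compilers in the `#if` is an assumption about which ones accept `__restrict` on references; adjust it for your toolchain.

```cpp
#include <array>
#include <cstddef>

// Hypothetical macro: expands to __restrict where the extension is assumed
// to exist, and to nothing elsewhere (losing only the optimization hint).
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#  define FN_RESTRICT __restrict
#else
#  define FN_RESTRICT
#endif

// Same loop as fn_restrict; passing the same array for both arguments
// is still UB whenever the macro expands to __restrict.
void fn_portable(std::array<int, 4>& FN_RESTRICT lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}
```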

Or manually read all of lhs before writing any of it

Since your array is small enough to fit in one SIMD register, there's no inefficiency in copying. This would be bad for array<int, 1000> or something!

// downside: only efficient for small arrays that fit in a few vector regs at most
void fn_temporary(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    auto sum = lhs;    // read the possibly-aliasing data into a temporary
    for (std::size_t idx = 0; idx != 4; ++idx) {
        sum[idx] += rhs[idx];  // update the temporary
    }
    lhs = sum;   // store back, after all loads
}

Both of these compile to the same auto-vectorized asm as GCC13, with no wasted instructions (Godbolt)

# GCC12 -O3
fn_temporary(std::array<int, 4ul>&, std::array<int, 4ul> const&):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movdqu  xmm1, XMMWORD PTR [rdi]
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi], xmm0
        ret

Promising alignment (like alignas(16) on one of the types?) could let it use paddd xmm1, [rdi], a memory source operand, without AVX.
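One way to make that alignment promise is to wrap the array in an over-aligned type. This is only a sketch: `Vec4` and `fn_aligned` are hypothetical names, and whether GCC 12 then actually folds one load into a memory operand was not checked here.

```cpp
#include <array>
#include <cstddef>

// Hypothetical wrapper: alignas(16) tells the compiler that every Vec4
// object starts on a 16-byte boundary, so aligned SSE accesses are safe.
struct alignas(16) Vec4 {
    std::array<int, 4> v;
};

// Same read-everything-then-store approach as fn_temporary,
// applied to the over-aligned wrapper type.
void fn_aligned(Vec4& lhs, const Vec4& rhs)
{
    auto sum = lhs.v;                 // read lhs into a temporary
    for (std::size_t idx = 0; idx != 4; ++idx) {
        sum[idx] += rhs.v[idx];       // update the temporary
    }
    lhs.v = sum;                      // store back after all loads
}
```

The cost is that callers have to hold their data in the wrapper type, or otherwise guarantee the alignment themselves.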

Chagall answered 16/2, 2024 at 22:44 Comment(2)
A potential danger with using restrict or __restrict is that if a block of code is conditionally executed depending upon whether a pointer which is based on a certain restrict-qualified pointer is equal to another which is not, the Standard's definition of "based upon" will be meaningless within that block. Thus, if a function would accept arguments void test(int *restrict p, int *restrict q) and include special logic to handle the p==q case by using p to access anything, clang or gcc may arbitrarily replace some of the p references with q references and assume they won't affect... - Fi
...storage which, in the code as written, had been accessed exclusively using lvalues of the form p[intValue]. I don't think the authors of C99 would have accepted the textual definition of "based upon" if they'd known compilers would interpret it this way, but I can't imagine the authors of the Standard coming out now and saying that clang and gcc are misinterpreting it. - Fi
