Why does GCC generate code that conditionally executes a SIMD implementation?

#include <array>

__attribute__((noinline)) void fn(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}

fn(std::array<int, 4ul>&, std::array<int, 4ul> const&):
        lea     rdx, [rsi+4]
        mov     rax, rdi
        sub     rax, rdx
        cmp     rax, 8
        jbe     .L2
        movdqu  xmm0, XMMWORD PTR [rsi]
        movdqu  xmm1, XMMWORD PTR [rdi]
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi], xmm0
        ret
.L2:
        mov     eax, DWORD PTR [rsi]
        add     DWORD PTR [rdi], eax
        mov     eax, DWORD PTR [rsi+4]
        add     DWORD PTR [rdi+4], eax
        mov     eax, DWORD PTR [rsi+8]
        add     DWORD PTR [rdi+8], eax
        mov     eax, DWORD PTR [rsi+12]
        add     DWORD PTR [rdi+12], eax
        ret

It seems GCC12 is treating the class reference like it would a simple int *, in terms of whether lhs and rhs could partially overlap.

Exact overlap would be fine: if lhs[idx] is the same int as rhs[idx], we read it twice before writing it. But with partial overlap, rhs[3], for example, could already have been updated by one of the lhs[0..2] additions. That wouldn't happen with SIMD, which does all the loads before any of the stores, so the two versions would give different results.
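To make the overlap cases concrete, here is a small sketch that uses plain int pointers into one buffer (where partial overlap is legal) rather than std::array objects. The names add_scalar, add_loads_first and run_with_offset are made up for this illustration; add_loads_first only mimics the load-everything-then-store order of the SIMD path, it is not what GCC emits.

```cpp
#include <array>
#include <cstddef>

// Scalar order: each store happens before the next element's loads,
// like the .L2 fallback path.
void add_scalar(int* lhs, const int* rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}

// SIMD-like order: all loads first, then all stores.
void add_loads_first(int* lhs, const int* rhs)
{
    int a[4], b[4];
    for (std::size_t idx = 0; idx != 4; ++idx) {
        a[idx] = lhs[idx];
        b[idx] = rhs[idx];
    }
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = a[idx] + b[idx];
    }
}

// Runs `add` with rhs at the start of a buffer and lhs `offset` ints after it
// (offset must be in 0..4), then returns the four ints lhs points at.
std::array<int, 4> run_with_offset(void (*add)(int*, const int*), std::size_t offset)
{
    int buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int* lhs = buf + offset;
    add(lhs, buf);
    return {lhs[0], lhs[1], lhs[2], lhs[3]};
}
```

With offset 0 (exact overlap) or offset 4 (no overlap) both orders agree. With offset 1, 2 or 3 the scalar loop sees its own earlier stores through rhs, which is the range the cmp/jbe check in the asm above guards against.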

GCC13 knows that class objects aren't allowed to partially overlap (except for common initial sequence stuff for different struct/class types, which I think doesn't apply here). That would be UB so it can assume it doesn't happen. GCC12's code-gen is a missed optimization.

So how do we help GCC12? The usual go-to is __restrict, either to remove overlap checks or to enable auto-vectorization at all when the compiler doesn't want to invent checks plus a fallback. In C, restrict is part of the language, but in C++ it's only an extension. (It's supported by the major mainstream compilers, and you can use the preprocessor to #define it to the empty string on others.) You can use __restrict with references as well as pointers. (At least GCC and Clang accept it with no warnings at -Wall; I didn't double-check the docs to confirm that this is officially supported.)

// downside: fn_restrict(same, same) would be UB
void fn_restrict(std::array<int, 4>&__restrict lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}
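The preprocessor fallback mentioned above could look something like this. FN_RESTRICT and fn_restrict_portable are hypothetical names for this sketch, and the list of compilers in the #if is an assumption about where __restrict is accepted on references, so check it against the compilers you actually target.

```cpp
#include <array>
#include <cstddef>

// FN_RESTRICT is a made-up macro name: it expands to the __restrict extension
// where we assume it is available, and to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#  define FN_RESTRICT __restrict
#else
#  define FN_RESTRICT
#endif

// Same contract as fn_restrict: passing the same object for both
// arguments is UB wherever the macro expands to __restrict.
void fn_restrict_portable(std::array<int, 4>& FN_RESTRICT lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}
```

On a compiler where the macro is empty you just lose the no-alias promise, so the function stays correct but may keep the overlap check.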

Or manually read all of `lhs` before writing any of it

Since your array is small enough to fit in one SIMD register, there's no inefficiency in copying. This would be bad for array<int, 1000> or something!

// downside: only efficient for small arrays that fit in a few vector regs at most
void fn_temporary(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    auto sum = lhs;    // read the possibly-aliasing data into a temporary
    for (std::size_t idx = 0; idx != 4; ++idx) {
        sum[idx] += rhs[idx];  // update the temporary
    }
    lhs = sum;   // store back, after all loads
}

With GCC12, both of these compile to the same auto-vectorized asm that GCC13 produces for the original, with no wasted instructions (Godbolt):

# GCC12 -O3
fn_temporary(std::array<int, 4ul>&, std::array<int, 4ul> const&):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movdqu  xmm1, XMMWORD PTR [rdi]
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi], xmm0
        ret

Promising alignment (like alignas(16) on one of the types?) could let it use paddd xmm1, [rdi], a memory source operand, without AVX.
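One way to make that alignment promise is a wrapper type, sketched below. Vec4 and fn_aligned are hypothetical names, and I haven't checked whether GCC actually folds the load into paddd here; the sketch only shows how the 16-byte guarantee could be expressed in the type.

```cpp
#include <array>
#include <cstddef>

// Hypothetical wrapper: every Vec4 object is 16-byte aligned, so the
// compiler is allowed to assume aligned loads and stores of v.
struct alignas(16) Vec4 {
    std::array<int, 4> v;
};

void fn_aligned(Vec4& lhs, const Vec4& rhs)
{
    auto sum = lhs.v;    // read the possibly-aliasing data into a temporary
    for (std::size_t idx = 0; idx != 4; ++idx) {
        sum[idx] += rhs.v[idx];  // update the temporary
    }
    lhs.v = sum;         // store back, after all loads
}
```

The downside is that callers have to hold their data in Vec4 objects; a reference to an arbitrary std::array<int, 4> can't be passed without a copy.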
