Why do compilers miss vectorization here?

Consider the following valarray-like class:

#include <stdlib.h>

struct va
{
    void add1(const va& other);
    void add2(const va& other);

    size_t* data;
    size_t  size;
};

void va::add1(const va& other) {
    for (size_t i = 0; i < size; ++i) {
        data[i] += other.data[i];
    }
}

void va::add2(const va& other){
    for (size_t i = 0, s = size; i < s; ++i) {
        data[i] += other.data[i];
    }
}

The add2 function is vectorized by various compilers (MSVC, Clang, GCC, ICC), whereas add1 is not. See https://godbolt.org/z/c61qvrrbv

This is explained by potential aliasing: the compilers cannot prove that none of the elements pointed to by data is the size member itself.
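
To see why the compiler has to assume data[i] += ... can change size, consider a contrived caller like the one below (a hedged sketch; adding zero keeps it well-defined, but the compiler must be correct for the general case, so it cannot hoist the load of size in add1):

va v, w;
size_t zero = 0;
v.data = &v.size;   // a size_t* may legally point at the size member itself
v.size = 1;
w.data = &zero;
w.size = 1;
v.add1(w);          // data[0] += 0 writes through to v.size; with a nonzero
                    // addend, a hoisted copy of size would disagree with memory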

However, there's also potential overlap between the elements pointed to by data and by other.data. For MSVC there is additionally potential aliasing between these elements and the pointers themselves, as it does not take advantage of the Strict Aliasing Rule. This applies to both add1 and add2.

The compilers perform runtime checks for all the possible aliasing they suspect and execute a vectorized loop for add2.

Why aren't they adding more checks and having this optimization for add1?

Darbies answered 17/8, 2023 at 10:58 Comment(9)
size could be modified externally during execution of the function, I suppose?Doloritas
This is explained by potential aliasing: compilers cannot prove that one of elements pointed by data is not the size itself. I believe this is the reason. For example when data and size have different types, it is vectorized.Gilletta
@AlanBirtles, but this would be a race condition, so the compiler is free to do whatever in this case. For concurrent modification, one should use std::atomic or another multithreading facility that would give proper atomicity and memory ordering. Note also that the pointers can be concurrently modified too, which does not prevent the vectorization of add2.Darbies
@KamilCuk, but the compiler can perform some comparisons to ensure that. Note that in add2 there are some comparisons already, at least for pointer ranges (and more on MSVC)Darbies
Not familiar with asm, but even for (size_t i = 0; i < size; ++i) { data[i] += 1;} seems not to have code to handle the aliasing for vectorization. I suppose they have a pattern for that kind of optimization and this one is not one of them.Ethban
@Jarod42, quite plausible. But then it's too bad they don't have such a pattern. Whereas contiguous containers like std::vector are often implemented via a few pointers only, std::span is normally a pointer and a size. So a manual (non-range) for loop over a std::span is susceptible to this too.Darbies
You'd wish that an [[assume(&va.size < va.data || &va.size >= va.data + va.size)]] would work. That ought to prevent a runtime check in both cases, making it irrelevant whether the optimizer would insert such a runtime check.Harald
@Harald this sounds like an interesting idea to avoid the overlap checks between the [data, data+size) and [other.data, other.data+size) intervals; for the size member I already accept add2 as a source-level workaround. What I actually want is vectorization of add1, which is written intuitively; fine if it takes more runtime checks. Anyway, I've tried __assume in MSVC, __builtin_assume in Clang, and [[assume]] in GCC -- with no effect.Darbies
@AlexGuteniev: A better solution would be to recognize that there is no general permission to access structures through seemingly-unrelated pointers of member-object type, but then recognize that the rules which would apply to unrelated pointers don't apply for pointers which are freshly visibly derived from a pointer or lvalue of another type.Rottweiler

It looks like compilers aren't able to realize (and don't add code to check) that data[i] never points to this->size. This would make the loop trip-count not calculable ahead of the first iteration, which is something no mainstream compiler except ICC can ever handle.

Hopefully compilers can learn to check for possible overlap before applying their vectorization logic, like (.data > &this->size) || (.data + .size <= &this->size); hopefully there's an efficient way to do that. They already invent code to check for overlap between the two arrays in add2.
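
As a source-level sketch of such a check (hypothetical: add1_checked is a made-up free-function variant of add1, and ordering comparisons of unrelated pointers are technically unspecified), the dispatch could look like this, with the compiler still emitting its usual data / other.data overlap checks for the fast path:

// Hypothetical variant of add1 with a manual "data[] can't contain size" check.
void add1_checked(va& v, const va& other)
{
    const size_t* size_addr = &v.size;
    if (v.data > size_addr || v.data + v.size <= size_addr) {
        // v.data[] provably doesn't contain v.size: the bound can be hoisted,
        // so this loop is vectorizable just like add2.
        for (size_t i = 0, s = v.size; i < s; ++i)
            v.data[i] += other.data[i];
    } else {
        // Conservative fallback: reload v.size every iteration, like add1 today.
        for (size_t i = 0; i < v.size; ++i)
            v.data[i] += other.data[i];
    }
}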

(The more checking code is required at startup, the more profitable vectorization has to be to pay for itself; 64-bit scalar elements aren't that bad with baseline x86-64 compared to 128-bit vectors especially when the compiler doesn't know from PGO that size is usually not small and that the loop is hot. I tried gcc -march=icelake-client and -march=znver4 to not only enable AVX2 but also set tuning heuristics for CPUs with very good vector throughput and cache/memory bandwidth. But still no joy, so this possible aliasing is probably a full roadblock, not a heuristic decision.)


Asm evidence for compilers being worried about aliasing this->size

Notice how GCC's loop branch is cmp rax, QWORD PTR [rdi+8], where rax holds i and [rdi+8] is this->size (x86-64 SysV calling convention), so it's reloading it every iteration. If we compile with -O3 -fno-tree-vectorize, we see GCC's add2 loads the size into a register ahead of the loop, comparing against that inside the loop, i.e. hoisting the load. The fact that GCC doesn't do that with add1 is a pretty clear sign that it thinks data[i] += ... can modify this->size.

# GCC's add1 inner loop with -O3   -march=icelake-client
.L3:
        mov     rcx, QWORD PTR [rsi+rax*8]
        add     QWORD PTR [rdx+rax*8], rcx
        inc     rax
        cmp     rax, QWORD PTR [rdi+8]
        jb      .L3

Also, changing the type to unsigned *data; or anything that can't point to a size_t lets GCC, Clang, and ICC auto-vectorize add1. Using -fno-strict-aliasing disables vectorization again. (And makes compilers extra "paranoid", reloading this->data and other.data every iteration, like MSVC was always doing. It also breaks vectorization of add2 for those compilers.)
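
For example, a variant like this sketch (vb is a made-up name; the same #include as in the question is assumed) gets auto-vectorized under strict aliasing, because a store through an unsigned* cannot legally modify a size_t:

struct vb
{
    unsigned* data;   // unsigned cannot alias the size_t member below
    size_t    size;

    void add1(const vb& other)
    {
        for (size_t i = 0; i < size; ++i)   // size can be hoisted: data[i] can't change it
            data[i] += other.data[i];
    }
};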

Changing the pointer type doesn't help MSVC because it doesn't do type-based aliasing analysis; it always acts like gcc -fno-strict-aliasing. MSVC's add2 already does check for more than just overlap of the pointed-to arrays, I think; probably some of that extra cmp/jcc is checking that this->data[i] += ... won't change the .data pointer in this or other.


std::vector doesn't have size_t members, usually avoiding this

std::vector<size_t> wouldn't have this problem (at least in non-MSVC compilers) because type-based aliasing knows that a size_t* can't point at another pointer. std::vector normally stores three pointers: .begin, .end, and .end_of_capacity, so the size information is end - begin, not a member storing the size directly.
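
A rough sketch of that layout (member names are illustrative, not any particular library's):

template <class T>
struct vector_layout
{
    T* begin_;             // first element
    T* end_;               // one past the last element
    T* end_of_capacity_;   // one past the allocated storage

    size_t size() const { return static_cast<size_t>(end_ - begin_); }   // derived, not stored
};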


For iterating over one array, normally it's at least as efficient to increment a pointer like for (... ; ptr < endp ; ptr++) *ptr so you're not using indexed addressing modes. That's presumably why std::vector is normally laid out that way, rather than a pointer and two size_t members.
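
In source form that idiom might look like this sketch (inc_all is a made-up name); the bound lives in a register and only one pointer advances:

void inc_all(size_t* ptr, size_t n)
{
    for (size_t* endp = ptr + n; ptr < endp; ++ptr)
        *ptr += 1;   // no index: the pointer itself is incremented
}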

Some RISC machines don't even have two-register indexed addressing modes. For iterating two arrays, some CPUs will do better with fewer instructions just incrementing one index instead of two pointer increments, but it depends on the microarchitecture, e.g. some x86 CPUs un-laminate add reg, [reg + reg] into 2 uops in the back-end, not keeping it micro-fused, especially with 3-operand AVX instructions.

An efficient way (in asm) to loop over two arrays on CPUs with indexed addressing is to address one array relative to the other. It's UB to do this in C++, and would obfuscate your code, so it's something compilers should do for you. sub rsi, rdi outside the loop (subtract pointers), then the loop body can be mov eax, [rsi + rdi] (second array = difference + first) / add [rdi], eax (first array) / add rdi, 4 (increment the pointer which is also the index for the other array).
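
Purely as an illustration of what that asm does (not a recommendation: this is the UB/obfuscation just mentioned, smuggled through uintptr_t, which is at best implementation-defined; add_rel is a made-up name):

#include <cstdint>
#include <cstddef>

void add_rel(int* a, const int* b, std::size_t n)
{
    std::uintptr_t diff = reinterpret_cast<std::uintptr_t>(b)
                        - reinterpret_cast<std::uintptr_t>(a);        // sub rsi, rdi
    for (int *p = a, *end = a + n; p != end; ++p) {
        int bv = *reinterpret_cast<const int*>(
                     reinterpret_cast<std::uintptr_t>(p) + diff);     // mov eax, [rsi+rdi]
        *p += bv;                                                     // add [rdi], eax
    }                                                                 // ++p is the add rdi, 4
}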

MSVC will actually do this optimization which other compilers haven't picked up yet. (Godbolt)

// Compilers still optimize without __restrict, but it gets rid of the noise of extra checking
void foo(int *__restrict a, int *__restrict b){
    for (int i=0 ; i<10240; i++){
        a[i] += b[i];
    }
}
void foo(int * __restrict,int * __restrict) PROC                              ; foo, COMDAT
        lea     rax, QWORD PTR [rcx+32]
        sub     rdx, rcx       ;;;; <---- Pointer subtraction
        mov     ecx, 320                      ; 00000140H
        npad    4
$LL4@foo:
        vmovdqu ymm1, YMMWORD PTR [rax-32]            ;; unrolled with 4 vectors
        vpaddd  ymm1, ymm1, YMMWORD PTR [rdx+rax-32]
        vmovdqu YMMWORD PTR [rax-32], ymm1

        vmovdqu ymm2, YMMWORD PTR [rax]
        vpaddd  ymm1, ymm2, YMMWORD PTR [rdx+rax]
        vmovdqu YMMWORD PTR [rax], ymm1

        vmovdqu ymm1, YMMWORD PTR [rax+32]
        vpaddd  ymm1, ymm1, YMMWORD PTR [rdx+rax+32]
        vmovdqu YMMWORD PTR [rax+32], ymm1

        vmovdqu ymm1, YMMWORD PTR [rax+64]
        vpaddd  ymm1, ymm1, YMMWORD PTR [rdx+rax+64]
        vmovdqu YMMWORD PTR [rax+64], ymm1

        lea     rax, QWORD PTR [rax+128]
        sub     rcx, 1
        jne     SHORT $LL4@foo
        vzeroupper
        ret     0

Unfortunately, MSVC got it backwards and is using the two-register addressing mode as the memory source operand for vpaddd. It'll un-laminate at issue/rename into the ROB on Intel Sandybridge-family, including at least Skylake, probably some later. But vpaddd ymm1, ymm1, [rdx] would be 1 uop. The pure load vmovdqu would always be 1 uop even with an indexed addressing mode.

Indexed stores aren't ideal either (the store-address uop can't run on port 7 on Haswell / Skylake), and MSVC does avoid that. But it could get the best of both worlds by doing the pure load from b[] with an indexed addressing mode, and then memory-source vpaddd + store with a simple addressing mode like [rax+32].

So MSVC did save some code-size, and it gets some benefit in back-end throughput by needing only one pointer increment of loop overhead, and in AGU port contention: this can run at close to 1 vector per clock with L1d cache hits, which requires doing 2 loads + 1 store every cycle. (Intel's optimization manual suggests that Skylake can't fully sustain that for 32-byte loads/stores, for some unknown reason.)

With an indexed addressing mode for the store like GCC and clang use, it could only run at 1 vector per 1.5 cycles on Haswell / Skylake. (Ice Lake has two load AGUs and two separate store AGUs, avoiding this bottleneck, but I don't know if unlamination of indexed addressing modes is still a thing on Ice Lake or Alder Lake.)

I don't know what's up with MSVC preferring lea for incrementing a pointer. Or preferring sub rcx, 1/jne instead of setting up an end-pointer before the loop (with lea instead of mov) and comparing against it; then the end of the loop could be cmp rax, r8/jne or something, which would macro-fuse into a single uop on AMD, not just Intel.

Attic answered 17/8, 2023 at 18:56 Comment(10)
std::vector doesn't have size_t members, usually avoiding this Right. I've encountered it on MSVC valarray. It fortunately saves the size before the loop, and my current PR will add an explanation so that it is not lost. libc++ valarray uses two pointers, so it is fine. However, many STL and non-STL std::span implementations are a pointer and size pair, so the problem gains some impact with C++17 adoption.Darbies
64-bit scalar elements aren't that bad with baseline x86-64 compared to 128-bit vectors - you could make size and elements 32-bit or 16-bit to make the scalar version less appealingDarbies
@AlexGuteniev: Yes, and in that case compilers other than MSVC auto-vectorize because those types aren't alias-compatible with size_t. Fair point that it's an MSVC question, and MSVC itself will suffer more in those cases. Also, with uint8_t *data, that's unsigned char* which can alias anything, giving problems for GCC / Clang / ICC.Attic
No, I mean, like uint16_t* data; uint16_t size;Darbies
@AlexGuteniev: Oh right, you said both size and elements. Yes, that's a fair point. Pointers are 8 bytes and aligned, so you'd be leaving wasted space in this struct with a narrower size, but other structs could pack other members next to the size.Attic
C++ does not work this way, the padding will always be wasted: godbolt.org/z/4soPY3K7f . Anyway, the padding does not convince the compilers that the size is not an element of the destination array. (And sure, the structure with a pointer and a 16-bit size is not for practical purposes, only for this experiment)Darbies
@AlexGuteniev: Oh right, for inheritance or composition yes, because code using a pointer to the sub-object type is allowed to assume that padding is always padding. I meant struct { int *data; uint32_t size; uint8_t foo, bar; } which only has 2 bytes of padding. So that doesn't help for std::span, but std::span will use size_t anyway.Attic
Re: the choice of which operation uses two-register addressing. I think MSVC tries to use it as little as possible, so avoiding the indexed store is not a coincidence, but it still does not care which of the operations is two-reg addressed. Compare loops with a[i] = a[i] - b[i] (two-reg addressed load) and with a[i] = b[i] - a[i] (two-reg addressed subtract)Darbies
Looks like with the Clang level of loop unrolling it could just increment two pointers and avoid two-reg addressing completely; the cost of the additional increment would have been minimal.Darbies
@AlexGuteniev: Yes, it's crazy that clang keeps using indexed addressing modes when it unrolls so much, at least with -mtune=skylake. Maybe they're mostly benchmarking and tuning on AMD CPUs which don't have any downside for indexed addressing modes? It would probably be break-even for code-size to do an extra increment but save a SIB byte in every memory access, at the cost of 1 extra instruction in the loop. (And thus 1 extra uop on CPUs that keep them fused, but saving many uops on HSW/SKL at least, maybe also ICL and later if they also unlaminate)Attic
