It looks like compilers aren't able to realize (and don't add code to check) that data[i] never points to this->size. That would make the loop trip-count not calculable ahead of the first iteration, which is something no mainstream compiler except ICC can handle.
Hopefully compilers can learn to check for possible overlap before applying their vectorization logic, like (.data > this) || (.data + .size < this); hopefully there's an efficient way to do that. They already invent code to check for overlap between two arrays in add2.
(The more checking code is required at startup, the more profitable vectorization has to be to pay for itself; 64-bit scalar elements aren't that bad with baseline x86-64 compared to 128-bit vectors, especially when the compiler doesn't know from PGO that size is usually not small and that the loop is hot. I tried gcc -march=icelake-client and -march=znver4 to not only enable AVX2 but also set tuning heuristics for CPUs with very good vector throughput and cache/memory bandwidth. But still no joy, so this possible aliasing is probably a full roadblock, not a heuristic decision.)
Asm evidence for compilers being worried about aliasing this->size
Notice how GCC's loop branch is cmp rax, QWORD PTR [rdi+8], where rax holds i and [rdi+8] is this->size (x86-64 SysV calling convention), so it's reloading the size every iteration. If we compile with -O3 -fno-tree-vectorize, we see that GCC's add2 loads the size into a register ahead of the loop and compares against that inside the loop, i.e. it hoists the load. The fact that GCC doesn't do that with add1 is a pretty clear sign that it thinks data[i] += ... can modify this->size.
# GCC's add1 inner loop with -O3 -march=icelake-client
.L3:
mov rcx, QWORD PTR [rsi+rax*8]
add QWORD PTR [rdx+rax*8], rcx
inc rax
cmp rax, QWORD PTR [rdi+8]
jb .L3
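(A minimal sketch of the usual source-level fix, assuming the hypothetical Foo sketched earlier: copy size into a local so stores through data[i] provably can't change the loop bound. This is presumably roughly what the question's add2 achieves; the remaining question of whether the two arrays overlap can then be answered with the runtime check compilers already invent.)
// Hypothetical rewrite: hoist the trip count into a local by hand.
void add1_hoisted(Foo &self, const Foo &other) {
    size_t n = self.size;               // loop bound is now a local, computable up front
    for (size_t i = 0; i < n; ++i)
        self.data[i] += other.data[i];  // only data vs. other.data overlap still needs a check
}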
Also, changing the type to unsigned *data; or anything else that can't point to a size_t lets GCC, Clang, and ICC auto-vectorize add1. Using -fno-strict-aliasing disables vectorization again. (And it makes the compilers extra "paranoid", reloading this->data and other.data every iteration, like MSVC was always doing. It also breaks vectorization of add2 for those compilers.)
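(A sketch of that type change, for illustration: with an element type that can't legally alias a size_t, type-based alias analysis proves the stores can't touch size. Bar is a hypothetical name.)
#include <stddef.h>

struct Bar {
    unsigned *data;   // unsigned can't legally alias a size_t object on typical 64-bit targets
    size_t    size;

    void add1(const Bar &other) {
        for (size_t i = 0; i < size; ++i)
            data[i] += other.data[i];   // vectorizes in GCC/Clang/ICC, unless -fno-strict-aliasing
    }
};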
Changing the pointer type doesn't help MSVC because it doesn't do type-based aliasing analysis; it always acts like gcc -fno-strict-aliasing. MSVC's add2 already checks for more than just overlap of the pointed-to arrays, I think; probably some of that extra cmp/jcc is checking that this->data[i] += ... won't change the .data pointer in this or other.
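(One source-level option that also works for MSVC, since it won't use types to rule this out: copy the members into __restrict-qualified locals to promise there's no overlap at all. This is a sketch of a common idiom, not something from the question; add3 is a hypothetical name, and it assumes the Foo sketch above.)
// Hypothetical add3: __restrict-qualified local copies promise no aliasing, so the
// compiler can vectorize without emitting its own overlap checks.
void add3(Foo &self, const Foo &other) {
    size_t n = self.size;                      // hoist the trip count
    size_t *__restrict d = self.data;          // promise: d and o don't overlap
    const size_t *__restrict o = other.data;
    for (size_t i = 0; i < n; ++i)
        d[i] += o[i];
}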
std::vector doesn't have size_t members, usually avoiding this
std::vector<size_t> wouldn't have this problem (at least in non-MSVC compilers) because type-based aliasing knows that a size_t * can't point at another pointer. std::vector normally stores three pointers: .begin, .end, and .end_of_capacity, so the size information is end - begin, not a member storing the size directly.
For iterating over one array, it's normally at least as efficient to increment a pointer, like for (... ; ptr < endp ; ptr++) *ptr, so you're not using indexed addressing modes. That's presumably why std::vector is normally laid out that way, rather than as a pointer and two size_t members.
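(A sketch of that layout and iteration style, with hypothetical names: three pointers, no stored size_t member, and a loop that compares against an end pointer instead of indexing.)
#include <stddef.h>

// Hypothetical vector-like layout: three pointers, no stored size.
struct VecLike {
    size_t *begin_;
    size_t *end_;
    size_t *end_of_capacity_;
    size_t size() const { return end_ - begin_; }   // size is derived: end - begin
};

void inc_all(VecLike &v) {
    // ptr < endp style loop: the bound is an end pointer held in a register, and
    // the element accesses don't need an indexed addressing mode.
    for (size_t *p = v.begin_, *end = v.end_; p != end; ++p)
        *p += 1;
}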
Some RISC machines don't even have two-register indexed addressing modes. For iterating over two arrays, some CPUs will do better with fewer instructions, just incrementing one index instead of two pointer increments, but it depends on the microarchitecture; e.g. some x86 CPUs un-laminate add reg, [reg + reg] into 2 uops in the back-end, not keeping it micro-fused, especially with 3-operand AVX instructions.
An efficient way (in asm) to loop over two arrays on CPUs with indexed addressing is to address one array relative to the other. It's UB to do this in C++, and it would obfuscate your code, so it's something compilers should do for you: sub rsi, rdi outside the loop (subtract the pointers), then the loop body can be mov rax, [rsi + rdi] (second array = difference + first) / add [rdi], rax (first array) / add rdi, 8 (increment the pointer, which is also the index for the other array).
MSVC will actually do this optimization, which other compilers haven't picked up yet. (Godbolt)
// Compilers still optimize without __restrict, but it gets rid of the noise of extra checking
void foo(int *__restrict a, int *__restrict b){
for (int i=0 ; i<10240; i++){
a[i] += b[i];
}
}
void foo(int * __restrict,int * __restrict) PROC ; foo, COMDAT
lea rax, QWORD PTR [rcx+32]
sub rdx, rcx ;;;; <---- Pointer subtraction
mov ecx, 320 ; 00000140H
npad 4
$LL4@foo:
vmovdqu ymm1, YMMWORD PTR [rax-32] ;; unrolled with 4 vectors
vpaddd ymm1, ymm1, YMMWORD PTR [rdx+rax-32]
vmovdqu YMMWORD PTR [rax-32], ymm1
vmovdqu ymm2, YMMWORD PTR [rax]
vpaddd ymm1, ymm2, YMMWORD PTR [rdx+rax]
vmovdqu YMMWORD PTR [rax], ymm1
vmovdqu ymm1, YMMWORD PTR [rax+32]
vpaddd ymm1, ymm1, YMMWORD PTR [rdx+rax+32]
vmovdqu YMMWORD PTR [rax+32], ymm1
vmovdqu ymm1, YMMWORD PTR [rax+64]
vpaddd ymm1, ymm1, YMMWORD PTR [rdx+rax+64]
vmovdqu YMMWORD PTR [rax+64], ymm1
lea rax, QWORD PTR [rax+128]
sub rcx, 1
jne SHORT $LL4@foo
vzeroupper
ret 0
Unfortunately, MSVC got it backwards: it's using the two-register addressing mode as the memory source operand for vpaddd. That operand will un-laminate at issue/rename into the ROB on Intel Sandybridge-family, including at least Skylake and probably some later CPUs, but vpaddd ymm1, ymm1, [rdx] would stay 1 uop. The pure load vmovdqu would always be 1 uop even with an indexed addressing mode.
Indexed stores aren't ideal either (the store-address uop can't run on port 7 on Haswell / Skylake), and MSVC does avoid that. But it could get the best of both worlds by doing the pure load from b[] with an indexed addressing mode, and then the memory-source vpaddd + store with a simple addressing mode like [rax+32].
So MSVC did save some code-size, and it gets some benefit in back-end throughput by needing only one increment of loop overhead, and in AGU port contention, letting it do the 2 loads + 1 store every cycle needed to run at close to 1 vector per clock with L1d cache hits. (Intel's optimization manual suggests that Skylake can't fully sustain that for 32-byte loads/stores, for some unknown reason.)
With an indexed addressing mode for the store, like GCC and clang use, it could only run at 1 vector per 1.5 cycles on Haswell / Skylake. (Ice Lake has two load AGUs and two separate store AGUs, avoiding this bottleneck, but I don't know whether un-lamination of indexed addressing modes is still a thing on Ice Lake or Alder Lake.)
I don't know what's up with MSVC preferring lea for incrementing a pointer, or preferring sub rcx, 1 / jne over comparing against an end-pointer (computed with an lea before the loop instead of the mov). Then the end of the loop could be cmp rax, r8 / jne or something, which would fuse into a single uop on AMD, not just Intel.
Comments:
size could be modified externally during execution of the function, I suppose? – Doloritas
This is explained by potential aliasing: compilers cannot prove that one of the elements pointed to by data is not the size itself. I believe this is the reason. For example, when data and size have different types, it is vectorized. – Gilletta
If size could be modified externally, that would need std::atomic or some other multithreading facility to give proper atomicity and memory ordering. Note also that the pointers can also be concurrently modified, which does not prevent the vectorization of add2. – Darbies
In add2 there are some comparisons already, at least for pointer ranges (and more on MSVC). – Darbies
for (size_t i = 0; i < size; ++i) { data[i] += 1; } seems not to have code to handle the aliasing for vectorization. I suppose they have a pattern for that kind of optimization and this one is not one of them. – Ethban
std::vector is often implemented via a few pointers only; std::span is normally a pointer and a size. So a manual (non-range) for loop on std::span is susceptible to this too. – Darbies
[[assume(&va.size < va.data || &va.size >= va.data + va.size)]] would work. That ought to prevent a runtime check in both cases, making it irrelevant whether the optimizer would insert such a runtime check. – Harald
Runtime checks would still be needed for the [data, data+size) and [other.data, other.data+size) intervals; for the size member I already accept add2 as a source-level workaround. What I actually want is vectorization in add1, which is written intuitively; it's fine if that takes more runtime checks. Anyway, I've tried __assume in MSVC, __builtin_assume in Clang, and [[assume]] in GCC -- with no effect. – Darbies