When your input and output don't overlap, and it's safe to re-process the same element multiple times, this general idea is great. This is often the case when the output is write-only: e.g. `out[i] = pure_func(in[i])` is idempotent, but `out[i] += func(in[i])` is not. Reductions like `sum += in[i]` are also less amenable.
When it's usable, it's often better than a scalar cleanup loop.
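For example, a minimal sketch of the difference (`pure_func` here is just a hypothetical stand-in for any function of the input alone):

```c
#include <stddef.h>

static int pure_func(int x) { return x + 1; }   /* stand-in for any pure function */

/* Re-running an iteration just stores the same value again: harmless. */
void overlap_safe(int *out, const int *in, size_t i)
{
    out[i] = pure_func(in[i]);
    out[i] = pure_func(in[i]);   /* idempotent: same result either way */
}

/* Re-running an iteration accumulates twice: wrong. */
void overlap_unsafe(int *out, const int *in, size_t i)
{
    out[i] += pure_func(in[i]);
    out[i] += pure_func(in[i]);  /* this element ends up counted twice */
}
```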
When it's not quite this simple, see @Paul R's comments, and this related question: Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. (TL:DR: actually using `vmaskmovps` isn't usually good, but other masking and unaligned-load tricks are.)
Your specific implementation (making the same loop reusable for the overlapping last vector) ends up making the inner loop quite bad with clang, doing both `i+8` and `i+4` inside each inner-loop iteration.
gcc manages to make a slightly less bad inner loop, but it's still less efficient than it could be with gcc7.2 `-O3 -mtune=haswell` (asm output on Godbolt). The inner loop has extra overhead because it saves the old `i` every time it does `i += 4`, because of CSE between that and the `i+4` in the loop condition, and/or the `i = count - 4` outside the loop. (gcc is sometimes pretty dumb about putting extra work inside an inner loop instead of recalculating, or undoing an operation afterwards.)
Source + asm on the Godbolt compiler explorer (for the original and an improved version (see below)).
```asm
# Your original source, built with gcc7.2 -O3
# we get here with some registers already set up when count >= 4
.L2: # top of outer "loop"
lea rcx, [rax+4]
cmp rcx, rdx
ja .L4
.L17: # Inner loop
movdqu xmm0, XMMWORD PTR [rdi+rax*4]
paddd xmm0, xmm1
movups XMMWORD PTR [rsi+rax*4], xmm0
mov rax, rcx # save RAX for use outside the loop!
lea rcx, [rax+4] # 3 uops of loop overhead
cmp rcx, rdx
jbe .L17
.L4:
# missed optimization: do rcx-4 here instead of extra work in every inner-loop iteration
cmp rax, rdx
je .L1 # ret if we're done (whole number of vectors)
mov rax, r8
jmp .L2 # else back into the loop for the last vector
```
Using an indexed addressing mode doesn't particularly hurt with SSE2, but is not a good thing with AVX. AVX allows unaligned memory operands to any instruction (except `vmovdqa`), so the compiler can fold the load into `vpaddd xmm0, xmm1, [rdi+rax*4]` if you build with `-march=haswell`. But that can't micro-fuse even on Haswell, so it's still 2 uops for the front-end.
We can fix clang's and gcc's inner loops by using `i <= count - 4`. Since we know `count >= 4` at this point, `count - 4` will never wrap to a huge number. (Note that `i + 4` could wrap and thus create an infinite loop if `count` was within 4 of the max value for the type. This is probably what was giving clang such a hard time and leading to missed optimizations.)
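A sketch of what the fixed loop could look like in C with SSE2 intrinsics (assuming the operation is adding a broadcast constant, as the asm suggests; the function name is made up and the cleanup is left as a comment):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Assumes count >= 4 has already been checked, so count - 4 can't wrap. */
void add_vec(int *out, const int *in, size_t count, int x)
{
    __m128i vx = _mm_set1_epi32(x);
    size_t i = 0;
    for (; i <= count - 4; i += 4) {    /* no redundant i+4 recomputation */
        __m128i v = _mm_loadu_si128((const __m128i*)(in + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_add_epi32(v, vx));
    }
    /* handle the last count % 4 elements here, e.g. with a
       possibly-overlapping final vector (see below) */
}
```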
Now we get an identical inner loop from gcc7.2 and clang5.0 (both using `-O3 -march=haswell`). (Literally identical, even using the same registers; just a different label name.)
```asm
.L16:
vpaddd xmm0, xmm1, XMMWORD PTR [rdi+rax*4]
vmovups XMMWORD PTR [rsi+rax*4], xmm0
add rax, 4
cmp rax, rcx
jbe .L16
```
This is 5 fused-domain uops on Haswell, so it can run at best one iteration per 1.25 clocks, bottlenecked on the front-end, not on load, store, or SIMD `paddd` throughput. (It may bottleneck on memory bandwidth over large inputs, but unrolling at least a bit is generally a good thing even for that case.)
clang would have unrolled for you when auto-vectorizing, and used AVX2, so you're actually getting worse asm by manually vectorizing in this case where the compiler can do it easily. (Unless you build with gcc `-O2`, which doesn't enable auto-vectorization.)
You can actually see an example of how clang auto-vectorizes, because it vectorizes the cleanup loop, for some reason not realizing that it can't ever run with `count >= 4`. Yes, it checks if `count > 3` and jumps to the manually-vectorized loop, then checks if it's `0`, then checks if it's `> 32` and jumps to an auto-vectorized version of the cleanup loop... /facepalm.
Actually jumping back into the main loop is mostly incompatible with unrolling in the C source, and will probably defeat compiler unrolling.
As noted above, unrolling the inner loop is usually a win.
In asm, you could certainly set things up so you could jump back into the inner loop for the last 1 or 2 vectors after an unrolled loop, and then maybe again for an unaligned final vector.
This may be bad for branch-prediction, though. The loop-branch will probably be predicted strongly taken, so might mispredict each time you jump back into the loop. Separate cleanup code won't have that problem, so if it's a real problem then separate cleanup code (duplicating the inner loop body) will be better.
You can usually wrap your inner loop logic in an inline function that you can use multiple times in an unrolled loop, and once in a cleanup / last-unaligned vector block.
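A sketch of that structure, continuing the same hypothetical add-a-constant example; `do_block` is a made-up helper name:

```c
#include <emmintrin.h>
#include <stddef.h>

/* One vector's worth of work, reusable in the unrolled loop and in cleanup. */
static inline void do_block(int *out, const int *in, size_t i, __m128i vx)
{
    __m128i v = _mm_loadu_si128((const __m128i*)(in + i));
    _mm_storeu_si128((__m128i*)(out + i), _mm_add_epi32(v, vx));
}

void add_vec_unrolled(int *out, const int *in, size_t count, int x)
{
    __m128i vx = _mm_set1_epi32(x);
    size_t i = 0;
    if (count >= 8) {
        for (; i <= count - 8; i += 8) {   /* unrolled by 2 vectors */
            do_block(out, in, i,     vx);
            do_block(out, in, i + 4, vx);
        }
    }
    if (count >= 4) {
        if (i <= count - 4) {              /* one more whole vector left */
            do_block(out, in, i, vx);
            i += 4;
        }
        if (i != count)                    /* possibly-overlapping final vector: */
            do_block(out, in, count - 4, vx);  /* safe because the store is write-only */
    } else {
        for (; i < count; i++)             /* tiny counts: scalar fallback */
            out[i] = in[i] + x;
    }
}
```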
Although your idea doesn't appear to hurt the inner loop in this case if you do it carefully, the amount of extra instructions and extra branching before/after the inner loop is probably more than you'd get with a simpler approach to cleanup. So again, this might be useful if the loop body were very large, but in this case it's only a few instructions, and cheaper to duplicate than to branch around.
One efficient way to solve the same problem is with a possibly-overlapping last vector, so there's no conditional branch that depends on whether the element count is a whole number of full vectors or not. The final vector has to use unaligned load/store instructions because you don't know its alignment.
On modern x86 (Intel since Nehalem, AMD since Bulldozer, see Agner Fog's guides), unaligned load/store instructions on pointers that are actually aligned at runtime have no penalty. (Unlike on Core 2, where `movdqu` is always slower even if the data is actually aligned.) IDK what the situation is on ARM, MIPS, or other SIMD architectures.
You can use this same idea to handle a possibly-unaligned first vector (which doesn't overlap if it is in fact aligned) and then make the main loop use aligned vectors. With two pointers, one might be misaligned relative to the other, though. The usual advice (from Intel's optimization manual) is to align the output pointer (the one you're storing through).
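One possible way to put those pieces together, as a sketch only: it assumes `count >= 4`, 4-byte `int` elements, pointers with at least element alignment, and non-overlapping buffers, and the function name is made up:

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

void add_vec_aligned_out(int *out, const int *in, size_t count, int x)
{
    __m128i vx = _mm_set1_epi32(x);

    /* First vector: unaligned load/store starting at element 0. */
    __m128i v0 = _mm_loadu_si128((const __m128i*)in);
    _mm_storeu_si128((__m128i*)out, _mm_add_epi32(v0, vx));

    /* Start the main loop at the first 16-byte boundary of the output. */
    size_t i = ((16 - ((uintptr_t)out & 15)) & 15) / sizeof(int);
    if (i == 0)
        i = 4;                        /* out was already aligned: don't redo 0..3 */

    /* Main loop: aligned stores to out; in may still be misaligned, so loadu. */
    for (; i <= count - 4; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i*)(in + i));
        _mm_store_si128((__m128i*)(out + i), _mm_add_epi32(v, vx));
    }

    /* Final vector: unaligned, possibly overlapping the previous one. */
    if (i != count) {
        __m128i v = _mm_loadu_si128((const __m128i*)(in + count - 4));
        _mm_storeu_si128((__m128i*)(out + count - 4), _mm_add_epi32(v, vx));
    }
}
```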
You can and should use `__restrict` on both pointers. There's no reason not to, unless it actually can alias something else. If I were only going to do one, I'd use it on the output pointer, but it can matter to do both.
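For instance (the function name is just illustrative):

```c
#include <stddef.h>

/* __restrict on both pointers promises the compiler they never alias,
   so gcc/clang can auto-vectorize without a runtime overlap check. */
void add_vec_restrict(int *__restrict out, const int *__restrict in,
                      size_t count, int x)
{
    for (size_t i = 0; i < count; i++)
        out[i] = in[i] + x;
}
```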