Jump back some iterations for vectorized remainder loop

I'm vectorizing a loop by hand, processing 4 items at a time. The total number of items may not be a multiple of 4, so I have some items left over at the end of my main loop. I was thinking that I could handle those remainder items with the same main vectorized loop if the count is larger than 4 and redoing some items is safe. For example, if I need to process 10 items, I could process 0123, 4567, 6789 in three iterations. I couldn't find any references to this technique. Is it a bad idea for some reason I'm not seeing?

#include <stdint.h>
#include <stddef.h>

void inc(int32_t const* __restrict in, int32_t* out, size_t count)
{
    if (count < 4)
    {
        // Not enough items for even one vector: plain scalar loop.
        for (size_t i = 0; i < count; ++i)
            out[i] = in[i] + 42;
    }
    else
    {
        typedef int32_t v4i __attribute__ ((vector_size (16), aligned(4)));
        for (size_t i = 0;;)
        {
            // Main loop: whole vectors of 4 items.
            for (; i + 4 <= count; i += 4)
            {
                (v4i&)out[i] = (v4i&)in[i] + 42;
            }
            if (i == count)
                break;
            // Remainder: jump back so the last vector ends exactly at count,
            // redoing up to 3 already-processed items.
            i = count - 4;
        }
    }
}
Vespucci answered 17/11, 2017 at 14:52 Comment(9)
It's more usual to process fake 10, 11 to make it up to 12 (just make sure your initial data structures are aligned and have enough reserve to incorporate the junk).Limonene
"0123, 4567, __67__89" Beware that re-doing part of the second group could introduce alignment issues. Why not append a few dummy items?Mukden
@Limonene You can't pad the array when it's actually a subarray of a larger one.Vespucci
@Mukden I didn't understand your __ syntax.Vespucci
@BrunoMartinez maybe then copy the 2 pieces into a small local 4-element buffer, and copy the result back (or the other way: copy the 2 excess elements, let them be destroyed by the calculation, and copy the original values back to restore them). ... anyway, the alignment is probably too important... it may even be much better to have a single-element code variant to process only 8 and 9, not having any fake 10 and 11. ... but avoiding unaligned subarrays by having some padding or a more clever structure of data is often the first choice - if possible of course. At some stage it may be inevitable to process unaligned data 1 element at a time.Limonene
@Limonene I think the intel compiler uses unaligned reads even when it knows that data is aligned and an aligned read would work. If data is aligned, loadu is as fast as loada. The same loop can work optimally on aligned and unaligned data.Vespucci
This is a perfectly valid technique, but note that it fails if count < 4, and it doesn't work in general for all vectorized operations. Other approaches are: use a mask for the final partial vector so that you only process/store the required elements; or use scalar code for the final 1..3 elements.Mullis
@PaulR I thought about the mask, but I'm not sure if you can reuse the same loop without losing efficiency. Are operations with all lanes enabled as fast as unmasked operations? At least I lose a register for the mask.Vespucci
I'm really only talking about the problem in general - you need to decide which method to use on a case-by-case basis - it will depend on the algorithm, the target architecture, and various other factors. In this particular case I think your approach is probably fine, but don't regard it as a panacea.Mullis

When your input and output don't overlap, and it's safe to re-process the same element multiple times, this general idea is great. This is often the case when the output is write-only, e.g. out[i] = pure_func(in[i]) is idempotent, but out[i] += func(in[i]) is not. Reductions like sum += in[i] are also less amenable.
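To make that distinction concrete, here's a minimal sketch (my own illustration, not from the question): re-running a vector over already-processed elements is a no-op for the first function but double-counts for the second.

#include <stdint.h>
#include <stddef.h>

// Safe to re-process: storing the same pure result twice changes nothing,
// so an overlapping final vector is harmless.
void safe(int32_t const* __restrict in, int32_t* __restrict out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = in[i] + 42;     // out[i] depends only on in[i]
}

// NOT safe to re-process: elements covered twice by an overlapping
// final vector would be accumulated twice.
void unsafe(int32_t const* __restrict in, int32_t* __restrict out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] += in[i];         // redoing element i adds in[i] a second time
}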

When it's usable, it's often better than a scalar cleanup loop.

When it's not quite this simple, see @Paul R's comments, and this related question: Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all (TL:DR: actually using vmaskmovps isn't usually good, but other masking and unaligned-load tricks are.)


Your specific implementation (making the same loop reusable for the overlapping last vector) ends up making the inner loop quite bad with clang, doing i+8 and i+4 inside each inner-loop iteration.

gcc manages to make a slightly less bad inner loop, but it's still less efficient than it could be with gcc7.2 -O3 -mtune=haswell (asm output on Godbolt). The inner loop has extra overhead because it's saving the old i every time it does i += 4, because of CSE between that and the i+4 in the loop condition, and/or the i = count - 4 outside the loop. (gcc is sometimes pretty dumb about putting extra work inside an inner loop instead of recalculating or undoing an operation afterwards.)

Source + asm on the Godbolt compiler explorer (for the original and an improved version (see below)).

# Your original source, built with gcc7.2 -O3
# we get here with some registers already set up when count >= 4
.L2:                     # top of outer "loop"
    lea     rcx, [rax+4]
    cmp     rcx, rdx
    ja      .L4

.L17:                    # Inner loop
    movdqu  xmm0, XMMWORD PTR [rdi+rax*4]
    paddd   xmm0, xmm1
    movups  XMMWORD PTR [rsi+rax*4], xmm0
    mov     rax, rcx               # save RAX for use outside the loop!
    lea     rcx, [rax+4]           # 3 uops of loop overhead
    cmp     rcx, rdx
    jbe     .L17

.L4:
    # missed optimization: do rcx-4 here instead of extra work in every inner-loop iteration
    cmp     rax, rdx
    je      .L1             # ret if we're done (whole number of vectors)
    mov     rax, r8
    jmp     .L2             # else back into the loop for the last vector

Using an indexed addressing mode doesn't particularly hurt with SSE2, but is not a good thing with AVX. AVX allows unaligned memory operands to any instruction (except vmovdqa), so the compiler can fold the load into vpaddd xmm0, xmm1, [rdi+rax*4] if you build with -march=haswell. But that can't micro-fuse even on Haswell, so it's still 2 uops for the front-end.


We can fix clang's and gcc's inner loops by using i <= count - 4 as the loop condition. We know count >= 4 at this point, so count - 4 will never wrap to a huge number. (Note that i + 4 could wrap and thus create an infinite loop if count was within 4 of the max value for the type. This is probably what was giving clang such a hard time and leading to missed optimizations.)
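In source form that might look something like this (a sketch of the improved version, not necessarily identical to what's on Godbolt, keeping the question's GNU vector-extension style and jump-back structure):

void inc(int32_t const* __restrict in, int32_t* out, size_t count)
{
    if (count < 4)
    {
        for (size_t i = 0; i < count; ++i)
            out[i] = in[i] + 42;
        return;
    }
    typedef int32_t v4i __attribute__ ((vector_size (16), aligned(4)));
    size_t i = 0;
    for (;;)
    {
        // count >= 4 here, so count - 4 can't wrap, and i <= count - 4
        // can't hit the i + 4 overflow problem described above.
        for (; i <= count - 4; i += 4)
            (v4i&)out[i] = (v4i&)in[i] + 42;
        if (i == count)
            break;
        i = count - 4;   // jump back for one final, possibly overlapping vector
    }
}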

Now we get an identical inner loop from gcc7.2 and clang5.0 (both using -O3 -march=haswell). (Literally identical, even using the same registers, just a different label name.)

.L16:
    vpaddd  xmm0, xmm1, XMMWORD PTR [rdi+rax*4]
    vmovups XMMWORD PTR [rsi+rax*4], xmm0
    add     rax, 4
    cmp     rax, rcx
    jbe     .L16

This is 5 fused-domain uops on Haswell, so can run at best one iteration per 1.25 clocks, bottlenecked on the front-end, not on load, store, or SIMD paddd throughput. (It may bottleneck on memory bandwidth over large inputs, but unrolling at least a bit is generally a good thing even for that case.)

clang would have unrolled for you when auto-vectorizing, and used AVX2, so you're actually getting worse asm by manually vectorizing in this case where the compiler can do it easily. (Unless you build with gcc -O2, which doesn't enable auto-vectorization).

You can actually see an example of how clang auto-vectorizes, because it vectorizes the cleanup loop, for some reason not realizing that it can never run with count >= 4. Yes, it checks if count > 3 and jumps to the manually-vectorized loop, then checks if it's 0, then checks if it's > 32 and jumps to an auto-vectorized version of the cleanup loop... /facepalm.


Actually jumping back into the main loop is mostly incompatible with unrolling in the C source, and will probably defeat compiler unrolling.

As noted above, unrolling the inner loop is usually a win.

In asm, you could certainly set things up so you could jump back into the inner loop for the last 1 or 2 vectors after an unrolled loop, and then maybe again for an unaligned final vector.

This may be bad for branch-prediction, though. The loop-branch will probably be predicted strongly taken, so might mispredict each time you jump back into the loop. Separate cleanup code won't have that problem, so if it's a real problem then separate cleanup code (duplicating the inner loop body) will be better.

You can usually wrap your inner loop logic in an inline function that you can use multiple times in an unrolled loop, and once in a cleanup / last-unaligned vector block.
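A sketch of that pattern (my own illustration, reusing the question's vector-extension style; the names inc_unrolled and do_block are hypothetical): the per-vector body lives in an inline function, the main loop is unrolled by two vectors, and the cleanup uses the same helper once for a whole leftover vector and once for a possibly-overlapping final one.

typedef int32_t v4i __attribute__ ((vector_size (16), aligned(4)));

// One vector's worth of work, shared by the main loop and the cleanup.
static inline void do_block(int32_t const* __restrict in, int32_t* __restrict out,
                            size_t i)
{
    (v4i&)out[i] = (v4i&)in[i] + 42;
}

void inc_unrolled(int32_t const* __restrict in, int32_t* __restrict out, size_t count)
{
    if (count < 4)
    {
        for (size_t i = 0; i < count; ++i)
            out[i] = in[i] + 42;
        return;
    }
    size_t i = 0;
    if (count >= 8)                        // so count - 8 can't wrap below
        for (; i <= count - 8; i += 8)     // main loop, unrolled by 2 vectors
        {
            do_block(in, out, i);
            do_block(in, out, i + 4);
        }
    if (i <= count - 4)                    // one whole leftover vector
    {
        do_block(in, out, i);
        i += 4;
    }
    if (i != count)                        // final, possibly overlapping vector
        do_block(in, out, count - 4);
}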

Although your idea doesn't appear to hurt the inner loop in this case if you do it carefully, the amount of extra instructions and extra branching before/after the inner loop is probably more than you'd get with a simpler approach to cleanup. So again, this might be useful if the loop body was very large, but in this case it's only a few instructions, and cheaper to duplicate than to branch around.


One efficient way to solve the same problem is with a possibly-overlapping last vector, so there's no conditional branch that depends on whether the element count is a whole number of full vectors or not. The final vector has to use unaligned load/store instructions because you don't know its alignment.
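As a sketch of that (my own illustration with SSE2 intrinsics to make the unaligned loadu/storeu explicit; the name inc_overlap_tail is hypothetical, and it assumes count >= 4 and that re-storing elements is safe, as above):

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

void inc_overlap_tail(int32_t const* __restrict in, int32_t* __restrict out,
                      size_t count)               // requires count >= 4
{
    const __m128i forty_two = _mm_set1_epi32(42);
    size_t i = 0;
    for (; i < count - 4; i += 4)                 // all but the last vector's worth
    {
        __m128i v = _mm_loadu_si128((const __m128i*)(in + i));
        _mm_storeu_si128((__m128i*)(out + i), _mm_add_epi32(v, forty_two));
    }
    // Unconditionally do a final vector ending exactly at count.
    // It overlaps the loop's work by 0..3 elements; no branch depends on
    // whether count is a whole number of vectors.
    __m128i v = _mm_loadu_si128((const __m128i*)(in + count - 4));
    _mm_storeu_si128((__m128i*)(out + count - 4), _mm_add_epi32(v, forty_two));
}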

On modern x86 (Intel since Nehalem, AMD since Bulldozer, see Agner Fog's guides), unaligned load/store instructions on pointers that are actually aligned at runtime have no penalty. (Unlike on Core2 where movdqu is always slower even if the data is actually aligned.) IDK what the situation is on ARM, MIPS, or other SIMD architectures.

You can use this same idea to handle a possibly-unaligned first vector (which doesn't overlap the main loop if the pointer is in fact aligned) and then make the main loop use aligned vectors. With two pointers, one might be misaligned relative to the other, though. The usual advice (from Intel's optimization manual) is to align the output pointer (the one you're storing through).
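A sketch of that variant (again my own illustration; inc_align_out is a hypothetical name, and it assumes count >= 4, that out is at least 4-byte aligned, and that redoing elements is safe): peel one unaligned first vector, round i up to the first index where the output is 16-byte aligned, run the main loop with aligned stores, then finish with a possibly-overlapping unaligned final vector.

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

void inc_align_out(int32_t const* __restrict in, int32_t* __restrict out,
                   size_t count)                  // requires count >= 4
{
    const __m128i forty_two = _mm_set1_epi32(42);

    // First vector: elements 0..3, unaligned store.
    _mm_storeu_si128((__m128i*)out,
                     _mm_add_epi32(_mm_loadu_si128((const __m128i*)in), forty_two));

    // First index where &out[i] is 16-byte aligned: 1..3, or a full 4 if out
    // is already 16-byte aligned (so the peeled vector doesn't overlap at all).
    size_t i = (16 - ((uintptr_t)out & 15)) / sizeof(int32_t);

    for (; i <= count - 4; i += 4)                // aligned stores in the main loop
    {
        __m128i v = _mm_loadu_si128((const __m128i*)(in + i));  // in may still be misaligned
        _mm_store_si128((__m128i*)(out + i), _mm_add_epi32(v, forty_two));
    }

    if (i != count)                               // possibly-overlapping final vector
    {
        __m128i v = _mm_loadu_si128((const __m128i*)(in + count - 4));
        _mm_storeu_si128((__m128i*)(out + count - 4), _mm_add_epi32(v, forty_two));
    }
}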


You can and should use __restrict on both pointers. There's no reason not to unless one of them actually can alias something else. If I were only going to put it on one, I'd use it on the output pointer, but it can matter to do both.
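For the question's function, that just means qualifying both parameters, e.g.:

void inc(int32_t const* __restrict in, int32_t* __restrict out, size_t count);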

Inefficacious answered 17/11, 2017 at 21:30 Comment(2)
Yeah I've used the "jump back" trick in asm before: if your loop body is really regular in code size you can even calculate the jump-back amount directly without needing a lookup from a jump table with labels. Another approach is to solve the problem at the start: jump in to the middle of the loop based on the remainder. This is often a bit cleaner: with the jump-back at the end you need another branch around it when you don't want to jump back (i.e., the loop is really done). With the jump-in you have no such problem (but nasm is fussier about calculating the addresses of later labels).Rochdale
Good luck getting a compiler to generate those though! For hot code with moderate loop counts and unpredictable remainders these techniques aren't great because of the extra mispredict they usually imply. The unaligned overlapping-vector approach is probably much better since it avoids the mispredict. They are good for less frequently executed code or where the loop remainder is very predictable. One downside of reusing the loop is that it limits the type of optimizations you can do: combining operations between iterations in the unrolled loop might make it unsuitable for jump-back.Rochdale
