Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all

C

2

15

gcc 5.3 with -O3 -mavx -mtune=haswell for x86-64 makes surprisingly bulky code to handle potentially-misaligned inputs for code like:

// convenient simple example of compiler input
// I'm not actually interested in this for any real program
void floatmul(float *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

clang uses unaligned load/store instructions, but gcc does a scalar intro/outro and an aligned vector loop: It peels off the first up-to-7 unaligned iterations, fully unrolling that into a sequence of

    vmovss  xmm0, DWORD PTR [rdi]
    vaddss  xmm0, xmm0, xmm0      ; multiply by two
    vmovss  DWORD PTR [rdi], xmm0
    cmp     eax, 1
    je      .L13
    vmovss  xmm0, DWORD PTR [rdi+4]
    vaddss  xmm0, xmm0, xmm0
    vmovss  DWORD PTR [rdi+4], xmm0
    cmp     eax, 2
    je      .L14
    ...

This seems pretty terrible, esp. for CPUs with a uop cache. I reported a gcc bug about this, with a suggestion for smaller/better code that gcc could use when peeling unaligned iterations. It's probably still not optimal, though.

This question is about what actually would be optimal with AVX. I'm asking about general-case solutions that gcc and other compilers could/should use. (I didn't find any gcc mailing list hits with discussion about this, but didn't spend long looking.)

There will probably be multiple answers, since what's optimal for -mtune=haswell will probably be different from what's optimal for -mtune=bdver3 (steamroller). And then there's the question of what's optimal when allowing instruction set extensions (e.g. AVX2 for 256b integer stuff, BMI1 for turning a count into a bitmask in fewer instructions).

I'm aware of Agner Fog's Optimizing Assembly guide, Section 13.5 Accessing unaligned data and partial vectors. He suggests either using unaligned accesses, doing an overlapping write at the start and/or end, or shuffling data from aligned accesses (but PALIGNR only takes an imm8 count, so 2x pshufb / por). He discounts VMASKMOVPS as not useful, probably because of how badly it performs on AMD. I suspect that if tuning for Intel, it's worth considering. It's not obvious how to generate the correct mask, hence the question title.

It might turn out that it's better to simply use unaligned accesses, like clang does. For short buffers, the overhead of aligning might kill any benefit from avoiding cacheline splits for the main loop. For big buffers, main memory or L3 as the bottleneck may hide the penalty for cacheline splits. If anyone has experimental data to back this up for any real code they've tuned, that's useful information too.

VMASKMOVPS does look usable for Intel targets. (The SSE version is horrible, with an implicit non-temporal hint, but the AVX version doesn't have that. There's even a new intrinsic to make sure you don't get the SSE version for 128b operands: _mm128_maskstore_ps) The AVX version is only a little bit slow on Haswell:

3 uops / 4c latency / 1-per-2c throughput as a load.
4 uops / 14c latency / 1-per-2c throughput as a 256b store.
4 uops / 13c latency / 1-per-1c throughput as a 128b store.

The store form is still unusably slow on AMD CPUs, both Jaguar (1 per 22c tput) and Bulldozer-family: 1 per 16c on Steamroller (similar on Bulldozer), or 1 per ~180c throughput on Piledriver.

But if we do want to use VMASKMOVPS, we need a vector with the high bit set in each element that should actually be loaded/stored. PALIGNR and PSRLDQ (for use on a vector of all-ones) only take compile-time-constant counts.

Notice that the other bits don't matter: it doesn't have to be all-ones, so scattering some set bits out to the high bits of the elements is a possibility.

Camphene answered 16/12, 2015 at 8:16 Comment(5)

You don't actually need to use the masked instructions for this kind of thing. Instead, use a single unaligned operation at each end of the buffer; those are guaranteed to overlap an aligned location. This results in some duplicate work, but no masks and no conditional branches. You just need to be careful about ordering loads and stores. – Conall 16/12, 2015 at 18:25

@StephenCanon: Oh right, pipelining the loop so a load for the next iteration happens before the store for the current iteration would let this work even for an algo that rewrites the same buffer. I should have picked an example that sums the array, where you need to only count each element once. (although that can be done by masking in registers, rather than masked loads/stores.) Hmm, any ideas coming up with a simple example where neither overlapping stores nor masking in registers is sufficient? – Camphene 16/12, 2015 at 18:37

The canonical example is when you have to deal with multiple buffers that aren't similarly aligned. Throw up your hands and use unaligned loads. – Conall 16/12, 2015 at 18:43

@StephenCanon: Right, because you need a[i] to line up with b[i]. No ideas on any possible use-case for VMOVMASKPS? I guess these mask-generation ideas are also useful for a reduction where you need to count each element exactly once, but then you use the mask with ANDPS in registers, not VMOVMASKPS. – Camphene 16/12, 2015 at 18:52

MOVMASKPS is very much one of those "it seemed like a good idea at the time" things AFAICT. I've never used it. – Conall 16/12, 2015 at 19:2

C

3

AVX-only: Unaligned accesses at the start/end, pipelining loads to avoid problems when rewriting in place.

Thanks to @StephenCanon for pointing out that this is better than VMASKMOVPS for anything that VMASKMOVPS could do to help with looping over unaligned buffers.

This is maybe a bit much to expect a compiler to do as a loop transformation, esp. since the obvious way can make Valgrind unhappy (see below).

section .text
global floatmul   ; (float *rdi)
floatmul:

    lea    rdx, [rdi + 4096 - 32]  ; one full vector less than the end address (calculated *before* masking for alignment).
                ;;  replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)

    vmovups  ymm0, [rdi]        ; first vector
    vaddps   ymm0, ymm0, ymm0   ; *= 2.0
    ; don't store yet

    lea    rax, [rdi+32]
    and    rax, ~0x1c           ; 0x1c = 7 << 2 = 0b11100  ; clear those bits.
    vmovups  ymm1, [rax]        ; first aligned vector, for use by first loop iteration

    vmovups  [rdi], ymm0        ; store the first unaligned vector
    vmovups  ymm0, [rdx]        ; load the *last* unaligned vector
    
.loop:
      ;; on entry: [rax] is already loaded into ymm1
    vaddps   ymm1, ymm1, ymm1   ; *= 2.0
    vmovups  [rax]              ; vmovaps would fault if p%4 != 0
    add      rax, 32
    vmovups  ymm1, [rax]
    cmp      rax, rdx           ; while( (p+=8) < (endp-8) );
    jb  .loop

    ; discard ymm1.  It includes data from beyond the end of the array (aligned case: same as ymm0)

    vaddps   ymm0, ymm0, ymm0   ; the last 32B, which we loaded before the loop
    vmovups  [rdx], ymm0
    ret

    ;   End alignment:
    ; a[] = XXXX XXXX ABCD E___    _ = garbage past the end
    ;                  ^rdx
    ;       ^rax ^rax ^rax ^rax(loop exit)

    ;   ymm0 = BCDE
    ;   ymm1 loops over ..., XXXX, ABCD, E___
    ;   The last load off the end of the array includes garbage
    ;     because we pipeline the load for the next iteration

Doing a load from the end of the array at the start of the loop seems a little weird, but hopefully it doesn't confuse the hardware prefetchers, or slow down getting the beginning of the array streaming from memory.

Overhead:

2 extra integer uops total (to set up the aligned-start). We're already using the end pointer for the normal loop structure, so that's free.
2 extra copies of the loop body (load/calc/store). (First and last iteration peeled).

Compilers probably won't be happy about emitting code like this, when auto-vectorizing. Valgrind will report accesses outside of array bounds, and does so by single-stepping and decoding instructions to see what they're accessing. So merely staying within the same page (and cache line) as the last element in the array isn't sufficient. Also note that if the input pointer isn't 4B-aligned, we can potentially read into another page and segfault.

To keep Valgrind happy, we could stop the loop two vector widths early, to do the special-case load of the unaligned last vector-width of the array. That would require duplicating the loop body an extra time (insignificant in this example, but it's trivial on purpose.) Or maybe avoid pipelining by having the intro code jump into the middle of the loop. (That may be sub-optimal for the uop-cache, though: (parts of) the loop body may end up in the uop cache twice.)

TODO: write a version that jumps into the loop mid-way.

Camphene answered 16/12, 2015 at 21:27 Comment(4)

I'm curious - why would jumping into the loop mid-way end up with duplicate uops in the cache? I thought every instruction mapped to at most one chunk of uops in the cache. – Maui 21/6, 2016 at 19:59

@BeeOnRope: the uop cache is like a trace cache: it's more like a cache of basic blocks than just decoded instructions, to reduce the impact of taken branches. IDK if this is related, but predicted-not-taken conditional branches can run on more ports than predicted-taken branches, on Haswell and later. Agner Fog's microarch pdf says something about it. I haven't done tests, and have no idea when you'd actually get the same code in different uop cache lines, but Agner says it's possible and can happen in practice. – Camphene 21/6, 2016 at 20:11

That's pretty interesting. I knew the earlier NetBurst cache was more like that, but I thought the uop was closer to a traditional icache (except, of course, that it holds uops). So what if you have something like a switch() with all cases small and fallthru (e.g., a duff's device), compiled to small indirect jumps forward? After the switch, control flow converges, of course. If you had 10 possible targets and were using all it seems like you'd get 10 entries in the uop cache, even including all the common code after the switch (at least until the next branch?). Or is merging possible? – Maui 21/6, 2016 at 20:40

After reading that part of Agner's microarch doc again I think I get it. Each block with a unique entry point (e.g., jmp target) will create a new entry in the uop cache, but every time you pass a 32 byte boundary in the code, it might merge up with the existing entries. So if the code in the switch table is small (in particular, if all jump targets are in the same 32B block), each case is likely to get its own entry, but then all share the entry following the case (and subsequent), once they pass into the next 32B block. So you lose a fair amount, but it's not catastrophic. – Maui 21/6, 2016 at 21:14

C

8

You could turn an integer bitmask like (1 << (vector1.size() & 3)) - 1 into a vector mask on the fly with is there an inverse instruction to the movemask instruction in intel avx2? Or:

Load a mask for VMOVMASKPS from a window into a table. AVX2, or AVX1 with a few extra instructions or a larger table.

The mask can also be used for ANDPS in registers in a reduction that needs to count each element exactly once. As Stephen Canon points out in comments on the OP, pipelining loads can allow overlapping unaligned stores to work even for a rewrite-in-place function like the example I picked, so VMASKMOVPS is NOT the best choice here.

This should be good on Intel CPUs, esp. Haswell and later for AVX2.

Agner Fog's method for getting a pshufb mask actually provided an idea that is very efficient: do an unaligned load taking a window of data from a table. Instead of a giant table of masks, use an index as a way of doing a byte-shift on data in memory.

Masks in LSB-first byte order (as they're stored in memory), not the usual notation for {X3,X2,X1,X0} elements in a vector. As written, they line up with an aligned window including the start/end of the input array in memory.

start misalign count = 0: mask = all-ones (Aligned case)
start misalign count = 1: mask = {0,-1,-1,-1,-1,-1,-1,-1} (skip one in the first 32B)
...
start misalign count = 7: mask = {0, 0, 0, 0, 0, 0, 0,-1} (skip all but one in the first 32B)
end misalign count = 0: no trailing elements. mask = all-ones (Aligned case).
this is the odd case, not similar to count=1. A couple extra instructions for this special case is worth avoiding an extra loop iteration and a cleanup with a mask of all-zeros.
end misalign count = 1: one trailing element. mask = {-1, 0, 0, 0, 0, 0, 0, 0}
...
end misalign count = 7: seven trailing elems. mask = {-1,-1,-1,-1,-1,-1,-1, 0}

Untested code, assume there are bugs

section .data
align 32  ; preferably no cache-line boundaries inside the table

; byte elements, to be loaded with pmovsx. all-ones sign-extends
    DB  0,  0,  0,  0,  0,  0,  0,  0
masktable_intro:                      ; index with 0..-7
    DB -1, -1, -1, -1, -1, -1, -1, -1
masktable_outro:                      ; index with -8(aligned), or -1..-7
    DB  0,  0,  0,  0,  0,  0,  0,  0

; the very first and last 0 bytes are not needed, since we avoid an all-zero mask.


section .text
global floatmul   ; (float *rdi)
floatmul:
    mov    eax, edi
    and    eax, 0x1c            ; 0x1c = 7 << 2 = 0b11100

    lea    rdx, [rdi + 4096 - 32]  ; one full vector less than the end address (calculated *before* masking for alignment).
                ;;  replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)

    and    rdi, ~0x1c           ; Leave the low 2 bits alone, so this still works on misaligned floats.

    shr    eax, 2               ; misalignment-count, in the range [0..7]

    neg        rax
    vpmovsxbd  ymm0, [masktable_intro + rax]  ; Won't link on OS X: Need a separate LEA for RIP-relative

    vmaskmovps ymm1, ymm0, [rdi]
    vaddps     ymm1, ymm1, ymm1   ; *= 2.0
    vmaskmovps [rdi], ymm0, ymm1

    ;;; also prepare the cleanup mask while the table is still hot in L1 cache

    ; if the loop count known to be a multiple of the vector width,
    ; the alignment of the end will be the same as the alignment of the start
    ; so we could just invert the mask
    ; vpxor    xmm1, xmm1, xmm1     ; doesn't need an execution unit
    ; vpcmpeqd ymm0, ymm1, ymm0

    ; In the more general case: just re-generate the mask from the one-past-the-end addr
    mov    eax, edx
    xor    ecx, ecx      ; prep for setcc
    and    eax, 0x1c     ; sets ZF when aligned
    setz   cl            ; rcx=1 in the aligned special-case, else 0
    shr    eax, 2
    lea    eax, [rax + rcx*8]   ; 1..7, or 8 in the aligned case
    neg    rax
    vpmovsxbd  ymm0, [masktable_outro + rax]


.loop:
    add      rdi, 32
    vmovups  ymm1, [rdi]    ; Or vmovaps if you want to fault if the address isn't 4B-aligned
    vaddps   ymm1, ymm1, ymm1   ; *= 2.0
    vmovups  [rdi], ymm1
    cmp      rdi, rdx           ; while( (p+=8) < (start+1024-8) )
    jb    .loop        ; 5 fused-domain uops, yuck.

    ; use the outro mask that we generated before the loop for insn scheduling / cache locality reasons.

    vmaskmov ymm1, ymm0, [rdi]
    vaddps     ymm1, ymm1, ymm1   ; *= 2.0
    vmaskmovps [rdi], ymm0, ymm1
    ret


    ; vpcmpeqd ymm1, ymm1, ymm1   ; worse way to invert the mask: dep-chain breaker but still needs an execution unit to make all-ones instead of all-zeros.
    ; vpxor    ymm0, ymm0, ymm1

This does require a load from a table, which can miss in L1 cache, and 15B of table data. (Or 24B if the loop count is also variable, and we have to generate the end mask separately).

Either way, after the 4 instructions to generate the misalignment-count and the aligned start address, getting the mask only takes a single vpmosvsxbd instruction. (The ymm, mem form can't micro-fuse, so it's 2 uops). This requires AVX2.

Without AVX2:

vmovdqu from a 60B table of DWORDs (DD) instead of Bytes (DB). This would actually save an insn relative to AVX2: address & 0x1c is the index, without needing a right-shift by two. The whole table still fits in a cache line, but without room for other constants the algo might use.

Or:

2x vpmovsxbd into two 128b registers ([masktable_intro + rax] and [masktable_intro + rax + 4])
vinsertf128

Or: (more insns, and more shuffle-port pressure, but less load-port pressure, almost certainly worse)

vpmovsxbw into a 128b register
vpunpcklwd / vpunpckhwd into two xmm regs (src1=src2 for both)
vinsertf128

Overhead:

Integer ops: 5 uops at the start to get an index and align the start pointer. 7 uops to get the index for the end mask. Total of 12 GP-register uops beyond simply using unaligned, if the loop element count is a multiple of the vector width.
AVX2: Two 2-fused-domain-uop vector insns to go from [0..7] index in a GP reg to a mask in a YMM reg. (One for the start mask, one for the end mask). Uses a 24B table, accessed in an 8B window with byte granularity.
AVX: Six 1-fused-domain-uop vector insns (three at the start, three at the end). With RIP-relative addressing for the table, four of those instructions will be [base+index] and won't micro-fuse, so an extra two integer insns might be better.

The code inside the loop is replicated 3 times.

TODO: write another answer generating the mask on the fly, maybe as bytes in a 64b reg, then unpacking it to 256b. Maybe with a bit-shift, or BMI2's BZHI(-1, count)?

Camphene answered 16/12, 2015 at 8:16 Comment(4)

On the topic of the unalgined approaches versus the various aligned approaches, it probably (as usual) comes down to your expectations for your input. Often I expect that data will be aligned (e.g., because people are using allocators that align it, or whatever) - but I can't exclude the possibility of unaligned input in a general purpose shared library. In that case, the unaligned approach probably dominates, since you don't actually pay the misaligned penalty of cache-line crossing (you may still pay a penalty for using unaligned mov* instructions on some older CPUs). – Maui 21/6, 2016 at 20:49

On the other hand, if your data is often misaligned, I think the aligning approaches will work well, for moderately sized data, anyway (e.g., 100+ bytes up to perhaps around the size of the L1). For small sizes, the intro/outro overhead may make it less competitive. For large sizes, the unaligned approach may catch up again (but not be better) since higher level cache or memory access may come to dominate the time (but not always for L2 or L3 given good prefetching). – Maui 21/6, 2016 at 20:56

A reasonable approach in some cases, where you expect one behavior to dominate (e.g., usually aligned vs usually unaligned), would be to have both versions, with a quick check at front which selects one or other other. If well predicted, this is close to free. It would be useful for a widely shared library where you don't know the behavior, but believe it is likely that any given user will be primarily be in one regime or the other. Where that would really fail, however, is in a system where say 50% of inputs are aligned. – Maui 21/6, 2016 at 21:0

Unfortunately, that's pretty likely in some systems, because memory allocators often align by 8 (50% chance of being aligned for 16B) and/or 16 (same 50% chance of alignment for 256bit vectors). Of course, you pay a significant cost in code complexity and validation effort (and code size, in case both paths are taken). – Maui 21/6, 2016 at 21:2

C

3

AVX-only: Unaligned accesses at the start/end, pipelining loads to avoid problems when rewriting in place.

Thanks to @StephenCanon for pointing out that this is better than VMASKMOVPS for anything that VMASKMOVPS could do to help with looping over unaligned buffers.

This is maybe a bit much to expect a compiler to do as a loop transformation, esp. since the obvious way can make Valgrind unhappy (see below).

section .text
global floatmul   ; (float *rdi)
floatmul:

    lea    rdx, [rdi + 4096 - 32]  ; one full vector less than the end address (calculated *before* masking for alignment).
                ;;  replace 4096 with rsi*4 if rsi has the count (in floats, not bytes)

    vmovups  ymm0, [rdi]        ; first vector
    vaddps   ymm0, ymm0, ymm0   ; *= 2.0
    ; don't store yet

    lea    rax, [rdi+32]
    and    rax, ~0x1c           ; 0x1c = 7 << 2 = 0b11100  ; clear those bits.
    vmovups  ymm1, [rax]        ; first aligned vector, for use by first loop iteration

    vmovups  [rdi], ymm0        ; store the first unaligned vector
    vmovups  ymm0, [rdx]        ; load the *last* unaligned vector
    
.loop:
      ;; on entry: [rax] is already loaded into ymm1
    vaddps   ymm1, ymm1, ymm1   ; *= 2.0
    vmovups  [rax]              ; vmovaps would fault if p%4 != 0
    add      rax, 32
    vmovups  ymm1, [rax]
    cmp      rax, rdx           ; while( (p+=8) < (endp-8) );
    jb  .loop

    ; discard ymm1.  It includes data from beyond the end of the array (aligned case: same as ymm0)

    vaddps   ymm0, ymm0, ymm0   ; the last 32B, which we loaded before the loop
    vmovups  [rdx], ymm0
    ret

    ;   End alignment:
    ; a[] = XXXX XXXX ABCD E___    _ = garbage past the end
    ;                  ^rdx
    ;       ^rax ^rax ^rax ^rax(loop exit)

    ;   ymm0 = BCDE
    ;   ymm1 loops over ..., XXXX, ABCD, E___
    ;   The last load off the end of the array includes garbage
    ;     because we pipeline the load for the next iteration

Doing a load from the end of the array at the start of the loop seems a little weird, but hopefully it doesn't confuse the hardware prefetchers, or slow down getting the beginning of the array streaming from memory.

Overhead:

2 extra integer uops total (to set up the aligned-start). We're already using the end pointer for the normal loop structure, so that's free.
2 extra copies of the loop body (load/calc/store). (First and last iteration peeled).

Compilers probably won't be happy about emitting code like this, when auto-vectorizing. Valgrind will report accesses outside of array bounds, and does so by single-stepping and decoding instructions to see what they're accessing. So merely staying within the same page (and cache line) as the last element in the array isn't sufficient. Also note that if the input pointer isn't 4B-aligned, we can potentially read into another page and segfault.

To keep Valgrind happy, we could stop the loop two vector widths early, to do the special-case load of the unaligned last vector-width of the array. That would require duplicating the loop body an extra time (insignificant in this example, but it's trivial on purpose.) Or maybe avoid pipelining by having the intro code jump into the middle of the loop. (That may be sub-optimal for the uop-cache, though: (parts of) the loop body may end up in the uop cache twice.)

TODO: write a version that jumps into the loop mid-way.

Camphene answered 16/12, 2015 at 21:27 Comment(4)

I'm curious - why would jumping into the loop mid-way end up with duplicate uops in the cache? I thought every instruction mapped to at most one chunk of uops in the cache. – Maui 21/6, 2016 at 19:59

@BeeOnRope: the uop cache is like a trace cache: it's more like a cache of basic blocks than just decoded instructions, to reduce the impact of taken branches. IDK if this is related, but predicted-not-taken conditional branches can run on more ports than predicted-taken branches, on Haswell and later. Agner Fog's microarch pdf says something about it. I haven't done tests, and have no idea when you'd actually get the same code in different uop cache lines, but Agner says it's possible and can happen in practice. – Camphene 21/6, 2016 at 20:11

That's pretty interesting. I knew the earlier NetBurst cache was more like that, but I thought the uop was closer to a traditional icache (except, of course, that it holds uops). So what if you have something like a switch() with all cases small and fallthru (e.g., a duff's device), compiled to small indirect jumps forward? After the switch, control flow converges, of course. If you had 10 possible targets and were using all it seems like you'd get 10 entries in the uop cache, even including all the common code after the switch (at least until the next branch?). Or is merging possible? – Maui 21/6, 2016 at 20:40

After reading that part of Agner's microarch doc again I think I get it. Each block with a unique entry point (e.g., jmp target) will create a new entry in the uop cache, but every time you pass a 32 byte boundary in the code, it might merge up with the existing entries. So if the code in the switch table is small (in particular, if all jump targets are in the same 32B block), each case is likely to get its own entry, but then all share the entry following the case (and subsequent), once they pass into the next 32B block. So you lose a fair amount, but it's not catastrophic. – Maui 21/6, 2016 at 21:14

AVX-only: Unaligned accesses at the start/end, pipelining loads to avoid problems when rewriting in place.

Overhead:

Load a mask for VMOVMASKPS from a window into a table. AVX2, or AVX1 with a few extra instructions or a larger table.

Without AVX2:

Overhead:

AVX-only: Unaligned accesses at the start/end, pipelining loads to avoid problems when rewriting in place.

Overhead:

Recommended topics

Hot tags