AVX2 byte gather with uint16 indices, into a __m256i

I am trying to pack a __m256i variable with 32 chars from an array, selected by indices. Here is my code:

char array[];         // different array every time.
uint16_t offset[32];  // same offset reused many times


__m256i result = _mm256_set_epi8(
      array[offset[0]],  array[offset[1]],  array[offset[2]],  array[offset[3]],  array[offset[4]],  array[offset[5]],  array[offset[6]],  array[offset[7]],
      array[offset[8]],  array[offset[9]],  array[offset[10]], array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]],
      array[offset[16]], array[offset[17]], array[offset[18]], array[offset[19]], array[offset[20]], array[offset[21]], array[offset[22]], array[offset[23]],
      array[offset[24]], array[offset[25]], array[offset[26]], array[offset[27]], array[offset[28]], array[offset[29]], array[offset[30]], array[offset[31]]);

This code will be called many times with the same offsets but a different array each time. According to my tests it is not optimal, though. Is there any way to improve it?

Warmedover asked 23/10, 2017 at 3:46 Comment(17)
So you need to do a byte gather? Or do you just need a byte shuffle? What types are offset[] and array[]? uint8_t array[]? And is either of offset[] or array[] a compile-time constant?Submit
@Peter, a byte gather. Types of offset[] and array[] are uint16_t and char respectively. None of them is compile-time constant.Warmedover
What compiler are you using currently, with what options? (i.e. what baseline asm code-gen are you starting from for the set_epi8?)Submit
And what CPU(s) do you care about tuning for? When you say "many times", is that enough times to be worth JIT-compiling a gather function for? Even on Skylake, I'm not sure multiple vpgatherdd + byte blends would be better than just scalar integer loads + shifts to build up 32-bit or 64-bit elements before inserting those into vectors. Especially if the load indices could be compile-time constants so you didn't have to spend any instructions loading or unpacking the offsets.Submit
How big is array? Do you get a lot of cache misses doing this?Submit
Is there any pattern between different array inputs? e.g. do you use the same offsets for array and array+32? Are the neighbouring bytes from an offset maybe useful for the next gather? Is it common for two offset indices to take bytes from the same dword element? Or especially for any adjacent bytes in the gather result to actually come from adjacent locations in memory, so a JIT-compiled gather could do a 16-bit or 32-bit load and get multiple bytes in the correct order?Submit
@Peter, compiled by gcc version 4.9.2 (GCC) with options -mavx2 -O2. CPU info: "Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz", without AVX-512 flags. offset is the same for more than about 100 arrays, but it is not a compile-time constant. Maybe we could make use of this, but I am not very familiar with AVX2.Warmedover
Is it common for multiple bytes to come from the same 32B chunk of array? If so, some 32B loads + pshufb could be good.Submit
Ok, the same offset for only about 100 gathers. That might be too few to be worth JIT-compiling the gather to take advantage of special properties of that particular offset[] unless there tend to be a lot of nearby elements (which would make a custom sequence of instructions hard-coding that gather much more efficient than a generic version).Submit
BTW, that's a pretty old gcc version. It might not even be newer than your CPU, so it's probably not as good at optimizing for it as gcc6.4. (Also worth trying clang5.0, and maybe gcc7.2).Submit
Also, you should compile with -march=native if you're making binaries specifically for that server.Submit
I'd bet that "dumb" JIT compiling could certainly work for even 100 calls with the same offset, given that it would cut the number of loads in half since you could just hardcode the values of offset[...]. Of course, your JIT has to be fast to accomplish this: one problem could be that you'd have to generate 8-bit or 32-bit displacements, depending. It might just be faster to always use 32-bit displacements. Of course any kind of JIT is a pretty extreme, non-portable solution, so go down that path only if this really matters.Maudiemaudlin
gcc just generates absolutely terrible code. It reads half of the elements in the offset array, then immediately stores them back (as 64-bit values) on the stack, only to read them back off later in the function. Overall it uses more than twice the number of instructions as clang and icc which take a reasonable approach of reading each offset and then using a memory source vpinsrb. It seems that would simultaneously bottleneck on the 2 read ports and port 5 (shuffle) at 1 element per cycle, so about 32 cycles plus some minor fixed overhead.Maudiemaudlin
I also doubt the 2 reads (array, offset) take a lot of time. Can I avoid reading offset every time, and read it only at the beginning of the 100 calls?Warmedover
@Warmedover - well the first problem is the bad gcc code generation, see my answer for some things that will help. Without generating code at runtime, it's hard to see how you can avoid reading offset every time through. I mean if it was more complicated than an offset you could pre-process your read pattern, and get a list of array offsets that you will read from - but that's what you are starting with here.Maudiemaudlin
@BeeOnRope: with 1k or 10k reuses of offset[], it could be worth invoking a general-purpose JIT like LLVM to generate a sequence of vmovd / vpinsrb instructions. But for only 100, yeah you'd have to code it up yourself. Good point about fixed instruction width, though. vpinsrb with a 2-byte VEX and a [base + disp32] addressing mode is 10 bytes. You could JIT this pattern with an xmm load from offset[], a couple vpshufb to line up the offsets into rel32 slots and zero the rest, then vpor from a template of instructions with [rdi + strict dword 0] addressing modes.Submit
Exactly. The generated code pattern is simple enough that a codegen wouldn't be too hard. However, if the elements had some redundancy then a more general-purpose compiler might help. A general-purpose compiler might also make the code more portable and make porting to ISA extensions trivial.Maudiemaudlin

Let's look first at solutions that work for a general offset that varies with every call (which will be a drop-in replacement for the existing function), and then afterwards we'll see if we can take advantage of the same offset array being used for several calls (while array always varies).

Varying Offset

Well one big problem is that gcc (old or new) just generates awful code for the current implementation of your function:

  lea r10, [rsp+8]
  and rsp, -32
  push QWORD PTR [r10-8]
  push rbp
  mov rbp, rsp
  push r15
  push r14
  push r13
  push r12
  push r10
  push rbx
  sub rsp, 40
  movzx eax, WORD PTR [rsi+40]
  movzx r14d, WORD PTR [rsi+60]
  movzx r12d, WORD PTR [rsi+56]
  movzx ecx, WORD PTR [rsi+44]
  movzx r15d, WORD PTR [rsi+62]
  movzx r13d, WORD PTR [rsi+58]
  mov QWORD PTR [rbp-56], rax
  movzx eax, WORD PTR [rsi+38]
  movzx ebx, WORD PTR [rsi+54]
  movzx r11d, WORD PTR [rsi+52]
  movzx r10d, WORD PTR [rsi+50]
  movzx r9d, WORD PTR [rsi+48]
  movzx r8d, WORD PTR [rsi+46]
  mov QWORD PTR [rbp-64], rax
  movzx eax, WORD PTR [rsi+36]
  movzx edx, WORD PTR [rsi+42]
  mov QWORD PTR [rbp-72], rax
  movzx eax, WORD PTR [rsi+34]
  mov QWORD PTR [rbp-80], rax
  movzx eax, WORD PTR [rsi+32]
  mov QWORD PTR [rbp-88], rax
  movzx eax, WORD PTR [rsi+30]
  movzx r15d, BYTE PTR [rdi+r15]
  mov QWORD PTR [rbp-96], rax
  movzx eax, WORD PTR [rsi+28]
  vmovd xmm2, r15d
  vpinsrb xmm2, xmm2, BYTE PTR [rdi+r14], 1
  mov QWORD PTR [rbp-104], rax
  movzx eax, WORD PTR [rsi+26]
  mov QWORD PTR [rbp-112], rax
  movzx eax, WORD PTR [rsi+24]
  mov QWORD PTR [rbp-120], rax
  movzx eax, WORD PTR [rsi+22]
  mov QWORD PTR [rbp-128], rax
  movzx eax, WORD PTR [rsi+20]
  mov QWORD PTR [rbp-136], rax
  movzx eax, WORD PTR [rsi+18]
  mov QWORD PTR [rbp-144], rax
  movzx eax, WORD PTR [rsi+16]
  mov QWORD PTR [rbp-152], rax
  movzx eax, WORD PTR [rsi+14]
  mov QWORD PTR [rbp-160], rax
  movzx eax, WORD PTR [rsi+12]
  mov QWORD PTR [rbp-168], rax
  movzx eax, WORD PTR [rsi+10]
  mov QWORD PTR [rbp-176], rax
  movzx eax, WORD PTR [rsi+8]
  mov QWORD PTR [rbp-184], rax
  movzx eax, WORD PTR [rsi+6]
  mov QWORD PTR [rbp-192], rax
  movzx eax, WORD PTR [rsi+4]
  mov QWORD PTR [rbp-200], rax
  movzx eax, WORD PTR [rsi+2]
  movzx esi, WORD PTR [rsi]
  movzx r13d, BYTE PTR [rdi+r13]
  movzx r8d, BYTE PTR [rdi+r8]
  movzx edx, BYTE PTR [rdi+rdx]
  movzx ebx, BYTE PTR [rdi+rbx]
  movzx r10d, BYTE PTR [rdi+r10]
  vmovd xmm7, r13d
  vmovd xmm1, r8d
  vpinsrb xmm1, xmm1, BYTE PTR [rdi+rcx], 1
  mov rcx, QWORD PTR [rbp-56]
  vmovd xmm5, edx
  vmovd xmm3, ebx
  mov rbx, QWORD PTR [rbp-72]
  vmovd xmm6, r10d
  vpinsrb xmm7, xmm7, BYTE PTR [rdi+r12], 1
  vpinsrb xmm5, xmm5, BYTE PTR [rdi+rcx], 1
  mov rcx, QWORD PTR [rbp-64]
  vpinsrb xmm6, xmm6, BYTE PTR [rdi+r9], 1
  vpinsrb xmm3, xmm3, BYTE PTR [rdi+r11], 1
  vpunpcklwd xmm2, xmm2, xmm7
  movzx edx, BYTE PTR [rdi+rcx]
  mov rcx, QWORD PTR [rbp-80]
  vpunpcklwd xmm1, xmm1, xmm5
  vpunpcklwd xmm3, xmm3, xmm6
  vmovd xmm0, edx
  movzx edx, BYTE PTR [rdi+rcx]
  mov rcx, QWORD PTR [rbp-96]
  vpunpckldq xmm2, xmm2, xmm3
  vpinsrb xmm0, xmm0, BYTE PTR [rdi+rbx], 1
  mov rbx, QWORD PTR [rbp-88]
  vmovd xmm4, edx
  movzx edx, BYTE PTR [rdi+rcx]
  mov rcx, QWORD PTR [rbp-112]
  vpinsrb xmm4, xmm4, BYTE PTR [rdi+rbx], 1
  mov rbx, QWORD PTR [rbp-104]
  vpunpcklwd xmm0, xmm0, xmm4
  vpunpckldq xmm0, xmm1, xmm0
  vmovd xmm1, edx
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm1, xmm1, BYTE PTR [rdi+rbx], 1
  mov rcx, QWORD PTR [rbp-128]
  mov rbx, QWORD PTR [rbp-120]
  vpunpcklqdq xmm0, xmm2, xmm0
  vmovd xmm8, edx
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm8, xmm8, BYTE PTR [rdi+rbx], 1
  mov rcx, QWORD PTR [rbp-144]
  mov rbx, QWORD PTR [rbp-136]
  vmovd xmm4, edx
  vpunpcklwd xmm1, xmm1, xmm8
  vpinsrb xmm4, xmm4, BYTE PTR [rdi+rbx], 1
  movzx edx, BYTE PTR [rdi+rcx]
  mov rbx, QWORD PTR [rbp-152]
  mov rcx, QWORD PTR [rbp-160]
  vmovd xmm7, edx
  movzx eax, BYTE PTR [rdi+rax]
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm7, xmm7, BYTE PTR [rdi+rbx], 1
  mov rcx, QWORD PTR [rbp-176]
  mov rbx, QWORD PTR [rbp-168]
  vmovd xmm5, eax
  vmovd xmm2, edx
  vpinsrb xmm5, xmm5, BYTE PTR [rdi+rsi], 1
  vpunpcklwd xmm4, xmm4, xmm7
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm2, xmm2, BYTE PTR [rdi+rbx], 1
  vpunpckldq xmm1, xmm1, xmm4
  mov rbx, QWORD PTR [rbp-184]
  mov rcx, QWORD PTR [rbp-192]
  vmovd xmm6, edx
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm6, xmm6, BYTE PTR [rdi+rbx], 1
  mov rbx, QWORD PTR [rbp-200]
  vmovd xmm3, edx
  vpunpcklwd xmm2, xmm2, xmm6
  vpinsrb xmm3, xmm3, BYTE PTR [rdi+rbx], 1
  add rsp, 40
  vpunpcklwd xmm3, xmm3, xmm5
  vpunpckldq xmm2, xmm2, xmm3
  pop rbx
  pop r10
  vpunpcklqdq xmm1, xmm1, xmm2
  pop r12
  pop r13
  vinserti128 ymm0, ymm0, xmm1, 0x1
  pop r14
  pop r15
  pop rbp
  lea rsp, [r10-8]
  ret

Basically it's trying to do all the reads of offset up front and just runs out of registers, so it starts spilling a few, and then goes on an orgy of spilling where it reads most of the 16-bit elements of offset and immediately stores them (as 64-bit zero-extended values) onto the stack. Essentially it's copying most of the offset array (with zero-extension to 64 bits) for no purpose: where it later reads the spilled values it could, of course, have just read from offset instead.

This same terrible code is evident in the old 4.9.2 version you're using as well as the very recent 7.2.


Neither icc nor clang has any such issue - they both generate almost identical, quite reasonable code that reads each offset position once using movzx and then inserts the byte using vpinsrb with a memory source operand based on the offset just read:

gather256(char*, unsigned short*): # @gather256(char*, unsigned short*)
  movzx eax, word ptr [rsi + 30]
  movzx eax, byte ptr [rdi + rax]
  vmovd xmm0, eax
  movzx eax, word ptr [rsi + 28]
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 1
  movzx eax, word ptr [rsi + 26]
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 2
  movzx eax, word ptr [rsi + 24]
  ...
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 14
  movzx eax, word ptr [rsi]
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 15
  movzx eax, word ptr [rsi + 62]
  movzx eax, byte ptr [rdi + rax]
  vmovd xmm1, eax
  movzx eax, word ptr [rsi + 60]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 1
  movzx eax, word ptr [rsi + 58]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 2
  movzx eax, word ptr [rsi + 56]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 3
  movzx eax, word ptr [rsi + 54]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 4
  movzx eax, word ptr [rsi + 52]
  ...
  movzx eax, word ptr [rsi + 32]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 15
  vinserti128 ymm0, ymm1, xmm0, 1
  ret

Very nice. There is a small amount of additional overhead from the vinserti128 needed to combine the two xmm vectors, each holding half of the result, apparently because vpinsrb can't write to the high 128 bits. It seems that on modern uarchs like the one you are using this would simultaneously bottleneck on the 2 read ports and on port 5 (shuffle) at 1 element per cycle. So this will probably have a throughput of about 1 per 32 cycles, and a latency close to 32 cycles (the main dependency chain is through the work-in-progress xmm register that is receiving the vpinsrb results, but the listed latency for the memory-source version of this instruction is only 1 cycle¹).

Can we get close to this 32-cycle performance with gcc? It seems so. Here's one approach:

#include <stdint.h>
#include <immintrin.h>

uint64_t gather64(char *array, uint16_t *offset) {
  uint64_t ret;
  char *p = (char *)&ret;
  p[0] = array[offset[0]];
  p[1] = array[offset[1]];
  p[2] = array[offset[2]];
  p[3] = array[offset[3]];
  p[4] = array[offset[4]];
  p[5] = array[offset[5]];
  p[6] = array[offset[6]];
  p[7] = array[offset[7]];
  return ret;
}

__m256i gather256_gcc(char *array, uint16_t *offset) {

  return _mm256_set_epi64x(
    gather64(array, offset),
    gather64(array +  8, offset + 8),
    gather64(array + 16, offset + 16),
    gather64(array + 24, offset + 24)
  );
}

Here we rely on a temporary buffer on the stack to gather 8 elements from array at a time, and then use each 64-bit result as an input to _mm256_set_epi64x. Overall this uses 2 loads and 1 store per byte element, plus a couple of extra instructions for every 64-bit group, so it should be close to 1 cycle per element throughput².

It generates the "expected" inlined code in gcc:

gather256_gcc(char*, unsigned short*):
  lea r10, [rsp+8]
  and rsp, -32
  push QWORD PTR [r10-8]
  push rbp
  mov rbp, rsp
  push r10
  movzx eax, WORD PTR [rsi+48]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-24], al
  movzx eax, WORD PTR [rsi+50]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-23], al
  movzx eax, WORD PTR [rsi+52]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-22], al
  ...
  movzx eax, WORD PTR [rsi+62]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-17], al
  movzx eax, WORD PTR [rsi+32]
  vmovq xmm0, QWORD PTR [rbp-24]
  movzx eax, BYTE PTR [rdi+16+rax]
  movzx edx, WORD PTR [rsi+16]
  mov BYTE PTR [rbp-24], al
  movzx eax, WORD PTR [rsi+34]
  movzx edx, BYTE PTR [rdi+8+rdx]
  movzx eax, BYTE PTR [rdi+16+rax]
  mov BYTE PTR [rbp-23], al
  ...
  movzx eax, WORD PTR [rsi+46]
  movzx eax, BYTE PTR [rdi+16+rax]
  mov BYTE PTR [rbp-17], al
  mov rax, QWORD PTR [rbp-24]
  mov BYTE PTR [rbp-24], dl
  movzx edx, WORD PTR [rsi+18]
  vpinsrq xmm0, xmm0, rax, 1
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-23], dl
  movzx edx, WORD PTR [rsi+20]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-22], dl
  movzx edx, WORD PTR [rsi+22]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-21], dl
  movzx edx, WORD PTR [rsi+24]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-20], dl
  movzx edx, WORD PTR [rsi+26]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-19], dl
  movzx edx, WORD PTR [rsi+28]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-18], dl
  movzx edx, WORD PTR [rsi+30]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-17], dl
  movzx edx, WORD PTR [rsi]
  vmovq xmm1, QWORD PTR [rbp-24]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-24], dl
  movzx edx, WORD PTR [rsi+2]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-23], dl
  movzx edx, WORD PTR [rsi+4]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-22], dl
  ...
  movzx edx, WORD PTR [rsi+12]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-18], dl
  movzx edx, WORD PTR [rsi+14]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-17], dl
  vpinsrq xmm1, xmm1, QWORD PTR [rbp-24], 1
  vinserti128 ymm0, ymm0, xmm1, 0x1
  pop r10
  pop rbp
  lea rsp, [r10-8]
  ret

This approach will suffer 4 (non-dependent) store-forwarding stalls when reading back the stack buffers, which will make the latency somewhat worse than 32 cycles, perhaps in the mid-40s (if you assume it's only the last stall that isn't hidden). You could also just remove the gather64 function and unroll the whole thing into a single 32-byte buffer, with one load at the end. That results in only one stall and gets rid of the small overhead of loading each 64-bit value into the result one at a time, but the overall effect might be worse, since larger loads seem to sometimes suffer larger forwarding stalls.
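Something like the following, untested sketch shows that single-buffer variant (gather256_onebuf is an invented name; whether or not the compiler fully unrolls the loop, the pattern is 32 scalar byte stores followed by one 32-byte reload):

#include <stdint.h>
#include <immintrin.h>

__m256i gather256_onebuf(const char *array, const uint16_t *offset) {
  char buf[32];
  for (int i = 0; i < 32; i++)    // 2 loads + 1 byte store per element
    buf[i] = array[offset[i]];
  // One 32-byte reload of the just-written buffer: a single (larger)
  // store-forwarding stall instead of four 8-byte ones.
  return _mm256_loadu_si256((const __m256i *)buf);
}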

I'm quite sure you can come up with approaches that are better. For example, you could simply write out, "long hand" in intrinsics, the vpinsrb approach that clang and icc use. That's simple enough that gcc should get it right.
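As a rough, untested sketch of what that might look like (the helper names gather128 and gather256_pinsrb are invented, and this puts offset[0] in the low byte of the result rather than the high byte as the question's _mm256_set_epi8 argument order does - adjust the mapping to taste):

#include <stdint.h>
#include <immintrin.h>

/* One 128-bit half: a vmovd for byte 0 followed by 15 vpinsrb, mirroring the
   code pattern clang and icc generate. */
static inline __m128i gather128(const char *array, const uint16_t *offset) {
  __m128i v = _mm_cvtsi32_si128((unsigned char)array[offset[0]]);
  v = _mm_insert_epi8(v, array[offset[1]],  1);
  v = _mm_insert_epi8(v, array[offset[2]],  2);
  v = _mm_insert_epi8(v, array[offset[3]],  3);
  v = _mm_insert_epi8(v, array[offset[4]],  4);
  v = _mm_insert_epi8(v, array[offset[5]],  5);
  v = _mm_insert_epi8(v, array[offset[6]],  6);
  v = _mm_insert_epi8(v, array[offset[7]],  7);
  v = _mm_insert_epi8(v, array[offset[8]],  8);
  v = _mm_insert_epi8(v, array[offset[9]],  9);
  v = _mm_insert_epi8(v, array[offset[10]], 10);
  v = _mm_insert_epi8(v, array[offset[11]], 11);
  v = _mm_insert_epi8(v, array[offset[12]], 12);
  v = _mm_insert_epi8(v, array[offset[13]], 13);
  v = _mm_insert_epi8(v, array[offset[14]], 14);
  v = _mm_insert_epi8(v, array[offset[15]], 15);
  return v;
}

__m256i gather256_pinsrb(const char *array, const uint16_t *offset) {
  __m128i lo = gather128(array, offset);       /* result bytes 0..15  */
  __m128i hi = gather128(array, offset + 16);  /* result bytes 16..31 */
  return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}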


Repeated Offset

What about if the offset array is used repeatedly for several different array inputs?

We can look at pre-processing the offset array so that our core load loop can be faster.

One viable approach is to use vpgatherdd to load elements efficiently without bottlenecking on port 5 for the shuffles. We can also load an entire gather index vector in a single 256-bit load. Unfortunately, the finest-grained member of the vpgather family is vpgatherdd, which loads 8 32-bit elements using 32-bit offsets. So we'll need 4 of these gathers to get all 32 byte elements, and then we need to blend the resulting vectors together somehow.

We can avoid most of the cost of combining the resulting vectors by interleaving and adjusting the offsets so that the "target" byte in each 32-bit value is already in its correct final position. So you end up with 4 256-bit vectors, each containing 8 bytes that you want, in the correct positions, and 24 bytes you don't want. You can vpblendw two pairs of vectors together, and then vpblendvb those results together, for a total of 3 port 5 uops (there's got to be a better way to do this reduction?).
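Here's an untested sketch of that idea in intrinsics, under the following assumptions (none of which come from the original post): gather g (g = 0..3) handles result bytes 4*j + g, each index is biased by -g so the wanted byte lands at byte g of its dword, the function names are invented, and result byte k ends up as array[offset[k]] with k = 0 in the low byte. Note the caveats in the comments about underflow and reading a few bytes past the indexed positions:

#include <stdint.h>
#include <immintrin.h>

/* Pre-processing (done once per offset[]): build 4 dword-index vectors.
   Gather g covers result bytes 4*j + g; biasing the index by -g puts the
   wanted byte at position g within its gathered dword.
   Caveats: offset[i] must be >= g (no underflow), and each gather reads up to
   3 bytes beyond the biased position, so array needs a little slack. */
static void preprocess_offsets(const uint16_t *offset, __m256i idx[4]) {
  int32_t tmp[4][8];
  for (int g = 0; g < 4; g++)
    for (int j = 0; j < 8; j++)
      tmp[g][j] = (int32_t)offset[4 * j + g] - g;
  for (int g = 0; g < 4; g++)
    idx[g] = _mm256_loadu_si256((const __m256i *)tmp[g]);
}

/* Per-array work: 4 vpgatherdd, 2 vpblendw, 1 vpblendvb. */
static __m256i gather256_gatherdd(const char *array, const __m256i idx[4]) {
  const int *base = (const int *)array;
  __m256i g0 = _mm256_i32gather_epi32(base, idx[0], 1); /* wanted byte at dword byte 0 */
  __m256i g1 = _mm256_i32gather_epi32(base, idx[1], 1); /* ... at byte 1 */
  __m256i g2 = _mm256_i32gather_epi32(base, idx[2], 1); /* ... at byte 2 */
  __m256i g3 = _mm256_i32gather_epi32(base, idx[3], 1); /* ... at byte 3 */
  /* Word blends: low word of each dword from g0/g1, high word from g2/g3. */
  __m256i even = _mm256_blend_epi16(g0, g2, 0xAA); /* bytes 0 and 2 of each dword correct */
  __m256i odd  = _mm256_blend_epi16(g1, g3, 0xAA); /* bytes 1 and 3 of each dword correct */
  /* Byte blend: take the odd bytes from 'odd', the even bytes from 'even'. */
  return _mm256_blendv_epi8(even, odd, _mm256_set1_epi16((short)0xFF00));
}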

Adding it all together, I get something like:

  • 4 movups to load the 4 vpgatherdd index regs (can be hoisted)
  • 4 vpgatherdd
  • 2 vpblendw (4 results -> 2)
  • 1 movups to load the vpblendvb mask (can be hoisted)
  • 1 vpblendvb (2 results -> 1)

Apart from the vpgatherdds it looks like about 9 uops, with 3 of them going to port 5, so 3 cycles bottlenecked on that port, or about 2.25 cycles if there is no bottleneck (since the vpgatherdd uops might not use port 5). On Broadwell, the vpgather family is much improved over Haswell, but it still takes about 0.9 cycles per element for vpgatherdd, so that's about 29 cycles right there. So we are right back where we started, around 32 cycles.

Still, there is some hope:

  • The 0.9 cycles per element is for mostly pure vpgatherdd activity. Perhaps then the blending code is more or less free, and we are around 29 cycles (realistically, the movups will still be competing with the gather, however).
  • vpgatherdd got a lot better again in Skylake, to about 0.6 cycles per element, so this strategy will start to help significantly when you upgrade your hardware to Skylake. (And the strategy may pull slightly farther ahead of vpinsrb with AVX512BW, where byte blends with a k-register mask are efficient, and vpgatherdd zmm per-element gather throughput is slightly higher than ymm (InstLatx64).)
  • Pre-processing gives you the chance to check whether duplicate elements are being read from array. In that case, you could potentially reduce the number of gathers. For example, if only half of the elements in offset are unique, you can do only two gathers to collect the 16 unique elements and then pshufb the register to duplicate elements as needed. The "reduction" has to be more general, but it doesn't actually seem more expensive (and could be cheaper) since pshufb is quite general and does most of the work.

Expanding on that last idea: you would dispatch at runtime to a routine that knows how to do 1, 2, 3 or 4 gathers depending on how many elements are needed. That is fairly quantized, but you could always dispatch in a more fine-grained way with scalar loads (or gathers with larger elements, which are faster) between those cutoff points. You'll hit diminishing returns pretty quickly.
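As a structural sketch of that dispatch (all names here - gather_kernel, gather_plan, make_plan, and so on - are invented, the kernels themselves are assumed to exist elsewhere, and building the deduplicated index vectors is omitted), the pre-processing could bake both the indices and the chosen kernel into a small plan that's reused for every array:

#include <stdint.h>
#include <immintrin.h>

typedef __m256i (*gather_kernel)(const char *array, const __m256i *idx);

typedef struct {
  __m256i idx[4];        /* pre-adjusted vpgatherdd indices (filling them in is omitted here) */
  gather_kernel kernel;  /* does 1..4 gathers plus the matching reduction                     */
} gather_plan;

static int count_unique(const uint16_t *offset, int n) {
  int unique = 0;
  for (int i = 0; i < n; i++) {       /* O(n^2) is fine: n == 32, run once per plan */
    int seen = 0;
    for (int j = 0; j < i; j++)
      if (offset[j] == offset[i]) seen = 1;
    unique += !seen;
  }
  return unique;
}

/* kernels[g-1] is assumed to perform g gathers. */
static void make_plan(const uint16_t *offset, const gather_kernel kernels[4],
                      gather_plan *plan) {
  int n_gathers = (count_unique(offset, 32) + 7) / 8;  /* 8 elements per vpgatherdd */
  plan->kernel = kernels[n_gathers - 1];
}

/* Hot path, called ~100 times per plan. */
static inline __m256i gather256_planned(const char *array, const gather_plan *plan) {
  return plan->kernel(array, plan->idx);
}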

You can even extend that to handling nearby elements - after all, you are grabbing 4 bytes to get a byte: so if any of those 3 wasted bytes is actually at another used offset value, then you get it nearly for free. Now, this needs an even more general reduction phase but it still seems like pshufb will do the heavy lifting and most of the hard work is limited to the pre-processing.


¹ This is one of a handful of SSE/AVX instructions where the memory-source form of the instruction is quite a bit more efficient than the reg-reg form: the reg-reg form needs 2 uops on port 5, which limits it to a throughput of 0.5 per cycle and gives it a latency of 2. Evidently the memory load path avoids one of the shuffles/blends that are needed on port 5. vpbroadcastd/q are like that too.

² With two loads and one store per cycle, this will be running very close to the ragged edge of the maximum theoretical performance: it's maxing out the L1 operation throughput, which often results in hiccups: for example, there may not be any spare cycles to accept incoming cache lines from L2.

Maudiemaudlin answered 23/10, 2017 at 8:25 Comment(24)
If load/store uop throughput is a bottleneck, you could do wider loads of offset and movzx / shift to get indices. Unfortunately BMI2 bextr is 2 uops, otherwise it would be perfect. e.g. mov eax, [offset+4] / movzx ecx, ax / (use rcx as an index) / shr eax, 16 / (use rax as an index). So that's 1 load and 2 integer ALU uops instead of 2 movzx load uops. You can extend it to 64-bit loads. You should get something like that from a C compiler if you declare offset as uint32_t * and cast + shift the load result. Or use a union of an array and a uint32.Submit
Yup, I looked at bextr, but it's too slow as you point out (it would be nice to have an immediate version that used a memory operand as the source and was 1 fused uop). I was writing up a version that did 64-bit loads from offset and ALU to get the bytes when I decided that vpgather was the natural extension of this idea.Maudiemaudlin
Also worth considering (but you won't get gcc or clang to generate this asm for you): use integer loads to assemble dwords or qwords before inserting into vectors, to bypass the port5 bottleneck. Assuming Broadwell partial regs are the same as HSW/SKL, mov al, [rdi + rcx] is 1 micro-fused ALU uop (for any port, IIRC) that merges into RAX. movzx eax, [rdi + rcx] / shl eax, 8 / mov al, [rdi + rdx] / shl eax, 8 / ... / vmovq xmm0, rax.Submit
Writing al and then ah before shifting eax by 16 would trigger an extra merge uop, though, and it might have to issue in a cycle by itself. This is kind of the opposite problem of one I've looked at before for Sandybridge, of using byte indices to gather 16-bit elements (for GaloisField 16-bit multiplies in error-correction codes). SnB has 2 xmm integer shuffle units, so pinsrw throughput is 2 per clock. But anyway, movzx from al and ah was good, and gcc didn't do that. (See https://mcmap.net/q/14433/-write-x86-asm-functions-portably-win-linux-osx-without-a-build-depend-on-yasm-nasm/224132 comments for some links to code and a gcc bug).Submit
It's all adding up to just about too many uops though. The movzx/shr "extraction" of the index is taking about two uops per element, and the mov al / shl eax, 8 "assembly" is another two uops, so you are at 4 uops already. There are a few uops saved at the edges (e.g., after the final shift during extraction the top bits are zero so you don't need a movzx), but those are mostly canceled out by the uops needed per qword (e.g., loading the offset qword for extraction, and the vmovq to insert it in the vector and vector permutes to move it around).Maudiemaudlin
AMD has efficient BEXTR. PowerPC and ARM have efficient bit-field extract (for immediate fields). PPC's is quite powerful: #46870205: rotate and then zero all bits other than [start .. end], so it can scale for free. Too bad x86 doesn't have something like that.Submit
I had thought also about the mov al, [rcx]; mov ah, [rdx]; shl 16; approach to the "assembly" but I recalled your post about ah renaming and the extra uop, which kind of wipes it out (I didn't know it would have to go in its own cycle though, that's much worse). Still you could perhaps combine these "scalar tricks" in parallel with the pinsrb approach, which is uop light and port 5 heavy to get to a place where you are balanced between uop pressure and port 5 pressure.Maudiemaudlin
@PeterCordes - FWIW you can use pext on Intel as a reasonable alternative to bextr, which does the shift & extract & copy in a single uop. Latency of 3 but it's off the critical path. Now you need to load the various 0xFFFF0000 masks (2?), but it can be hoisted if you are looping on this.Maudiemaudlin
Yes, front-end throughput is a problem. Probably a hybrid strategy would be good, where you do some elements one way and others another way. You could also consider doing 2x vpgatherdd and word-blending, then fill in odd bytes with vpinsrb. Cache-line splits from gather elements might hurt that strategy, though, and especially 4k splits will suck on pre-SKL.Submit
Yes, that's what I mean about combining the "scalar tricks" with pinsrb above. There just isn't that much room though: the vector approach is already at ~3 uops, so let's say you do 16 elements that way (16*3 = 48 uops), you have 16 uops to play with. If we squeeze the scalar stuff into 4 uops, that will take 12 cycles (4 cycles saved due to the "free" 16 uops) and you are at 28 cycles. Well, we saved maybe two cycles and it's not nothing! Ignoring some edge effects on both sides (such as the vmovq) as they seem to almost cancel out. Of course some ALU ops will steal port 5...Maudiemaudlin
The best strategy might depend on the surrounding code. If it's front-end bottlenecked, it's useful to start issuing it before the vector is ready. If it also contains some loads or stores, then a gather strategy that spends some ALU uops on offset to reduce load port pressure is nice. BTW, I just tested movzx eax / ah / shl eax, 16 / al / ah / movd xmm0, eax, and the throughput is bad (almost half 3x shl eax, 8 and writing only al), so I think SKL does still take a whole issue cycle for the AH merging uop.Submit
"It depends on the surrounding code" is probably a good bet for any of these "no loop questions". You should update your other post with the detail that the merge (may?) take a whole cycle. What port does the merge go on? The optimal looks like about 3/4 vector and 1/4 scalar, which might et you down to 26 cycles. I would be worried about port 5, but since it isn't a look maybe it can be scheduled OK through trial and error (although I still don't know exactly how that works).Maudiemaudlin
vpblendd y,y,m,i is interesting for the assembly phase: it's only p015 p23 (not micro-fused, unfortunately). You could use that 8 times to get 8 bytes into position, each within a DWORD element - since just changing the offset on the memory operand is "free". Same uop cost as pinsrb but no port 5 bottleneck. Then you combine the vectors with 2 vpblendw and a vpand and vpor, or a byte-blend, or something like that? Practically none of these blend/shuffle ops are micro-fused, which kind of sucks and was news to me.Maudiemaudlin
I think cache-line splits are going to hurt you using 32B loads to get single bytes. Even without that, I've seen a case where even a 64-bit memory operand was slightly slower than a 32-bit memory operand when approaching throughput limits. (That was a weird case where SKL was able to sustain 2 loads + 1 store every clock (from the same addresses every time), which it's supposedly not able to do for vector loads/stores.) So I'd be worried about the throughput impact of using 32B loads even without CL splits.Submit
Right, and there is the little matter of reading out of bounds. Still, cache-line splits aren't that bad: they still have a throughput of 1 per cycle and only occur with a chance of about 5% for randomly distributed offsets (if they are not randomly distributed you can probably adjust your algorithm to make them a bit rarer). Page crossings will suck too, as you mentioned earlier. If you were pre-processing, you could mostly avoid cache-line crossings by dispatching to a few variants (e.g., varying where the byte ends up in the DWORD and picking the one that minimizes crossings).Maudiemaudlin
I wouldn't read too much (yet) into the 64-bit v 32-bit behavior on the linked question: the add will encode with an extra byte when it's 64-bit and we already know that instruction position affects the uop scheduling. Can't it have just been an extra conflict in that case?Maudiemaudlin
Besides using up two cache-read slots, there are limited split-load buffers. (There's a ld_blocks.no_sr perf counter for "[The number of times that split load operations are temporarily blocked because all resources for handling the split accesses are in use]".) Also, I just tried that 32 / 64 test again, and I can repro it with the instruction size staying the same. (REX.W instead of SIB by using [rsi] instead of [rsp] as the source. Or REX.W=0.) Also, the original test was running from the loop buffer, before the SKL microcode update, and the loop top is 32B-aligned.Submit
Right, but I mean instruction position affects scheduling even in the LSD and uop caches. Still the [rsi] test is pretty convincing. Maybe it's something like a bank conflict (but it seems "too small" to be the usual type of conflict).Maudiemaudlin
I tried again, this time with all pointers in the same cache line, and with the write pointer in the next cache line after the two reads (from different 32B chunks of the cache line before the write). No change in results, so it's not 4k aliasing or bank conflicts. Changing the other two instructions to 64-bit operand size makes it even slower, up to 119.99M cycles per 100M iterations. I thought it might help memory disambiguation to have all operands the same size. BTW, I bet it wouldn't be a problem with no stores in the loop. Or maybe just with more unrolling.Submit
gcc code-gen stupidity reported as gcc.gnu.org/bugzilla/show_bug.cgi?id=82731.Submit
@PeterCordes - weird, I couldn't reproduce your results (which were themselves weird, so I don't know what's weirder). I used the loops here, they vary only in the add edx, [rsp] line. I got 1.00 cycles per iteration for both cases. You can run the test with ./uarch-bench.sh --timer=libpfc --test-name=add-loop. I will look more into it later. I'm on an i7-6700HQ.Maudiemaudlin
I copied your test loop into mine (including the setup with pointers to stack memory instead of BSS), and I was still able to repro it with perf stat on my i7-6700k. With 100M iters, 32-bit operand-size for everything takes 100.3 Mclocks. 64-bit for the add rdx, [rsp] slows it down almost exactly as much as 64-bit for the store, to 112.3 Mcycles. (Same result scaling up to 1G iters). machine-clears were negligible. (It is possible to get memory-ordering machine-clears in single-threaded code; I have a test-case for 32-bit passing stack args to a short function in a loop)Submit
Maybe a hardware difference @Peter. Can you share your test code? Note that I'm measuring the minimum time over 33 trials, 100k loop iters, so that part is different.Maudiemaudlin
@PeterCordes - really interesting results. I changed my loop count to 100m from 100k and immediately reproduced your results. With a low iteration count they both run at 1.00 cycles, and somewhere around 1m to 2m iterations the 64-bit version starts to become unstable, often running at 1.12, but at many other values like 1.04 and 1.03 as well. By 100m it's always running at 1.12 cycles.Maudiemaudlin
