Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, I don't think it is the best option.

Edit: best/optimal in terms of speed/cycle reduction.

Botha answered 7/2, 2020 at 7:8 Comment(4)
This seems like a micro-optimization. How much better do you expect to do? Are you optimizing for code size, number of instructions, throughput, latency? An instruction you could look at is the AVX2 VPHADDD (horizontal add doublewords from 256-bit vectors) but you can't cheat your way out of the permute and add uops this instruction expands to.Excerpt
@IwillnotexistIdonotexist: you never want VPHADDD for efficient horizontal sums. See Fastest way to do horizontal float vector sum on x86 (my answer also includes some integer versions). You just want to successively narrow until you're down to 1 element, with extract high lane and then shuffles within a 128-bit vector. All your shuffles can take immediate control operands, not vectors for vpermd. e.g. VEXTRACTI32x8, vextracti128, and vpshufd.Eusporangiate
@PeterCordes OP's question is so broad I don't know what we're tuning for. What's "optimal"/"best"? I did highlight that VPHADDD inescapably generates uops you'd have seen anyways with a vpadd/vpermd-type solution, but at least it's a single instruction (code size). If throughput is the target, then perhaps batching 8 of these jobs, transposing 8x8 in 24 permute instructions and using 7 vertical adds might work best (helps if your CPU has two shuffle units, then you can expect to finish the batch of 8 in about 12+4=16cc=amortized 2cc, else it's 24+4=28cc=amortized 3.5cc, still quite good).Excerpt
@IwillnotexistIdonotexist: All CPUs that support AVX512 and vphadd implement it as 2 shuffle uops + 1 vertical add uop. (AMD before Zen2 decodes it extra inefficiently for the ymm case, as 8 total uops not 6. Or 4 uops for the xmm version.) So yes, transpose and hadd is a use-case for vphadd. You didn't mention that it decodes to more shuffles than necessary for an hsum of one vector, and using it that way is a common missed optimization. (Or code-size vs. speed tradeoff). Over-use of phadd is a common mistake.Eusporangiate

Related: if you're looking for the non-existent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much more efficient than shuffling.

Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel's reduce_add helper function. reduce_add doesn't necessarily compile optimally anyway with AVX512.


There is an int _mm512_reduce_add_epi32(__m512i) inline function in immintrin.h. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid or sink out of loops whenever possible.
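
A minimal sketch of such a wrapper (the function name is mine; the intrinsic itself is the real one from immintrin.h):

#include <immintrin.h>

// Thin wrapper around the helper so the generated asm is easy to inspect.
int reduce_add_zmm(__m512i v)
{
    return _mm512_reduce_add_epi32(v);
}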

GCC 9.2 -O3 -march=skylake-avx512 compiles a wrapper that calls it as follows:

        vextracti64x4   ymm1, zmm0, 0x1
        vpaddd  ymm1, ymm1, ymm0
        vextracti64x2   xmm0, ymm1, 0x1   # silly compiler, vextracti128 would be shorter
        vpaddd  xmm1, xmm0, xmm1
        vpshufd xmm0, xmm1, 78
        vpaddd  xmm0, xmm0, xmm1

        vmovd   edx, xmm0
        vpextrd eax, xmm0, 1              # 2x xmm->integer to feed scalar add.
        add     eax, edx
        ret

Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd.

Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd. See below for perf analysis of the two.


There is a VPHADDD but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd is still only in-lane.)
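
As a sketch of that multi-vector use-case, here's a 128-bit version (which sidesteps the in-lane problem); the helper and its name are my own illustration using the real SSSE3 _mm_hadd_epi32 intrinsic:

#include <tmmintrin.h>   // SSSE3: _mm_hadd_epi32

// Horizontal sums of four __m128i vectors at once, one result per output element.
// Every phadd here has two different inputs, so none of its shuffle uops are wasted.
static inline __m128i hsum_4_vectors(__m128i a, __m128i b, __m128i c, __m128i d)
{
    __m128i ab = _mm_hadd_epi32(a, b);   // a0+a1, a2+a3, b0+b1, b2+b3
    __m128i cd = _mm_hadd_epi32(c, d);   // c0+c1, c2+c3, d0+d1, d2+d3
    return _mm_hadd_epi32(ab, cd);       // sum(a), sum(b), sum(c), sum(d)
}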

Yes, you need log2(vector_width) shuffles and vpaddd instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).
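
For example, here's a sketch of what that looks like when summing an array (my example, not from the question): the accumulator stays vertical inside the loop, and hsum_8x32 (defined later in this answer) runs once at the end. It assumes n is a multiple of 8, with no cleanup loop.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

uint32_t hsum_8x32(__m256i v);            // defined later in this answer

// Sum an array of uint32_t with AVX2, assuming n is a multiple of 8.
uint32_t sum_array_avx2(const uint32_t *a, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(a + i));
        acc = _mm256_add_epi32(acc, v);   // vertical add: 8 independent partial sums
    }
    return hsum_8x32(acc);                // one horizontal sum, after the loop
}

(In practice you'd also want multiple accumulators to hide vpaddd latency, but that's beside the point here.)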


General strategy for all SSE / AVX / AVX512

You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.

Your shuffles can take immediate control operands, not vectors for vpermd. e.g. VEXTRACTI32x8, vextracti128, and vpshufd. (Or vpunpckhqdq to save code size for the immediate constant.)

See Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also includes some integer versions).

This general strategy is appropriate for all element types: float, double, and any size integer.

Special cases (starting steps sketched in code after this list):

  • 8-bit integer: start with vpsadbw, more efficient and avoids overflow, but then continue as for 64-bit integers.

  • 16-bit integer: start by widening to 32-bit with pmaddwd (_mm256_madd_epi16 with set1_epi16(1)): see SIMD: Accumulate Adjacent Pairs. That's fewer uops even if you don't care about the avoiding-overflow benefit, except on AMD before Zen2 where 256-bit instructions cost at least 2 uops. Then continue as for 32-bit integers.
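
Sketches of those two starting steps (the helper names are mine; the intrinsics are the real AVX2 ones):

#include <immintrin.h>

// 8-bit: vpsadbw against zero sums each group of 8 bytes into a 64-bit element;
// continue from here as for a 64-bit integer hsum.
static inline __m256i widen_sum_epu8(__m256i v)
{
    return _mm256_sad_epu8(v, _mm256_setzero_si256());
}

// 16-bit: vpmaddwd with a vector of 1s adds adjacent pairs into 32-bit elements;
// continue from here as for a 32-bit integer hsum.
static inline __m256i widen_sum_epi16(__m256i v)
{
    return _mm256_madd_epi16(v, _mm256_set1_epi16(1));
}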

32-bit integer can be done manually like this, with an SSE2 function called by the AVX2 function after reducing to __m128i, in turn called by the AVX512 function after reducing to __m256i. The calls will of course inline in practice.

#include <immintrin.h>
#include <stdint.h>

// from my earlier answer, with tuning for non-AVX CPUs removed
// static inline
uint32_t hsum_epi32_avx(__m128i x)
{
    __m128i hi64  = _mm_unpackhi_epi64(x, x);           // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));    // Swap the low two elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);       // movd
}

// only needs AVX2
uint32_t hsum_8x32(__m256i v)
{
    __m128i sum128 = _mm_add_epi32( 
                 _mm256_castsi256_si128(v),
                 _mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AVX512VL instruction if AVX512 is enabled :/
    return hsum_epi32_avx(sum128);
}

// AVX512
uint32_t hsum_16x32(__m512i v)
{
    __m256i sum256 = _mm256_add_epi32( 
                 _mm512_castsi512_si256(v),  // low half
                 _mm512_extracti64x4_epi64(v, 1));  // high half.  AVX512F.  32x8 version is AVX512DQ
    return hsum_8x32(sum256);
}

Notice that this uses __m256i hsum as a building block for __m512i; there's nothing to be gained by doing in-lane operations first.

Well, possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing ones, so they could execute 2 cycles earlier, leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles come just a couple of instructions later even if you do that. So you might get a handful of independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).

But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.

This compiles on Godbolt to these instructions, with GCC 9.2 -O3 -march=skylake-avx512:

hsum_16x32(long long __vector(8)):
        vextracti64x4   ymm1, zmm0, 0x1
        vpaddd  ymm0, ymm1, ymm0
        vextracti64x2   xmm1, ymm0, 0x1   # silly compiler uses a longer EVEX instruction when it's available (AVX512VL)
        vpaddd  xmm0, xmm0, xmm1
        vpunpckhqdq     xmm1, xmm0, xmm0
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 177
        vpaddd  xmm0, xmm1, xmm0
        vmovd   eax, xmm0
        ret

P.S.: perf analysis of GCC's _mm512_reduce_add_epi32 vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:

After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123] or something.
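
For instance, a hypothetical caller like this (whether the compiler actually folds the +123 into an lea depends on the compiler and surrounding code):

#include <immintrin.h>
#include <stdint.h>

// After inlining, the reduce's final scalar add and the +123 could be
// combined into a single lea.
uint32_t reduce_plus_123(__m512i v)
{
    return (uint32_t)_mm512_reduce_add_epi32(v) + 123;
}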

But other than that, it seems almost always worse than the shuffle / vpaddd / vmovd at the end of my implementation, on Skylake-X:

  • total uops: reduce: 4. Mine: 3
  • ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar add)
  • ports: mine: p5, p015 (vpaddd on SKX), p0 (vmovd)

Latency is equal at 4 cycles, assuming no resource conflicts:

  • shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
  • vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.
Eusporangiate answered 7/2, 2020 at 8:26 Comment(4)
Thank you for your very detailed answer. Have a good day, Peter! I made a version the same as yours, but without the narrowing-down part. According to your answer, I should improve it. A small follow-up question: how can you compare the runtime of vextracti64x2 and vextracti128? The finder website does not specify the latency of most AVX512 functions.Botha
@thnghh: The timings on Intel's intrinsics guide are usually correct, but it doesn't even give uop counts. (Intel Intrinsics guide - Latency and Throughput). It's not useful for more than a rough guideline. Updated my answer with links to uops.info (very detailed, and should be free from typos because it's machine-generated) and Agner Fog's experimental results (easy to search, occasional typos and even inaccuracies like missed special cases).Eusporangiate
@Peter Cordes: Respect! Is there a good answer to the same question for single- and double-precision floating point numbers?Meniscus
@egyik: yes, already linked from this answer: Fastest way to do horizontal SSE vector sum (or other reduction). Like I said in this answer, just use the equivalent _ps intrinsics instead of _epi32 to reduce down to a __m128 or __m128d vector.Eusporangiate
