Fastest way to do horizontal vector sum with AVX instructions [duplicate]

Asked 19/3, 2012 at 18:11 Answered 14/8, 2014 at 15:45

Solved x86 sse simd avx vector-processing

I have a packed vector of four 64-bit floating-point values.
I would like to get the sum of the vector's elements.

With SSE (and using 32-bit floats) I could just do the following:

v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum, v_sum);

Unfortunately, even though AVX provides a _mm256_hadd_pd intrinsic, its result differs from what the SSE version would suggest. I believe this is because most AVX instructions behave like the SSE instruction applied to the low and high 128-bit lanes separately, without ever crossing the 128-bit boundary.
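To make the in-lane behaviour concrete, here is a minimal sketch. The helper name hadd_pd_demo and the AVX_TARGET macro are made up for illustration; the target attribute is a GCC/Clang feature that I am assuming is available, so the file builds even without -mavx.

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang: enable AVX for the marked functions only, so no -mavx is needed.
   Other compilers are assumed to accept AVX intrinsics without it. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: out[0..3] = _mm256_hadd_pd(a, b), where a and b
   are plain arrays of four doubles. */
AVX_TARGET void hadd_pd_demo(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    /* Each 128-bit lane is handled on its own:
       low lane  -> { a0+a1, b0+b1 }
       high lane -> { a2+a3, b2+b3 } */
    _mm256_storeu_pd(out, _mm256_hadd_pd(va, vb));
}
```

With a = b = {1, 2, 3, 4} this yields {3, 3, 7, 7}, and applying hadd to that result again gives {6, 6, 14, 14}: the two lane sums never meet, which is why the SSE double-hadd trick does not carry over directly.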

Ideally, the solution I am looking for should follow these guidelines:
1) only use AVX/AVX2 instructions. (no SSE)
2) do it in no more than 2-3 instructions.

However, any efficient or elegant way to do it is welcome, even if it does not follow the above guidelines.

Thanks a lot for any help.

-Luigi Castelli

Dukedom answered 19/3, 2012 at 18:11 Comment(1)

Start with _mm256_extractf128_ps, _mm_add_ps the two halves together, then use the existing methods for reducing a 128b vector. – Exon 12/4, 2016 at 21:11
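For the four doubles in the question, the approach in this comment might look like the sketch below, using the _pd variants of the intrinsics instead of the _ps ones named above. The names hsum256_pd, hsum4 and AVX_TARGET are made up for illustration, and the target attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang: enable AVX for the marked functions only, so no -mavx is needed.
   Other compilers are assumed to accept AVX intrinsics without it. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Sum the four doubles of v by narrowing 256 -> 128 -> 64 bits. */
AVX_TARGET static double hsum256_pd(__m256d v)
{
    __m128d lo   = _mm256_castpd256_pd128(v);     /* { v0, v1 }, no instruction */
    __m128d hi   = _mm256_extractf128_pd(v, 1);   /* { v2, v3 } */
    __m128d pair = _mm_add_pd(lo, hi);            /* { v0+v2, v1+v3 } */
    __m128d high = _mm_unpackhi_pd(pair, pair);   /* { v1+v3, v1+v3 } */
    return _mm_cvtsd_f64(_mm_add_sd(pair, high)); /* (v0+v2) + (v1+v3) */
}

/* Hypothetical wrapper taking a plain array, for callers built without AVX. */
AVX_TARGET double hsum4(const double *p)
{
    assert(p);
    return hsum256_pd(_mm256_loadu_pd(p));
}
```

The 128-bit intrinsics here are nominally SSE2, but when AVX is enabled the compiler emits their VEX-encoded forms, so this should still satisfy the spirit of the "no SSE" guideline.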

If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do:

__m256d x1, x2;
// calculate 4 two-element horizontal sums:
// lower 64 bits contain x1[0] + x1[1]
// next 64 bits contain x2[0] + x2[1]
// next 64 bits contain x1[2] + x1[3]
// next 64 bits contain x2[2] + x2[3]
__m256d sum = _mm256_hadd_pd(x1, x2);
// extract upper 128 bits of result
__m128d sum_high = _mm256_extractf128_pd(sum, 1);
// add upper 128 bits of sum to its lower 128 bits
__m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
// lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
// upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]

So it looks like 3 instructions will do 2 of the horizontal sums that you need. The above is untested, but you should get the concept.
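Since the snippet is described as untested, here is one way it could be wrapped so it can be checked. This is only a sketch: the function name hsum2x4, the array-based interface and the AVX_TARGET macro are my own additions (the target attribute assumes GCC or Clang), not part of the answer.

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang: enable AVX for the marked functions only, so no -mavx is needed.
   Other compilers are assumed to accept AVX intrinsics without it. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical wrapper around the three-instruction sequence:
   out[0] = x1[0] + x1[1] + x1[2] + x1[3]
   out[1] = x2[0] + x2[1] + x2[2] + x2[3] */
AVX_TARGET void hsum2x4(const double *x1, const double *x2, double *out)
{
    assert(x1 && x2 && out);
    /* { x1[0]+x1[1], x2[0]+x2[1], x1[2]+x1[3], x2[2]+x2[3] } */
    __m256d sum = _mm256_hadd_pd(_mm256_loadu_pd(x1), _mm256_loadu_pd(x2));
    /* upper 128 bits: { x1[2]+x1[3], x2[2]+x2[3] } */
    __m128d sum_high = _mm256_extractf128_pd(sum, 1);
    /* add the upper half to the lower half */
    __m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
    _mm_storeu_pd(out, result);
}
```

For x1 = {1, 2, 3, 4} and x2 = {10, 20, 30, 40} the two outputs are 10 and 100.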

Kohlrabi answered 19/3, 2012 at 19:24 Comment(6)

Instead of using casting with (__m128d) I think you should use _mm256_castpd256_pd128(sum). – Mallee 5/12, 2013 at 17:22

What compiler did you use for this? I asked a question about your cast here: c-style-cast-versus-intrinsic-cast. I don't have access to a system to test this on right now. – Mallee 5/12, 2013 at 18:27

@Zboson: I edited the snippet to use the more standard casting method. – Kohlrabi 12/2, 2014 at 13:25

This is less efficient than possible on Intel CPUs, and very inefficient on AMD CPUs. See "Get sum of values stored in __m256d with SSE/AVX". It's not wrong, but the question asked for "fastest", not simple or compact / small code-size. – Exon 29/1, 2019 at 19:50

I think I misread this earlier. You're doing 2 separate horizontal sums. That's actually good on Intel, and maybe ok on Zen or Zen2. But not if you're ultimately going to add the two halves of __m128d result to each other. So to make good use of this, you need to have two independent things you need to hsum. – Exon 31/3, 2020 at 6:19

simple typo (can't edit), I believe in the line __m128d sum_high = _mm256_extractf128_pd(sum1, 1); sum1 should be sum – Charissacharisse 11/5, 2022 at 13:53

If you want just the sum, and a bit of scalar code is acceptable:

__m256d x;
__m256d s = _mm256_hadd_pd(x,x);
return ((double*)&s)[0] + ((double*)&s)[2];

Aftergrowth answered 5/12, 2013 at 1:39 Comment(2)

Smells like undefined behavior with that cast. – Pryer 21/10, 2018 at 23:5

@wingerse: Indeed, support for Intel intrinsics does not imply well-defined behaviour for pointing a plain type into a __mxxx type, only the other way around. See "Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?" This is strict-aliasing UB, and may break in practice on GCC or clang. – Exon 9/5, 2022 at 21:41
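One way to keep the same idea while avoiding the pointer cast is to copy the vector out with a store intrinsic and add the scalars afterwards. A minimal sketch, assuming GCC or Clang for the target attribute; hsum_hadd_store and AVX_TARGET are made-up names:

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang: enable AVX for the marked functions only, so no -mavx is needed.
   Other compilers are assumed to accept AVX intrinsics without it. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical variant of the hadd + scalar add idea without the cast. */
AVX_TARGET double hsum_hadd_store(const double *p)
{
    assert(p);
    __m256d x = _mm256_loadu_pd(p);
    __m256d s = _mm256_hadd_pd(x, x);  /* { x0+x1, x0+x1, x2+x3, x2+x3 } */
    double tmp[4];
    _mm256_storeu_pd(tmp, s);          /* copy the vector into a plain array */
    return tmp[0] + tmp[2];            /* (x0+x1) + (x2+x3) */
}
```

The store intrinsic is the documented way to get data out of a __m256d, so no double* ever aliases the vector object.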

Assuming you have a __m256d vector containing 4 packed doubles a0, a1, a2, a3, and you would like to calculate the sum of its components, i.e. a0 + a1 + a2 + a3, here's another AVX solution:

// goal to calculate a0 + a1 + a2 + a3
__m256d values = _mm256_set_pd(23211.24, -123.421, 1224.123, 413.231);

// assuming _mm256_hadd_pd(a, b) == a0 + a1, b0 + b1, a2 + a3, b2 + b3 (5 cycles) ...
values = _mm256_hadd_pd(values, _mm256_permute2f128_pd(values, values, 1));
// ^^^^^^^^^^^^^^^^^^^^ a0 + a1, a2 + a3, a2 + a3, a0 + a1

values = _mm256_hadd_pd(values, values);
// ^^^^^^^^^^^^^^^^^^^^ (a0 + a1 + a2 + a3), (a0 + a1 + a2 + a3), (a2 + a3 + a0 + a1), (a2 + a3 + a0 + a1)

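// Because addition is commutative, (a0 + a1) + (a2 + a3) and (a2 + a3) + (a0 + a1) are identical, so each component of values contains the sum of all its initial components (11 cycles) to calculate, (1-2 cycles) to extract, total (12-13 cycles)
// Floating-point addition is not associative, so the expected value is summed in the same order as the vector code (a0 is the last argument of _mm256_set_pd)
double got = _mm_cvtsd_f64(_mm256_castpd256_pd128(values)), exp = (413.231 + 1224.123) + (-123.421 + 23211.24);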

if (got != exp || _mm256_movemask_pd(_mm256_cmp_pd(values, _mm256_set1_pd(exp), _CMP_EQ_OS)) != 0b1111)
    printf("Failed to sum double components, exp: %f, got %f\n", exp, got);
else
    printf("ok\n");

This solution leaves the sum broadcast to all four elements, which may be useful.

If I misinterpreted the problem I do apologize.

$ uname -a
Darwin Samys-MacBook-Pro.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64

$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix

Pore answered 14/8, 2014 at 15:45 Comment(1)

If you want the resulting sum broadcast to all elements, hadd is still not the most efficient way. You only need 1 shuffle + add, not HADD's 2x shuffle uops that feed one add. e.g. val = add(val, permute2f128(val, val, 1)); then swap within 128-bit lanes with _mm256_shuffle_pd to feed another add. On Intel that's only 4 uops total, same as in "Get sum of values stored in __m256d with SSE/AVX". – Exon 29/1, 2019 at 19:53
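The shuffle + add sequence described in this comment could be sketched as follows. The names hsum_broadcast and AVX_TARGET are made up, the array interface is only there so it can be called from code built without AVX, and the target attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>

/* GCC/Clang: enable AVX for the marked functions only, so no -mavx is needed.
   Other compilers are assumed to accept AVX intrinsics without it. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: every element of out[0..3] receives p[0]+p[1]+p[2]+p[3]. */
AVX_TARGET void hsum_broadcast(const double *p, double *out)
{
    assert(p && out);
    __m256d v = _mm256_loadu_pd(p);
    /* swap the 128-bit halves and add: { v0+v2, v1+v3, v2+v0, v3+v1 } */
    v = _mm256_add_pd(v, _mm256_permute2f128_pd(v, v, 1));
    /* swap the two doubles inside each lane (imm8 = 0x5) and add */
    v = _mm256_add_pd(v, _mm256_shuffle_pd(v, v, 0x5));
    _mm256_storeu_pd(out, v);
}
```

Because each step only swaps operands of a commutative add, all four output elements come out bit-identical.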
