Fastest way to do horizontal vector sum with AVX instructions [duplicate]
Asked Answered
D

3

23

I have a packed vector of four 64-bit floating-point values.
I would like to get the sum of the vector's elements.

With SSE (and using 32-bit floats) I could just do the following:

v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum, v_sum);

Unfortunately, even though AVX features a _mm256_hadd_pd instruction, it differs in the result from the SSE version. I believe this is due to the fact that most AVX instructions work as SSE instructions for each low and high 128-bits separately, without ever crossing the 128-bit boundary.

Ideally, the solution I am looking for should follow these guidelines:
1) only use AVX/AVX2 instructions. (no SSE)
2) do it in no more than 2-3 instructions.

However, any efficient/elegant way to do it (even without following the above guidelines) is always well accepted.

Thanks a lot for any help.

-Luigi Castelli

Dukedom answered 19/3, 2012 at 18:11 Comment(1)
Start with _mm256_extractf128_ps, _mm_add_ps the two halves together, then use the existing methods for reducing a 128b vector.Exon
K
30

If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do:

__m256d x1, x2;
// calculate 4 two-element horizontal sums:
// lower 64 bits contain x1[0] + x1[1]
// next 64 bits contain x2[0] + x2[1]
// next 64 bits contain x1[2] + x1[3]
// next 64 bits contain x2[2] + x2[3]
__m256d sum = _mm256_hadd_pd(x1, x2);
// extract upper 128 bits of result
__m128d sum_high = _mm256_extractf128_pd(sum, 1);
// add upper 128 bits of sum to its lower 128 bits
__m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
// lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
// upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]

So it looks like 3 instructions will do 2 of the horizontal sums that you need. The above is untested, but you should get the concept.

Kohlrabi answered 19/3, 2012 at 19:24 Comment(6)
Instead of using casting with (__m128d) I think you should use _mm256_castpd256_pd128(sum).Mallee
What compiler did you use for this? I asked a question about your cast here c-style-cast-versus-intrinsic-cast. I don't have access to system to test this on right now.Mallee
@Zboson: I edited the snippet to use the more standard casting method.Kohlrabi
This is less efficient than possible on Intel CPUs, and very inefficient on AMD CPUs. See Get sum of values stored in __m256d with SSE/AVX. It's not wrong, but the question asked for "fastest", not simple or compact / small code-size.Exon
I think I misread this earlier. You're doing 2 separate horizontal sums. That's actually good on Intel, and maybe ok on Zen or Zen2. But not if you're ultimately going to add the two halves of __m128d result to each other. So to make good use of this, you need to have two independent things you need to hsum.Exon
simple typo (can't edit), I believe in the line __m128d sum_high = _mm256_extractf128_pd(sum1, 1); sum1 should be sumCharissacharisse
A
4

If you want just the sum, and a bit of scalar code is acceptable:

__m256d x;
__m256d s = _mm256_hadd_pd(x,x);
return ((double*)&s)[0] + ((double*)&s)[2];
Aftergrowth answered 5/12, 2013 at 1:39 Comment(2)
Smells like undefined behavior with that cast.Pryer
@wingerse: Indeed, support for Intel intrinsics does not imply well-defined behaviour for pointing a plain type into a __mxxx type, only the other way around. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? This is strict-aliasing UB, and may break in practice on GCC or clang.Exon
P
4

Assuming the following, that you have a __m256d vector containing 4 packed doubles and you would like to calculate the sum of its components, ie a0, a1, a2, a3 is each double component you would like a0 + a1 + a2 + a3 then heres another AVX solution:

// goal to calculate a0 + a1 + a2 + a3
__m256d values = _mm256_set_pd(23211.24, -123.421, 1224.123, 413.231);

// assuming _mm256_hadd_pd(a, b) == a0 + a1, b0 + b1, a2 + a3, b2 + b3 (5 cycles) ...
values = _mm256_hadd_pd(values, _mm256_permute2f128_pd(values, values, 1));
// ^^^^^^^^^^^^^^^^^^^^ a0 + a1, a2 + a3, a2 + a3, a0 + a1

values = _mm256_hadd_pd(values, values);
// ^^^^^^^^^^^^^^^^^^^^ (a0 + a1 + a2 + a3), (a0 + a1 + a2 + a3), (a2 + a3 + a0 + a1), (a2 + a3 + a0 + a1)

// Being that addition is associative then each component of values contains the sum of all its initial components (11 cycles) to calculate, (1-2 cycles) to extract, total (12-13 cycles)
double got = _mm_cvtsd_f64(_mm256_castpd256_pd128(values)), exp = (23211.24 + -123.421 + 1224.123 + 413.231);

if (got != exp || _mm256_movemask_pd(_mm256_cmp_pd(values, _mm256_set1_pd(exp), _CMP_EQ_OS)) != 0b1111)
    printf("Failed to sum double components, exp: %f, got %f\n", exp, got);
else
    printf("ok\n");

This solution has the sum broadcasted which maybe useful ...

If I misinterpreted the problem I do apologize.

$ uname -a
Darwin Samys-MacBook-Pro.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64

$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
Pore answered 14/8, 2014 at 15:45 Comment(1)
If you want the result sum to all elements, hadd is still not the most efficient way. You only need 1 shuffle + add, not HADD's 2x shuffle uops that feed one add. e.g. val = add(val, permute2f128(val,val 1)); then swap within 128-bit lanes with _mm256_shuffle_pd to feed another add. On Intel that's only 4 uops total, same as Get sum of values stored in __m256d with SSE/AVXExon

© 2022 - 2024 — McMap. All rights reserved.