Accumulate a vector of integers with SSE
I tried to change this code to handle std::vector<int>.

float accumulate(const std::vector<float>& v)
{
 // copy the length of v and a pointer to the data onto the local stack
 const size_t N = v.size();
 const float* p = (N > 0) ? &v.front() : NULL;

 __m128 mmSum = _mm_setzero_ps();
 size_t i = 0;

 // unrolled loop that adds up 4 elements at a time
 for(; i < ROUND_DOWN(N, 4); i+=4)
 {
  mmSum = _mm_add_ps(mmSum, _mm_loadu_ps(p + i));
 }

 // add up single values until all elements are covered
 for(; i < N; i++)
 {
  mmSum = _mm_add_ss(mmSum, _mm_load_ss(p + i));
 }

 // add up the four float values from mmSum into a single value and return
 mmSum = _mm_hadd_ps(mmSum, mmSum);
 mmSum = _mm_hadd_ps(mmSum, mmSum);
 return _mm_cvtss_f32(mmSum);
}

Ref: http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html

I changed _mm_setzero_ps to _mm_setzero_si128, _mm_loadu_ps to _mm_loadl_epi64, and _mm_add_ps to _mm_add_epi64.

I get this error:

error: cannot convert ‘const int*’ to ‘const __m128i* {aka const __vector(2) long long int*}’ for argument ‘1’ to ‘__m128i _mm_loadl_epi64(const __m128i*)’
         mmSum = _mm_add_epi64(mmSum, _mm_loadl_epi64(p + i + 0));
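
(As the answers below show, the conversion fails because the load intrinsics take a __m128i pointer, so the int pointer needs a cast; and for 32-bit int the epi32 intrinsics are the ones to use rather than epi64. A sketch of the corrected loop body, reusing the names from the code above; this is not the original code:)

// sketch only: cast the int pointer to __m128i* and use the 32-bit integer intrinsics
for (; i < ROUND_DOWN(N, 4); i += 4)
{
    mmSum = _mm_add_epi32(mmSum, _mm_loadu_si128((const __m128i *)(p + i)));
}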

I am a novice in this field. Is there any good source for learning these things?

Disqualification answered 7/10, 2015 at 11:32 Comment(17)
You probably want xxx_epi32 intrinsics, since int is typically 32 bits. And your loads should be _mm_loadu_si128.X
@PaulR Shouldn't using __m128i instead of int work with the code he already has?Jerry
@SimonKraemer: I think it's going to take a lot more than that - not all float (_mm_xxx_ps) intrinsics have an int (_mm_xxx_epi32) equivalent, for example (e.g. _mm_load_ss).X
"I am novice in this field. Is there any good source to learn these things?" -- try searching the [sse] tag right here on StackOverflow - there are lots of good questions and answers and some useful code examples - you can probably learn a lot from these.X
Is there a reason why you need to do this with SSE? For such a trivial operation you will be bound by memory bandwidth with or without SSE, and with SSE you also need a horizontal add at the end to get the correct result. It's more complicated, and it likely runs no faster. Even more so as -- unless the numbers are very small -- you cannot sum a large number of elements without encountering overflow. So... readable C++ code is just as fast, and, well... readable. I'm inclined to call this "premature optimization" par excellence.Abradant
@Damon, I would argue that this is so trivial that a compiler will vectorize this anyway with the right flags.Fiesta
@Zboson: Agree, if alignment allows for it. But still it will make no difference. You have 16 integers (4 SSE registers) worth of data in a cache line, and it takes 150-200 cycles for the prefetcher to load the next cache line. So unless you do something worth at least a hundred cycles, it's entirely nonsensical to even think about optimizing that. Four add operations (or even 16 of them) are nowhere near that ballpark.Abradant
@Damon, yes, which is one reason I'm skeptical of auto-vectorization in the first place. In most cases where it works it's memory bandwidth bound, and in the few cases where it's not memory bound auto-vectorization does not work like you want (which is why intrinsics are useful). So in the end you need to do it by hand. What the OP is doing is only useful for education.Fiesta
I used std::accumulate to add the float numbers. The SSE version was ~3 times faster. For integers (the current function), it is ~4 times faster. I used g++ and -O3.Disqualification
What was your optimization level? You need -O3 or -Ofast for vectorization. The compiler also won't unroll the loop unless you allow associative math, e.g. with -Ofast; you might also have to enable -funroll-loops.Fiesta
Also what size was N you used to get a 3x speed up?Fiesta
I tried -Ofast as well. SSE4: 12 ms, accumulate: 52 ms. For 30000000 numbers (instead of 10 in the for loop below).Disqualification
@user1436187: That is possible if you do a lot of int-float conversions (which is not the same as the SSE code below), or if the complete dataset fits into L1 and the cache is warm (but then it's kind of pointless). Note that you can trivially make the C++ version twice as fast too (presuming the data is in cache), simply by summing odd and even elements separately and adding odd+even at the end; see the sketch after these comments. That removes the data dependency and allows out-of-order execution (it works with 3 or 4 accumulators too, btw).Abradant
@Damon, or you could just let the compiler unroll the loop rather than do it by hand... and it's more readable. I wrote a memory bandwidth tool for small sizes (fits in L1) as well as very large sizes (much larger than the LLC) and found that I could get the compiler to reach almost optimal performance (it's possible to calculate the maximum performance in this case, so I could compare against it) for simple reductions with the right compiler options.Fiesta
Any difference between compilers in doing the auto-vectorization?Disqualification
GCC, ICC, and Clang I think are all pretty good. MSVC 2013 was OK with vectorization. In any case you should look at the assembly.Fiesta
@user1436187, I have to take back a few things I said. It turns out that the only compiler which will vectorize your code and unroll the loop to four partial sums is Clang. I think this explains why you still see a 3x speed up with your own unrolled intrinsic code in GCC. It would be interesting to compare your results with Clang.Fiesta
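
To illustrate the odd/even idea from the comments, a minimal scalar sketch (not code from the thread; sum_two_accumulators is a made-up name):

#include <cstddef>

int sum_two_accumulators(const int* p, std::size_t n)
{
    int even = 0, odd = 0;
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        even += p[i];      // even-index elements
        odd  += p[i + 1];  // odd-index elements
    }
    if (i < n)
        even += p[i];      // leftover element when n is odd
    return even + odd;     // combine the two independent partial sums
}

The two accumulators have no dependency on each other, so the additions can overlap in the pipeline; the same idea extends to three or four accumulators.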

Here is an int version which I just threw together:

#include <iostream>
#include <vector>

#include <smmintrin.h>  // SSE4

#define ROUND_DOWN(m, n) ((m) & ~((n) - 1))

static int accumulate(const std::vector<int>& v)
{
    // copy the length of v and a pointer to the data onto the local stack
    const size_t N = v.size();
    const int* p = (N > 0) ? &v.front() : NULL;

    __m128i mmSum = _mm_setzero_si128();
    int sum = 0;
    size_t i = 0;

    // unrolled loop that adds up 4 elements at a time
    for(; i < ROUND_DOWN(N, 4); i+=4)
    {
        mmSum = _mm_add_epi32(mmSum, _mm_loadu_si128((__m128i *)(p + i)));
    }

    // add up the four int values from mmSum into a single value
    mmSum = _mm_hadd_epi32(mmSum, mmSum);
    mmSum = _mm_hadd_epi32(mmSum, mmSum);
    sum = _mm_extract_epi32(mmSum, 0);

    // add up single values until all elements are covered
    for(; i < N; i++)
    {
        sum += p[i];
    }

    return sum;
}

int main()
{
    std::vector<int> v;

    for (int i = 0; i < 10; ++i)
    {
        v.push_back(i);
    }

    int sum = accumulate(v);

    std::cout << sum << std::endl;

    return 0;
}

Compile and run:

$ g++ -Wall -msse4 -O3 accumulate.cpp && ./a.out 
45
X answered 7/10, 2015 at 12:32 Comment(18)
The main thing I would add is to let the compiler do this. Compile with vectorization enabled (-O3) and -funroll-loops since this has a dependency chain. It's also worth pointing out that for floats the compiler won't even unroll unless associative math is enabled e.g. with -Ofast.Fiesta
@Zboson Aren't there alignment issues as well? Are there any guarantees that vector contents are going to be correctly aligned?Plot
@user1095108, look at the assembly and see what the compiler did. I don't think alignment is an issue. The compiler knows how to handle misalignment as well as non-multiples of the SIMD width (four for SSE). If you want the compiler to use aligned loads you can use this __builtin_assume_aligned. See this sum of overlapping arrays, auto-vectorization, and restrict.Fiesta
@Zboson I was just thinking (I'm not the OP) whether it would be better to store aligned ints in the vector in this case.Plot
@user1095108, oh, you mean can you align std::vector? I don't know. In the cases where I want SIMD I don't use std::vector.Fiesta
@Zboson no, I meant ::std::vector<alignas(16) int>, I'd align the elements of the vector not the vector itself, though this could also be done, for unclear reasons.Plot
@user1095108: I don't think that's valid syntax though ?X
@PaulR Yeah. I don't really know how to force ::std::vector to contain aligned ints, I was simply trying to illustrate what I meant.Plot
@user1095108: the normal way is to pass a custom allocator to the std::vector constructor (a sketch follows these comments). See e.g. this question.X
I updated my answer with the assembly for each case including -funroll-loops. This shows that GCC clearly does not unroll unless you use -funroll-loops. But it's not any faster with unrolling. This does not make sense to me. I have not looked into this for about a year, so I'm a bit rusty. Any ideas? Cache size could be an issue, alignment... I'll probably figure it out later.Fiesta
May just be memory bandwidth limited?X
@PaulR, it's right in the assembly (look at the end of my answer)! It's not doing partial sums. It's still a dependency chain. I'm pretty sure I looked at this a year ago and GCC was creating partial sums. No wonder it gets the same speed as the unrolled version.Fiesta
I don't understand what's wrong with GCC. It looks like GCC unrolls but still thinks the math is not associative, so each unrolled iteration still depends on the one before it. What's wrong with GCC?Fiesta
@Zboson: I just checked with ICC and that seems to do a much better job - it unrolls the loop without being told to and doesn't have the serial dependencies like gcc.X
@PaulR, thanks for checking! Did you have to enable -Ofast? I think I read that ICC assumes associative math so I think -O3 should be fine. I don't have ICC but I sometimes check it at gcc.godbolt.org. Let me try thatFiesta
@Zboson: No, I just used icpc -O3 -mavx.X
unroll-loop-and-do-independent-sum-with-vectorizationFiesta
Clang seems to unroll 4 times with SSE but its AVX code looks ugly. At least Clang has the right idea. I did not even have to use -funroll-loops. I should not have to anyway; the compiler should already know to unroll.Fiesta
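
A minimal sketch of the custom-allocator approach mentioned in the comments (aligned_allocator and aligned_ivec are made-up names; this assumes C++17 aligned operator new):

#include <cstddef>
#include <new>
#include <vector>

template <class T, std::size_t Align>
struct aligned_allocator {
    using value_type = T;

    aligned_allocator() = default;
    template <class U>
    aligned_allocator(const aligned_allocator<U, Align>&) noexcept {}

    // rebind is spelled out because the default deduction in allocator_traits
    // does not handle the non-type Align parameter
    template <class U>
    struct rebind { using other = aligned_allocator<U, Align>; };

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Align}));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        ::operator delete(p, n * sizeof(T), std::align_val_t{Align});
    }
};

template <class T, class U, std::size_t A>
bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return false; }

// every buffer handed out is 16-byte aligned, so aligned SSE loads are safe
using aligned_ivec = std::vector<int, aligned_allocator<int, 16>>;

With the element buffer guaranteed 16-byte aligned, the compiler (or hand-written intrinsics) can use aligned loads.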

The ideal way to do this is to let the compiler auto-vectorize your code and keep your code simple and readable. You should not need anything more than

int sum = 0;
for(int i=0; i<v.size(); i++) sum += v[i];

The link you pointed to, http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html, does not seem to understand how to make the compiler vectorize the code.

For floating point, which is what that link uses, what you need to know is that floating point arithmetic is not associative and the result of a reduction therefore depends on the order in which you do it. GCC, MSVC, and Clang will not auto-vectorize a reduction unless you tell them to use a different floating point model; otherwise your result could depend on your hardware. ICC, however, defaults to associative floating point math, so it will vectorize the code with just -O3.
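
A tiny example (mine, not from the original answer) of why the order of the additions matters for floats:

#include <cstdio>

int main()
{
    float a = 1e20f, b = -1e20f, c = 1.0f;
    std::printf("%g\n", (a + b) + c);  // prints 1: a and b cancel first, c survives
    std::printf("%g\n", a + (b + c));  // prints 0: c is absorbed into b, then a cancels b
    return 0;
}

A vectorized reduction reorders the additions in exactly this way, which is why compilers treat it as unsafe by default.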

Not only will GCC, MSVC, and Clang refuse to vectorize unless associative math is allowed, they also won't unroll the loop into partial sums to overcome the latency of the summation. Only Clang and ICC unroll to partial sums: Clang unrolls four times and ICC twice.

One way to enable associative floating point arithmetic with GCC is the -Ofast flag. With MSVC use /fp:fast.

I tested the code below with GCC 4.9.2 on a Xeon E5-1620 (IVB) @ 3.60GHz, Ubuntu 15.04.

-O3 -mavx -fopenmp                       0.93 s
-Ofast -mavx -fopenmp                    0.19 s
-Ofast -mavx -fopenmp -funroll-loops     0.19 s

That's about a five times speed-up. Although GCC does unroll the loop eight times, it does not do independent partial sums (see the assembly below). This is the reason the unrolled version is no better.

I only used OpenMP for its convenient cross-platform/compiler timing function: omp_get_wtime().

Another advantage of auto-vectorization is that it works for AVX simply by enabling a compiler switch (e.g. -mavx). Otherwise, if you wanted AVX, you would have to rewrite your code to use the AVX intrinsics and maybe ask another question on SO about how to do this.
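
For illustration, a hand-written AVX float version of the same reduction might look roughly like the sketch below (sum_avx is a made-up name; this is not code from the question or the link):

#include <cstddef>
#include <immintrin.h>
#include <vector>

static float sum_avx(const std::vector<float>& v)
{
    const float* p = v.data();
    const std::size_t N = v.size();

    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;

    // 8 floats per 256-bit register
    for (; i + 8 <= N; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

    // reduce the 8 lanes to a single value
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);

    // leftover elements
    for (; i < N; i++)
        sum += p[i];

    return sum;
}

None of this is needed when the compiler auto-vectorizes: the plain scalar loop plus -mavx produces equivalent code.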

So currently the only compiler which will auto-vectorize your loop as well as unroll to four partial sums is Clang. See the code and assembly at the end of this answer.


Here is the code I used to test the performance

#include <stdio.h>
#include <omp.h>
#include <vector>

float sumf(float *x, int n)
{
  float sum = 0;
  for(int i=0; i<n; i++) sum += x[i];
  return sum;
}

#define N 10000 // the link used this value
int main(void)
{
  std::vector<float> x;
  for(int i=0; i<N; i++) x.push_back(1 -2*(i%2==0));
  //float x[N]; for(int i=0; i<N; i++) x[i] = 1 -2*(i%2==0);                                                                                                                                                        
  float sum = 0;
  sum += sumf(x.data(),N);
  double dtime = -omp_get_wtime();
  for(int r=0; r<100000; r++) {
    sum += sumf(x.data(),N);
  }
  dtime +=omp_get_wtime();
  printf("sum %f time %f\n", sum, dtime);
}

Edit:

I should have taken my own advice and looked at the assembly.

The main loop for -O3. It's clear it only does a scalar sum.

.L3:
    vaddss  (%rdi), %xmm0, %xmm0
    addq    $4, %rdi
    cmpq    %rax, %rdi
    jne .L3

The main loop for -Ofast. It does a vector sum but no unrolling.

.L8:
    addl    $1, %eax
    vaddps  (%r8), %ymm1, %ymm1
    addq    $32, %r8
    cmpl    %eax, %ecx
    ja  .L8

The main loop for -O3 -funroll-loops. It does a vector sum with an 8x unroll.

.L8:
    vaddps  (%rax), %ymm1, %ymm2
    addl    $8, %ebx
    addq    $256, %rax
    vaddps  -224(%rax), %ymm2, %ymm3
    vaddps  -192(%rax), %ymm3, %ymm4
    vaddps  -160(%rax), %ymm4, %ymm5
    vaddps  -128(%rax), %ymm5, %ymm6
    vaddps  -96(%rax), %ymm6, %ymm7
    vaddps  -64(%rax), %ymm7, %ymm8
    vaddps  -32(%rax), %ymm8, %ymm1
    cmpl    %ebx, %r9d
    ja  .L8

Edit:

Putting the following code in Clang 3.7 (-O3 -fverbose-asm -mavx)

float sumi(int *x)
{
  x = (int*)__builtin_assume_aligned(x, 64);
  int sum = 0;
  for(int i=0; i<2048; i++) sum += x[i];
  return sum;
}

produces the following assembly. Notice that it's vectorized to four independent partial sums.

sumi(int*):                              # @sumi(int*)
    vpxor   xmm0, xmm0, xmm0
    xor eax, eax
    vpxor   xmm1, xmm1, xmm1
    vpxor   xmm2, xmm2, xmm2
    vpxor   xmm3, xmm3, xmm3
.LBB0_1:                                # %vector.body
    vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax]
    vpaddd  xmm1, xmm1, xmmword ptr [rdi + 4*rax + 16]
    vpaddd  xmm2, xmm2, xmmword ptr [rdi + 4*rax + 32]
    vpaddd  xmm3, xmm3, xmmword ptr [rdi + 4*rax + 48]
    vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax + 64]
    vpaddd  xmm1, xmm1, xmmword ptr [rdi + 4*rax + 80]
    vpaddd  xmm2, xmm2, xmmword ptr [rdi + 4*rax + 96]
    vpaddd  xmm3, xmm3, xmmword ptr [rdi + 4*rax + 112]
    add rax, 32
    cmp rax, 2048
    jne .LBB0_1
    vpaddd  xmm0, xmm1, xmm0
    vpaddd  xmm0, xmm2, xmm0
    vpaddd  xmm0, xmm3, xmm0
    vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
    vpaddd  xmm0, xmm0, xmm1
    vphaddd xmm0, xmm0, xmm0
    vmovd   eax, xmm0
    vxorps  xmm0, xmm0, xmm0
    vcvtsi2ss   xmm0, xmm0, eax
    ret
Fiesta answered 7/10, 2015 at 15:57 Comment(0)
A
1
#include <cstddef>      // size_t
#include <cstdint>      // int32_t
#include <immintrin.h>  // AVX/AVX2 intrinsics

static inline int32_t accumulate(const int32_t *data, size_t size) {
  constexpr const static size_t batch = 256 / 8 / sizeof(int32_t);
  int32_t sum = 0;
  size_t pos = 0;

  if (size >= batch) {
    // 7
    __m256i mmSum = _mm256_loadu_si256((__m256i *)(data));
    pos = batch;

    // unrolled loop
    for (; pos + batch < size; pos += batch) {
      // 1 + 7
      mmSum =
          _mm256_add_epi32(mmSum, _mm256_loadu_si256((__m256i *)(data + pos)));
    }

    mmSum = _mm256_hadd_epi32(mmSum, mmSum);
    mmSum = _mm256_hadd_epi32(mmSum, mmSum);
    // 2 + 1 + 3 + 0
    sum = _mm_cvtsi128_si32(_mm_add_epi32(_mm256_extractf128_si256(mmSum, 1),
                                          _mm256_castsi256_si128(mmSum)));
  }

  // add up the remaining values
  while (pos < size) {
    sum += data[pos++];
  }
  return sum;
}
Amati answered 6/11, 2020 at 6:9 Comment(2)
That hsum is pretty inefficient; _mm256_extract_epi32(mmSum, 4); requires at least 2 instructions (e.g. vextracti128 xmm, ymm, 1 / vmovd edx, xmm0). And hadd costs 2 shuffles + 1 add, and is even worse on AMD. Fastest way to do horizontal SSE vector sum (or other reduction) has some links, including Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2.Humfried
Also, if you're going to put a _mm256_loadu_si256 inside a ternary, the compiler will have to branch on it. As written, your code will do 2x hadd + extract on a zero vector, instead of just branching over the whole SIMD part by putting it inside an if(size >= batch). Compilers might be smart and figure that out for you, but the only downside would be a deeper level of indentation for the SIMD part of the function.Humfried
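
For reference, a shuffle-based horizontal sum along the lines of the answers linked in the comment might look like this (a sketch; hsum_epi32_avx2 is a made-up name, and it requires AVX2):

#include <immintrin.h>

static inline int hsum_epi32_avx2(__m256i v)
{
    // narrow 256 -> 128 bits with one add
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    // then reduce 4 -> 2 -> 1 lanes with cheap shuffles instead of hadd
    s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}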
