SIMD XOR operation is not as effective as Integer XOR?

Asked 29/4, 2014 at 8:54 Answered 17/3, 2016 at 20:43

c++performance parallel-processing simd seeding

I have a task to calculate xor-sum of bytes in an array:

X = char1 XOR char2 XOR char3 ... charN;

I'm trying to parallelize it, xoring __m128 instead. This should give speed up factor 4. Also, to recheck the algorithm I use int. This should give speed up factor 4. The test program is 100 lines long, I can't make it shorter, but it is simple:

#include "xmmintrin.h" // simulation of the SSE instruction
#include <ctime>

#include <iostream>
using namespace std;

#include <stdlib.h> // rand

const int NIter = 100;

const int N = 40000000; // matrix size. Has to be dividable by 4.
unsigned char str[N] __attribute__ ((aligned(16)));

template< typename T >
T Sum(const T* data, const int N)
{
    T sum = 0;
    for ( int i = 0; i < N; ++i )
      sum = sum ^ data[i];
    return sum;
}

template<>
__m128 Sum(const __m128* data, const int N)
{
    __m128 sum = _mm_set_ps1(0);
    for ( int i = 0; i < N; ++i )
        sum = _mm_xor_ps(sum,data[i]);
    return sum;
}

int main() {

    // fill string by random values
  for( int i = 0; i < N; i++ ) {
    str[i] = 256 * ( double(rand()) / RAND_MAX ); // put a random value, from 0 to 255
  } 

    /// -- CALCULATE --

    /// SCALAR

  unsigned char sumS = 0;
  std::clock_t c_start = std::clock();
  for( int ii = 0; ii < NIter; ii++ )
    sumS = Sum<unsigned char>( str, N );
  double tScal = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;

    /// SIMD

  unsigned char sumV = 0;

  const int m128CharLen = 4*4;
  const int NV = N/m128CharLen;

  c_start = std::clock();
  for( int ii = 0; ii < NIter; ii++ ) {
    __m128 sumVV = _mm_set_ps1(0);
    sumVV = Sum<__m128>( reinterpret_cast<__m128*>(str), NV );
    unsigned char *sumVS = reinterpret_cast<unsigned char*>(&sumVV);

    sumV = sumVS[0];
    for ( int iE = 1; iE < m128CharLen; ++iE )
      sumV ^= sumVS[iE];
  }
  double tSIMD = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;

    /// SCALAR INTEGER

  unsigned char sumI = 0;

  const int intCharLen = 4;
  const int NI = N/intCharLen;

  c_start = std::clock();
  for( int ii = 0; ii < NIter; ii++ ) {
    int sumII = Sum<int>( reinterpret_cast<int*>(str), NI );
    unsigned char *sumIS = reinterpret_cast<unsigned char*>(&sumII);

    sumI = sumIS[0];
    for ( int iE = 1; iE < intCharLen; ++iE )
      sumI ^= sumIS[iE];
  }
  double tINT = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;

    /// -- OUTPUT --

  cout << "Time scalar: " << tScal << " ms " << endl;
  cout << "Time INT:   " << tINT << " ms, speed up " << tScal/tINT << endl;
  cout << "Time SIMD:   " << tSIMD << " ms, speed up " << tScal/tSIMD << endl;

  if(sumV == sumS && sumI == sumS )
    std::cout << "Results are the same." << std::endl;
  else
    std::cout << "ERROR! Results are not the same." << std::endl;

  return 1;
}

The typical results:

[10:46:20]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms 
Time INT:   890 ms, speed up 3.97753
Time SIMD:   280 ms, speed up 12.6429
Results are the same.
[10:46:27]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms 
Time INT:   890 ms, speed up 3.97753
Time SIMD:   280 ms, speed up 12.6429
Results are the same.
[10:46:35]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms 
Time INT:   880 ms, speed up 4.13636
Time SIMD:   290 ms, speed up 12.5517
Results are the same.

As you see, int version works ideally, but simd version loses 25% of the speed and this is stable. I tried to change the array sizes, this doesn't help.

Also, if I switch to -O2 I lose 75% of the speed in simd version:

[10:50:25]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms 
Time INT:   880 ms, speed up 4.13636
Time SIMD:   890 ms, speed up 4.08989
Results are the same.
[10:51:16]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms 
Time INT:   900 ms, speed up 4.04444
Time SIMD:   880 ms, speed up 4.13636
Results are the same.

Can someone explain me this?

Additional info:

I have g++ (GCC) 4.7.3; Intel(R) Xeon(R) CPU E7-4860

I use -fno-tree-vectorize to prevent auto vectorization. Without this flag with -O3 the expected speed up is 1, since the task is simple. This is what I get:

[10:55:40]$ g++ test.cpp -O3; ./a.out
Time scalar: 270 ms 
Time INT:   270 ms, speed up 1
Time SIMD:   280 ms, speed up 0.964286
Results are the same.

but with -O2 result is still strange:

[10:55:02]$ g++ test.cpp -O2; ./a.out
Time scalar: 3540 ms 
Time INT:   990 ms, speed up 3.57576
Time SIMD:   880 ms, speed up 4.02273
Results are the same.

When I change
```
for ( int i = 0; i < N; i+=1 )
  sum = sum ^ data[i];
```
to equivalent of:
```
for ( int i = 0; i < N; i+=8 )
  sum = (data[i] ^ data[i+1]) ^ (data[i+2] ^ data[i+3]) ^ (data[i+4] ^ data[i+5]) ^ (data[i+6] ^ data[i+7]) ^ sum;
```
i do see improvment in scalar speed by factor of 2. But I don't see improvements in speed up. Before: intSpeedUp 3.98416, SIMDSpeedUP 12.5283. After: intSpeedUp 3.5572, SIMDSpeedUP 6.8523.

Parkin answered 29/4, 2014 at 8:54 Comment(5)

can you turn on the -vec-report3 flag and see if the loops really got vectorized – Purkey 29/4, 2014 at 9:0

@arunmoezhi, what do you mean? Which loops must be vectorized?? -vec-report3 is not recognised by my gcc. – Parkin 29/4, 2014 at 9:1

the scalar version. Why didn't the compiler optimize it – Purkey 29/4, 2014 at 9:5

@arunmoezhi, because of -fno-tree-vectorize flag. – Parkin 29/4, 2014 at 9:9

try _mm_load_si128? – Sandhurst 17/3, 2016 at 17:18

I think you may be bumping into the upper limits of memory bandwidth. This might be the reason for the 12.6x speedup instead of 16x speedup in the -O3 case.

However, gcc 4.7.3 puts a useless store instruction into the tiny not-unrolled vector loop when inlining, but not in the scalar or int SWAR loops (see below), so that might be the explanation instead.

The -O2 reduction in vector throughput is all due to gcc 4.7.3 doing an even worse job there and sending the accumulator on a round trip to memory (store-forwarding).

For analysis of the implications of that extra store instruction, see the section at the end.

TL;DR: Nehalem likes a bit more loop unrolling than SnB-family requires, and gcc has made major improvements in SSE code-generation in gcc5.

And typically use _mm_xor_si128, not _mm_xor_ps for bulk xor work like this.

Memory bandwidth.

N is huge (40MB), so memory/cache bandwidth is a concern. A Xeon E7-4860 is a 32nm Nehalem microarchitecture, with 256kiB of L2 cache (per core), and 24MiB of shared L3 cache. It has a quad-channel memory controller supporting up to DDR3-1066 (compared to dual-channel DDR3-1333 or DDR3-1600 for typical desktop CPUs like SnB or Haswell).

A typical 3GHz desktop Intel CPU can sustain a load bandwidth of something like ~8B / cycle from DRAM, in theory. (e.g. 25.6GB/s theoretical max memory BW for an i5-4670 with dual channel DDR3-1600). Achieving this in an actual single thread might not work, esp. when using integer 4B or 8B loads. For a slower CPU like a 2267MHz Nehalem Xeon, with quad-channel (but also slower) memory, 16B per clock is probably pushing the upper limits.

I had a look at the asm from the original unchanged code with gcc 4.7.3 on godbolt.

The stand-alone version looks fine (but the inlined version isn't), see below!), with the loop being

## float __vector Sum(...) non-inlined version
.L3:
        xorps   xmm0, XMMWORD PTR [rdi]
        add     rdi, 16
        cmp     rdi, rax
        jne     .L3

That's 3 fused-domain uops, and should issue and execute at one iteration per clock. Actually, it can't because xorps and fused compare-and-branch both need port5.

N is huge, so the overhead of the clunky char-at-a-time horizontal XOR doesn't come into play, even though gcc 4.7 emits abysmal code for it (multiple copies of sumVV stored to the stack, etc. etc.). (See Fastest way to do horizontal float vector sum on x86 for ways to reduce down to 4B with SIMD. It might be faster to then movd the data into integer regs and use integer shift/xor there for the last 4B -> 1B, esp. if you're not using AVX. The compiler might be able to take advantage of al/ah low and high 8bit component regs.)

The vector loop was inlined stupidly:

## float __vector Sum(...) inlined into main at -O3
.L12:
        xorps   xmm0, XMMWORD PTR [rdx]
        add     rdx, 16
        cmp     rdx, rbx
        movaps  XMMWORD PTR [rsp+64], xmm0
        jne     .L12

It's storing the accumulator every iteration, instead of just after the last iteration! Since gcc doesn't / didn't default to optimizing for macro-fusion, it didn't even put the cmp/jne next to each other where they can fuse into a single uop on Intel and AMD CPUs, so the loop has 5 fused-domain uops. This means it can only issue at one per 2 clocks, if the Nehalem frontend / loop buffer is anything like the Sandybridge loop buffer. uops issue in groups of 4, and a predicted-taken branch ends an issue block. So it issues in a 4/1/4/1 uop pattern, not 4/4/4/4. This means we can get at best one 16B load per 2 clocks of sustained throughput.

-mtune=core2 might double the throughput, because it puts the cmp/jne together. The store can micro-fuse into a single uop, and so can the xorps with a memory source operand. A gcc that old doesn't support -mtune=nehalem, or the more generic -mtune=intel. Nehalem can sustain one load and one store per clock, but obviously it would be far better not to have a store in the loop at all.

Compiling with `-O2` makes even worse code with that gcc version:

The inlined inner loop now loads the accumulator from memory as well as storing it, so there's a store-forwarding round trip in the loop-carried dependency that the accumulator is part of:

## float __vector Sum(...) inlined at -O2
.L14:
        movaps  xmm0, XMMWORD PTR [rsp+16]   # reload sum
        xorps   xmm0, XMMWORD PTR [rdx]      # load data[i]
        add     rdx, 16
        cmp     rdx, rbx
        movaps  XMMWORD PTR [rsp+16], xmm0   # spill sum
        jne     .L14

At least with -O2 the horizontal byte-xor compiles to just a plain integer byte loop without spewing 15 copies copies of xmm0 onto the stack.

This is just totally braindead code, because we haven't let a reference / pointer to sumVV escape the function, so there are no other threads that could be observing the accumulator in progress. (And even if so, there's no synchronization stopping gcc from just accumulating in a reg and storing the final result). The non-inlined version is still fine.

That massive performance bug is still present all the way up to gcc 4.9.2, with -O2 -fno-tree-vectorize, even when I rename the function from main to something else, so it gets the full benefit of gcc's optimization efforts. (Don't put microbenchmarks inside main, because gcc marks it as "cold" and optimizes less.)

gcc 5.1 makes good code for the inlined version of template<> __m128 Sum(const __m128* data, const int N). I didn't check with clang.

This extra loop-carried dep chain is almost certainly why the vector version has a smaller speedup with -O2. i.e. it's a compiler bug that's fixed in gcc5.

The scalar version with -O2 is

.L12:
        xor     bpl, BYTE PTR [rdx]       # sumS, MEM[base: D.27594_156, offset: 0B]
        add     rdx, 1    # ivtmp.135,
        cmp     rdx, rbx  # ivtmp.135, D.27613
        jne     .L12      #,

so it's basically optimal. Nehalem can only sustain one load per clock, so there's no need to use more accumulators.

The int version is

.L18:
        xor     ecx, DWORD PTR [rdx]      # sum, MEM[base: D.27549_296, offset: 0B]
        add     rdx, 4    # ivtmp.135,
        cmp     rbx, rdx  # D.27613, ivtmp.135
        jne     .L18      #,

so again, it's what you'd expect. It should be sustaining on load per clock.

For uarches that can sustain two loads per clock (Intel SnB-family, and AMD), you should be using two accumulators. compiler-implemented -funroll-loops usually just reduces loop overhead without introducing multiple accumulators. :(

You want the compiler to make code like:

        xorps   xmm0, xmm0
        xorps   xmm1, xmm1
.Lunrolled:
        pxor    xmm0, XMMWORD PTR [rdi]
        pxor    xmm1, XMMWORD PTR [rdi+16]
        pxor    xmm0, XMMWORD PTR [rdi+32]
        pxor    xmm1, XMMWORD PTR [rdi+48]
        add     rdi, 64
        cmp     rdi, rax
        jb  .Lunrolled

        pxor    xmm0, xmm1

        # horizontal xor of xmm0
        movhlps xmm1, xmm0
        pxor    xmm0, xmm1
        ...

Urolling by two (pxor / pxor / add / cmp/jne) would make a loop that can issue at one iteration per 1c, but requires four ALU execution ports. Only Haswell and later can keep up with that throughput. (Or AMD Bulldozer-family, because vector and integer instructions don't compete for execution ports, but conversely there are only two integer ALU pipes, so they only max out their instruction throughput with mixed code.)

This unroll by four is 6 fused-domain uops in the loop, so it can easily issue at one per 2c, and SnB/IvB can keep up with three ALU uops per clock.

Note that on Intel Nehalem through Broadwell, pxor (_mm_xor_si128) has better throughput than xorps (_mm_xor_ps), because it can run on more execution ports. If you're using AVX but not AVX2, it can make sense to use 256b _mm256_xor_ps instead of _mm_xor_si128, because _mm256_xor_si256 requires AVX2.

If it's not memory bandwidth, why is it only 12.6x speedup?

Nehalem's loop buffer (aka Loop Stream Decoder or LSD) has a "one clock delay" (according to Agner Fog's microarch pdf), so a loop with N uops will take ceil(N/4.0) + 1 cycles to issue out of the loop buffer if I understand him correctly. He doesn't explicitly say what happens to the last group of uops if there are less than 4, but SnB-family CPUs work this way (divide by 4 and round up). They can't issue uops from the next iteration following the taken branch. I tried to google about nehalem, but couldn't find anything useful.

So the char and int loops are presumably running at one load & xor per 2 clocks (since they're 3 fused-domain uops). Loop unrolling could ~double their throughput up to the point where they saturate the load port. SnB-family CPUs don't have that one clock delay, so they can run tiny loops at one clock per iteration.

Using perf counters or at least microbenchmarks to make sure that your absolute throughput is what you expect is a good idea. With just your relative measurements, you have no indication without this kind of analysis that you're leaving half your performance on the table.

The vector -O3 loop is 5 fused-domain uops, so it should be taking three clock cycles to issue. Doing 16x as much work, but taking 3 cycles per iteration instead of 2 would give us a speedup of 16 * 2/3 = 10.66. We're actually getting somewhat better than that, which I don't understand.

I'm going to stop here, instead of digging out a nehalem laptop and running actual benchmarks, since Nehalem is too old to be interesting to tune for at this level of detail.

Did you maybe compile with -mtune=core2? Or maybe your gcc had a different default tune setting, and didn't split up the compare-and-branch? In that case, the frontend probably wasn't the bottleneck, and throughput was maybe slightly limited by memory bandwidth, or by memory false dependencies:

Core 2 and Nehalem both have a false dependence between memory addresses with the same set and offset, i.e. with a distance that is a multiple of 4 kB.

This might cause a short bubble in the pipeline every 4k.

Before I checked on Nehalem's loop buffer and found the extra 1c per loop, I had a theory which I'm now confident is incorrect:

I thought the extra store uop in the loop that bumps it up over 4 uops would essentially cut the speed in half, so you'd see a speedup of ~6. However, maybe there are some execution bottlenecks that make the frontend issue throughput not the bottleneck after all?

Or maybe Nehalem's loop buffer is different from SnB's, and doesn't end an issue group at a predicted-taken branch. This would give a thoughput speedup of 16 * 4/5 = 12.8, for the -O3 vector loop, if it's 5 fused-domain uops can issue at a consistent 4 per clock. This matches the experimental data of 12.6429 speedup factor very well: slightly less than 12.8 is to be expected because of increased bandwidth requirements (occasional cache miss stalls when the prefetcher falls behind).

(The scalar loops still just run one iteration per clock: issuing more than one iteration per clock just means they bottleneck on one load per clock, and the 1 cycle xor loop-carried dependency.)

This can't be right because xorps in Nehalem can only run on port5, same as a fused compare-and-branch. So there's no way the non-unrolled vector loop could be running at more than one iteration per 2 cycles.

According to Agner Fog's tables, conditional branches have a throughput of one per 2c on Nehalem, further confirming that this is a bogus theory.

Brunell answered 17/3, 2016 at 20:43 Comment(1)

Correction/update: Is performance reduced when executing loops whose uop count is not a multiple of processor width? shows that later Sandybridge-family CPUs (at least from Haswell onward) can run a 5 uop loop at closer to 1.25 cycles per iteration, much better than 2 on Sandybridge. SO won't let me edit without updating the goo.gl shortlink to godbolt.org – Brunell 21/2, 2023 at 2:9

SSE2 is optimal when operating on completely parallel data. e.g.

for (int i = 0 ; i < N ; ++i)
    z[i] = _mm_xor_ps(x[i], y[i]);

But in your case, each iteration of the loop depends upon the output of the previous iteration. This is known as a dependency chain. In short, it means that each consecutive xor is going to have to wait for the entire latency of the previous one before it can continue so it lowers the throughput.

Guib answered 29/4, 2014 at 9:33 Comment(12)

I don't see your point. What is not optimal about waiting for the next iteration? How is it related to the parallelization? I compare with exactly the same scalar loop. Doesn't it have to wait the entire latency for the next iteration as well? – Parkin 29/4, 2014 at 9:37

The latency of the xor instruction is 1 cpu clock cycle whereas the latency of xorps is 4 clock cycles. – Guib 29/4, 2014 at 9:43

So he probably should unroll 4 times and have 4 aggregation values instead of one. No need to a result array. – Balmuth 29/4, 2014 at 9:47

@jaket, could you point me to some source of this information may be? So I can't understand it better. Thank you. – Parkin 29/4, 2014 at 9:47

@usr, you mean 64 aggregation values instead of 16, which I have now? – Parkin 29/4, 2014 at 9:48

@Parkin - Agner Fog is the man. agner.org/optimize/optimizing_assembly.pdf - Chapter 9 – Guib 29/4, 2014 at 9:50

@Parkin yeah I was speaking in terms of SSE values (__m128). – Balmuth 29/4, 2014 at 9:51

@Parkin My guess is that you'd get a big performance boost by unrolling your loop to have 4 intermediate sums and then xor'ing them at the end. – Guib 29/4, 2014 at 9:51

@Guib this also looks like a dependency chain to me but all xor instructions have latency 1 (software.intel.com/sites/landingpage/IntrinsicsGuide). Can it still be a dependency chain given this information? – Balmuth 29/4, 2014 at 9:53

@usr. You're right, I was looking at an older processor. The point still stands. With a latency of 1 and a throughput of 0.33 the will have to wait for the latency when there is a dependency chain so there is a 3x slowdown. – Guib 29/4, 2014 at 9:56

@jaket, please see the update of the post. I have improvement of course, but improvement in scalar is bigger than in vector version, so it decreases the speed up. – Parkin 29/4, 2014 at 10:19

@klm: One of the major factors here is a compiler bug (a store instruction in the vector loop). Nehalem likes a bit of loop unrolling, so that's why the updated version does better. However, it would be much simpler to have two or more separate loop-carried dependency chains, instead of having each loop iteration reduce a bunch of vectors down to a single combined vector and xor that into a single accumulator. It does introduce more parallelism, but all those fresh dep chains start with a mov load, not a xor with a memory operand. – Brunell 17/3, 2016 at 20:51

jaket has already explained the likely problem: a dependency chain. I'll give it a try:

template<>
__m128 Sum(const __m128* data, const int N)
{
    __m128 sum1 = _mm_set_ps1(0);
    __m128 sum2 = _mm_set_ps1(0);
    for (int i = 0; i < N; i += 2) {
        sum1 = _mm_xor_ps(sum1, data[i + 0]);
        sum2 = _mm_xor_ps(sum2, data[i + 1]);
    }
    return _mm_xor_ps(sum1, sum2);
}

Now there are no dependencies at all between the two lanes. Try expanding this to more lanes (e.g. 4).

You could also try using the integer version of these instructions (using __m128i). I do not understand the difference so this is just a hint.

Balmuth answered 29/4, 2014 at 10:25 Comment(3)

That might help work around the gcc4 -O2 optimization bug, but otherwise doesn't explain it. _mm_xor_ps is 1c latency. Unrolling will also help (esp. on Nehalem) even without multiple accumulators. Using _mm_xor_si128 with two accumulators should make much better code for later CPUs, though, in theory sustaining two 16B xors per clock. See my answer – Brunell 17/3, 2016 at 20:47

Shouldn't the dependency chain destroy ILP between xor instructions? Normally, multiple of them could be operating concurrently. And isn't that exactly what your answer calls "loop-carried dep chain"? I do like the memory bandwidth analysis and everything else from your answer, though. A +1 is too little for this expert answer. @PeterCordes – Balmuth 18/3, 2016 at 20:18

You get the ILP from having multiple independent dependency chains, each with their own accumulator. This is exactly what your answer does with sum1 and sum2. A sum1 ^= data[i] can be in flight at the same time as a sum2 ^= data[i+1]. This technique is more typically useful with something like FMA, which on Haswell has 5c latency and one per 0.5c throughput, so you need 10 accumulators to keep 10 FMAs in flight, if you're doing a reduction (or anything else with inter-iteration dependencies, as long as the operations are associative so the final answer is the same after reordering) – Brunell 18/3, 2016 at 21:25

In fact, the gcc compiler is optimized for SIMD. It explains why when you used -O2 the perf decreases significantly. You can re-check with -O1.

Contrasty answered 17/3, 2016 at 16:17 Comment(2)

"I use -fno-tree-vectorize to prevent auto vectorization" – Parkin 17/3, 2016 at 17:0

GCC is not optimized for SIMD with dependency chains. Dependency chains is one of the main cases where unrolling with intrinsics is useful with GCC. Clang unrolls four times, ICC unrolls twice (in general but in some cases I have seen more). MSVC may unroll twice but I don't recall. – Trifling 18/3, 2016 at 12:13

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++