Horizontal XOR in AVX

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register?

The goal is to get the XOR of all 4 64-bit components of an AVX register. It would essentially be doing the same thing as a horizontal add (_mm256_hadd_epi32()), except that I want to XOR instead of ADD.

The scalar code is:

inline uint64_t HorizontalXor(__m256i t) {
  return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
Oviposit asked 5/7, 2017 at 21:0 Comment(11)
This might be helpful #42041437Cogitable
Nothing built-in, it's easy to implement manually.Grosz
It would probably be faster to do this using non-SIMD instructions. You need three XORs and you're done. (Especially if you want the result in an integer register anyway, which is what the code sample implies.)Showoff
@CodyGray , so is this code good as is? Or can it be faster with some get/extract instructions on the YMM register containing the t parameter?Oviposit
Well, how good the code is depends on which compiler you're using. :-) I'm assuming that the use of m256i_u64 means MSVC? (This doesn't compile in GCC or Clang, AFAIK.) And the output in MSVC looks pretty good. Pretty hard to imagine that you could beat a few extracts and moves. Have you profiled that this is actually a bottleneck?Showoff
@CodyGray , yes, it's MSVC++2017. I'm still far from the profiling phase, deep in implementation. But this horizontal XOR is at the heart of a random number generator, so it's expected to be a bottleneck in some use cases.Oviposit
I'm not really sure how you got yourself into a situation where you need to do horizontal operations in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. Generate the 4 random numbers in 4 different AVX registers.Showoff
@CodyGray, indeed, that's a great idea, thanks! Still an answer to this question may be useful for someone, I think.Oviposit
You've been asking a bunch of good but small x86 questions these days... Clearly you're working on something bigger. It's like a version of an X-Y problem. Maybe you could show us the bigger picture and we can contribute?Underpay
Is your t.m256i_u64[0] etc actually portable? Looks very much like a compiler-specific extension to me. Which compiler?Koons
@IwillnotexistIdonotexist , thanks, I've pushed what I'm doing to github.com/srogatch/ProbQA . It has a large cube at its heart: nAnswers * nQuestions * nTargets, plus a few lower-dimensional arrays containing aggregates. I'm currently implementing the CPU engine for it (well, it's an x86_64 engine only, but I don't plan for e.g. ARM yet, and a supercomputer engine would have its own name), but CUDA and network grid engines are also planned. Mathematically it's based on Bayes' formula and the naive Bayes assumption.Oviposit

As stated in the comments, the fastest code very likely uses scalar operations, doing everything in the integer registers. All you need to do is extract the four packed 64-bit integers, then you have three XOR instructions, and you're done. This can be done pretty efficiently, and it leaves the result in an integer register, which is what your sample code suggests that you would want.

MSVC already generates pretty good code for the scalar function that you show as an example in the question:

inline uint64_t HorizontalXor(__m256i t) {
  return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}

Assuming that t is in ymm1, the resulting disassembly will be something like this:

vextractf128 xmm0, ymm1, 1
vpextrq      rax,  xmm0, 1
vmovq        rcx,  xmm1
xor          rax,  rcx
vpextrq      rcx,  xmm1, 1
vextractf128 xmm0, ymm1, 1
xor          rax,  rcx
vmovq        rcx,  xmm0
xor          rax,  rcx

…with the result left in RAX. If this accurately reflects what you need (a scalar uint64_t result), then this code would be sufficient.

You can slightly improve it by using intrinsics:

inline uint64_t _mm256_hxor_epu64(__m256i x)
{
   const __m128i temp = _mm256_extracti128_si256(x, 1);
   return (uint64_t&)x
          ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
          ^ (uint64_t&)(temp)
          ^ (uint64_t)(_mm_extract_epi64(temp, 1));
}

Then you'll get the following disassembly (again, assuming that x is in ymm1):

vextracti128 xmm2, ymm1, 1
vpextrq      rcx,  xmm2, 1
vpextrq      rax,  xmm1, 1
xor          rax,  rcx
vmovq        rcx,  xmm1
xor          rax,  rcx
vmovq        rcx,  xmm2
xor          rax,  rcx

Notice that we were able to elide one extraction instruction, and that we've ensured VEXTRACTI128 was used instead of VEXTRACTF128 (although this choice probably does not matter).

You'll see similar output on other compilers. For example, here's GCC 7.1 (with x assumed to be in ymm0):

vextracti128 xmm2, ymm0, 0x1
vpextrq      rax,  xmm0, 1
vmovq        rdx,  xmm2
vpextrq      rcx,  xmm2, 1
xor          rax,  rdx
vmovq        rdx,  xmm0
xor          rax,  rdx
xor          rax,  rcx

The same instructions are there, but they've been slightly reordered. The intrinsics allow the compiler's scheduler to order as it deems best. Clang 4.0 schedules them differently yet:

vmovq        rax,  xmm0
vpextrq      rcx,  xmm0, 1
xor          rcx,  rax
vextracti128 xmm0, ymm0, 1
vmovq        rdx,  xmm0
xor          rdx,  rcx
vpextrq      rax,  xmm0, 1
xor          rax,  rdx

And, of course, this ordering is always subject to change when the code gets inlined.
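Incidentally, if you would rather avoid the (uint64_t&) reference casts (which are essentially type punning and read a bit awkwardly outside of MSVC), the same reduction can be written entirely with extraction intrinsics. This is just a sketch of an equivalent formulation (the function name is only for illustration); it should compile to much the same extract/XOR sequence:

#include <stdint.h>
#include <immintrin.h>

inline uint64_t HorizontalXorIntrinsics(__m256i x)
{
   const __m128i lo = _mm256_castsi256_si128(x);       // elements 0 and 1
   const __m128i hi = _mm256_extracti128_si256(x, 1);  // elements 2 and 3
   return (uint64_t)_mm_cvtsi128_si64(lo)              // element 0
        ^ (uint64_t)_mm_extract_epi64(lo, 1)           // element 1
        ^ (uint64_t)_mm_cvtsi128_si64(hi)              // element 2
        ^ (uint64_t)_mm_extract_epi64(hi, 1);          // element 3
}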


On the other hand, if you want the result to be in an AVX register, then you first need to decide how you want it to be stored. I guess you would just compute the single 64-bit result as above and broadcast it to all four elements, something like:

inline __m256i _mm256_hxor(__m256i x)
{
   const __m128i temp = _mm256_extracti128_si256(x, 1);
   return _mm256_set1_epi64x((uint64_t&)x
                             ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
                             ^ (uint64_t&)(temp)
                             ^ (uint64_t)(_mm_extract_epi64(temp, 1)));
}

But now you're doing a lot of data shuffling, negating any performance boost that you would possibly see from vectorizing the code.

Speaking of which, I'm not really sure how you got yourself into a situation where you need to do horizontal operations like this in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. In particular, you should be generating the 4 integer values in 4 different AVX registers, rather than packing them all into one.
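To make the vertical layout concrete, here is a purely illustrative sketch (it uses the classic xorshift64 recurrence, not your actual generator) where four independent streams live in the four 64-bit lanes, so no horizontal XOR is ever needed:

#include <immintrin.h>

// Illustrative only: four independent xorshift64 states, one per 64-bit lane.
// Every call advances all four streams at once; lanes must be seeded nonzero.
inline __m256i XorShift64x4(__m256i state)
{
   state = _mm256_xor_si256(state, _mm256_slli_epi64(state, 13));
   state = _mm256_xor_si256(state, _mm256_srli_epi64(state, 7));
   state = _mm256_xor_si256(state, _mm256_slli_epi64(state, 17));
   return state;   // four fresh 64-bit outputs, no cross-lane traffic
}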

If you actually want 4 copies of the result packed into an AVX register, then you could do something like this:

inline __m256i _mm256_hxor(__m256i x)
{
   const __m256i temp = _mm256_xor_si256(x,
                                         _mm256_permute2f128_si256(x, x, 1));    
   return _mm256_xor_si256(temp,
                           _mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}

This still exploits a bit of parallelism by doing two XORs at once, meaning that only two XOR operations are required in all, instead of three.

If it helps to visualize it, this basically does:

   A         B         C         D           ⟵ input
  XOR       XOR       XOR       XOR
   C         D         A         B           ⟵ permuted input
=====================================
  A^C       B^D       A^C        B^D         ⟵ intermediate result
  XOR       XOR       XOR        XOR
  B^D       A^C       B^D        A^C         ⟵ shuffled intermediate result
======================================
A^C^B^D   A^C^B^D   A^C^B^D    A^C^B^D      ⟵ final result

On practically all compilers, these intrinsics will produce the following assembly code:

vperm2f128  ymm0, ymm1, ymm1, 1    ; input is in YMM1
vpxor       ymm2, ymm0, ymm1
vpshufd     ymm1, ymm2, 78
vpxor       ymm0, ymm1, ymm2

(I had come up with this on my way to bed after first posting this answer, and planned to come back and update the answer, but I see that wim beat me to the punch on posting it. Oh well, it's still a better approach than what I first had, so it still merits being included here.)

And, of course, if you wanted this in an integer register, you would just need a simple VMOVQ:

vperm2f128  ymm0, ymm1, ymm1, 1    ; input is in YMM1
vpxor       ymm2, ymm0, ymm1
vpshufd     ymm1, ymm2, 78
vpxor       ymm0, ymm1, ymm2
vmovq       rax,  xmm0

The question is, would this be faster than the scalar code above. And the answer is, yes, probably. Although you are doing the XORs using the AVX execution units, instead of the completely separate integer execution units, there are fewer AVX shuffles/permutes/extracts that need to be done, which means less overhead. So I might also have to eat my words on scalar code being the fastest implementation. But it really depends on what you're doing and how the instructions can be scheduled/interleaved.

Showoff answered 6/7, 2017 at 15:44 Comment(5)
Good visual explanation about the XORs!Lithe
For swapping the two lanes of a ymm register, vpermq should be preferred over vperm2i128. It only has one input, which makes it much faster on Ryzen and KNL. They're the same performance on Intel Haswell/Skylake.Rue
Of course, vextracti128 is even better on Ryzen, and 128b operations are only a single uop. If you don't need the result broadcast to all elements, narrowing down to 128b as early as possible is a good strategy for horizontal ops in general, including this one. But vpextrq is relatively expensive in uop count, so it does make sense to shuffle/xor down to a scalar in the bottom of an xmm register, then use one vmovq (to an integer register or to memory). The same applies to other horizontal ops, including integer sums.Rue
completely separate integer execution units: they're on the same ports as the vector execution units in Intel CPUs, except for Haswell and later's port6 which has integer ALUs (and the taken-branch unit) but no vector execution units. So there's only a tiny amount of extra ALU throughput to be gained from mixing in scalar instructions, but it costs a lot of front-end throughput and p0 / p5 uops to get the data to it. (Front-end throughput is a problem for this idea on AMD, too, even though the integer and vector uops use different pipes).Rue
Worth considering: an in-lane vpshufd, vpxor ymm, vextracti128, 2x vmovq, 2x scalar xor. 7 uops total on Intel, and the first vmovq can execute while the 2nd is waiting for the vextracti128 result. On Intel, latency is no better than your final sequence, but it costs more total uops (which need to run in parallel for latency to not be worse). So it can't overlap as well with surrounding code.Rue
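In intrinsics, the single-input lane swap suggested in the comments above (vpermq) is _mm256_permute4x64_epi64. A sketch of that substitution into the broadcast version (the function name is illustrative; the same _MM_SHUFFLE(1, 0, 3, 2) immediate is reused for both shuffles):

#include <immintrin.h>

inline __m256i _mm256_hxor_vpermq(__m256i x)
{
   // vpermq swaps the two 128-bit halves using only one source register
   const __m256i swapped = _mm256_permute4x64_epi64(x, _MM_SHUFFLE(1, 0, 3, 2));
   const __m256i temp = _mm256_xor_si256(x, swapped);
   // then swap the 64-bit halves within each lane and XOR again
   return _mm256_xor_si256(temp, _mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}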

Vectorization is likely to be useful if the input of the horizontal XOR function is already in an AVX register, i.e. your t is the result of some SIMD computation. Otherwise, scalar code is likely to be faster, as already mentioned by @Cody Gray. Often you can do horizontal SIMD operations in about log_2(SIMD_width) 'steps'. In this case one step is a 'shuffle/permute' plus an 'xor'. This is slightly more efficient than @Cody Gray's _mm256_hxor function:

__m256i _mm256_hxor_v2(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);       // swap the high and low 128-bit lanes
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);    // swap the 64-bit halves within each 128-bit lane
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return x3;
}

This compiles to:

vperm2i128  $1, %ymm0, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0


If you want the result in a scalar register:

uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3));
}

Which compiles to:

vperm2i128  $1, %ymm0, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vmovq   %xmm0, %rax


Full test code:

#include <stdio.h>
#include <x86intrin.h>
#include <stdint.h>
/*  gcc -O3 -Wall -m64 -march=broadwell hor_xor.c   */
int print_vec_uint64(__m256i v);

__m256i _mm256_hxor_v2(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
/* Uncomment the next few lines to print the values of the intermediate variables */ 
/*
    printf("3...0        =          3          2          1          0\n");
    printf("x            = ");print_vec_uint64(x        );
    printf("x0           = ");print_vec_uint64(x0        );
    printf("x1           = ");print_vec_uint64(x1        );
    printf("x2           = ");print_vec_uint64(x2        );
    printf("x3           = ");print_vec_uint64(x3        );
*/
    return x3;
}

uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3));
}


int main() {
    __m256i x = _mm256_set_epi64x(0x7, 0x5, 0x2, 0xB);
//    __m256i x = _mm256_set_epi64x(4235566778345231, 1123312566778345423, 72345566778345673, 967856775433457);

    printf("x            = ");print_vec_uint64(x);

    __m256i y = _mm256_hxor_v2(x);

    printf("y            = ");print_vec_uint64(y);

    uint64_t z = _mm256_hxor_v2_uint64(x);

    printf("z =  %10lX  \n",z);

    return 0;
}


int print_vec_uint64(__m256i v){
    uint64_t t[4];
    _mm256_storeu_si256((__m256i *)t,v);
    printf("%10lX %10lX %10lX %10lX \n",t[3],t[2],t[1],t[0]);
    return 0;
}
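With the small constants in main (0x7, 0x5, 0x2 and 0xB, whose XOR is 0xB), the output should look roughly like this (exact column widths come from the %10lX format):

x            =          7          5          2          B
y            =          B          B          B          B
z =           B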
Lithe answered 6/7, 2017 at 22:40 Comment(6)
Indeed, my original solution was sub-optimal. I posted the answer right before turning in for the night, and then on my way to bed realized a better solution. Having come back to update, I see that you had already posted it. I went ahead and updated my answer for completeness, but have an upvote!Showoff
@CodyGray Loosely speaking, the SIMD complexity of 'simple' horizontal operations, such as horizontal sum, product, minimum, maximum, logical and, etc. is often O(log(n)) instead of O(n), where n is the number of elements in the SIMD register. Sometimes this is quite obvious, for example with the horizontal minimum. Sometimes it is less obvious, such as this one.Lithe
Most of my comments on Cody's update apply here too: reduce down to 128b as the first step (faster on Ryzen and Excavator), and avoid vperm2i128 when you don't need it. vextracti128 is excellent on Ryzen, and vpermq is better than vperm2?128 for swapping upper/lower lanes.Rue
When you do want a result broadcast to every element instead of reducing to 128 and then scalar, doing the in-lane shuffle first is probably slightly better, since the lower latency means more uops/instructions can execute (and retire) sooner, freeing up space in the reservation station and ROB. It's probably non-trivial to even construct an artificial test that could measure the difference, but I think it can't hurt. This also applies when reducing to scalar, but in that case staying 256b for longer means extra uops on AMD CPUs, so I'd recommend reducing to 128b first.Rue
@PeterCordes Thanks for your insightful comments! I didn't even think about Ryzen when I wrote my answer. I'll update my answer later on.Lithe
:) Even without Ryzen, presumably narrowing to 128b ASAP has energy/power advantages. Might be more relevant for FP add than for XOR, but still very small. Also, a speed advantage on e.g. Skylake if the CPU is still in AVX "warm-up" mode where the upper lane isn't active yet.Rue

An implementation of a direct analogue of _mm256_hadd_epi32() for XOR would look something like this:

#include <immintrin.h>

template<int imm> inline __m256i _mm256_shuffle_epi32(__m256i a, __m256i b)
{
    return _mm256_castps_si256(_mm256_shuffle_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), imm));
}

inline __m256i _mm256_hxor_epi32(__m256i a, __m256i b)
{
    return _mm256_xor_si256(_mm256_shuffle_epi32<0x88>(a, b), _mm256_shuffle_epi32<0xDD>(a, b));
}

int main()
{
    __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i c = _mm256_hxor_epi32(a, b);
    return 0;
}
Measure answered 6/7, 2017 at 7:24 Comment(2)
I've edited the question to clarify the goal. Please take a look. Also, I'm afraid code like the above is slower than just XORing the 64-bit components of the __m256i register: 4 components need only 3 scalar XOR operations.Oviposit
@SergeRogatch Could you write the scalar code that you want to optimize using AVX?Measure

Here are some examples of horizontal XOR on the 4 64-bit components of a YMM register.

The last one, hxor_epu64_v3c, is the most efficient for most cases: smallest code size and at least tied for fewest uops on most CPUs. And it narrows to 128-bit right away, which is good on Zen 1 and Intel E-cores where 256-bit instructions cost 2 uops. (https://uops.info/)

#include <stdint.h>
#include <immintrin.h>

uint64_t mm256_hxor_epu64_v1(__m256i a) {
    return _mm256_extract_epi64(a, 0) ^ _mm256_extract_epi64(a, 1) ^
           _mm256_extract_epi64(a, 2) ^ _mm256_extract_epi64(a, 3);
}

uint64_t mm256_hxor_epu64_v2a(__m256i a) {
    a = _mm256_xor_si256(a, _mm256_bsrli_epi128(a, 8));
    return _mm256_extract_epi64(a, 0) ^ _mm256_extract_epi64(a, 2);
}

uint64_t mm256_hxor_epu64_v2b(__m256i a) {
    a = _mm256_xor_si256(a, _mm256_unpackhi_epi64(a, a));
    return _mm256_extract_epi64(a, 0) ^ _mm256_extract_epi64(a, 2);
}

uint64_t mm256_hxor_epu64_v3a(__m256i a) {
    const __m128i tmp = _mm_xor_si128(_mm256_extracti128_si256(a, 0),
                                      _mm256_extracti128_si256(a, 1));
    return _mm_extract_epi64(_mm_xor_si128(tmp, _mm_bsrli_si128(tmp, 8)), 0);
}

uint64_t mm256_hxor_epu64_v3b(__m256i a) {
    const __m128i tmp = _mm_xor_si128(_mm256_extracti128_si256(a, 0),
                                      _mm256_extracti128_si256(a, 1));
    return _mm_extract_epi64(tmp, 1) ^ _mm_extract_epi64(tmp, 0);
}

uint64_t mm256_hxor_epu64_v3c(__m256i a) {
    const __m128i tmp = _mm_xor_si128(_mm256_extracti128_si256(a, 0),
                                      _mm256_extracti128_si256(a, 1));
    return _mm_extract_epi64(_mm_xor_si128(tmp, _mm_unpackhi_epi64(tmp, tmp)), 0);
}

godbolt

mm256_hxor_epu64_v1(long long __vector(4)):
        vmovq   rsi, xmm0
        vpextrq rax, xmm0, 1
        vextracti128    xmm0, ymm0, 0x1
        vmovq   rcx, xmm0
        xor     rax, rsi
        vpextrq rdx, xmm0, 1
        xor     rax, rcx
        xor     rax, rdx
        ret
mm256_hxor_epu64_v2a(long long __vector(4)):
        vmovdqa ymm1, ymm0
        vpsrldq ymm0, ymm0, 8
        vpxor   ymm0, ymm0, ymm1
        vmovq   rdx, xmm0
        vextracti128    xmm0, ymm0, 0x1
        vmovq   rax, xmm0
        xor     rax, rdx
        ret
mm256_hxor_epu64_v2b(long long __vector(4)):
        vpunpckhqdq     ymm1, ymm0, ymm0
        vpxor   ymm0, ymm0, ymm1
        vmovq   rdx, xmm0
        vextracti128    xmm0, ymm0, 0x1
        vmovq   rax, xmm0
        xor     rax, rdx
        ret
mm256_hxor_epu64_v3a(long long __vector(4)):
        vextracti128    xmm1, ymm0, 0x1
        vpxor   xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpxor   xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        ret
mm256_hxor_epu64_v3b(long long __vector(4)):
        vextracti128    xmm1, ymm0, 0x1
        vpxor   xmm0, xmm1, xmm0
        vpsrldq xmm1, xmm0, 8
        vpxor   xmm0, xmm1, xmm0
        vmovq   rax, xmm0
        ret
mm256_hxor_epu64_v3c(long long __vector(4)):
        vextracti128    xmm1, ymm0, 0x1
        vpxor   xmm0, xmm0, xmm1
        vpunpckhqdq     xmm1, xmm0, xmm0
        vpxor   xmm0, xmm1, xmm0
        vmovq   rax, xmm0
        ret

GCC is unfortunately emitting a useless vmovdqa ymm1, ymm0 as the first instruction of v2a instead of just writing the shift result into a different register. That might or might not happen when inlined into a caller, or with other GCC versions.

Cedar answered 4/6, 2024 at 2:21 Comment(6)
_mm256_hxor_epu64_v3 is clearly the most efficient: following the standard pattern for associative horizontal reductions (Fastest way to do horizontal SSE vector sum (or other reduction)) of extracting the high half to narrow the vector. As your asm listings show, _mm256_extract_epi64(a, 2) is not efficient: it costs a vextracti128 plus a vmovq, since unfortunately there is no vpextrq r64, ymm, imm8, only from an XMM. (But even that is 2 uops on Intel, so v1 isn't great either).Rue
IDK if it would ever be worth modifying v3 to use _mm_extract_epi64(tmp, 1) ^ _mm_extract_epi64(tmp, 0) (vpextrq + vmovq + scalar xor). That's the same instruction count but more uops than shuffle + vpxor + vmovq. If you can use the FLAGS result from the scalar xor (e.g. branching on whether it's zero or negative or something), it could be worth it.Rue
You could save a byte of code size in v3 with _mm_unpackhi_epi64(tmp,tmp) instead of _mm_bsrli_si128(tmp, 8). Both ways make a vector whose low element is the high half of the input, but vpunpckhqdq doesn't need an immediate operand. vpunpckhqdq can also run on more ports on Zen 4 (uops.info) than vpsrldq, equal on Intel Skylake / Ice Lake / Alder Lake.Rue
@PeterCordes I renamed v2 to v2a and created v2b (which uses unpackhi instead of bsrli). I renamed v3 to v3a and created v3b (which uses extract instead of bsrli) and created v3c (which uses unpackhi instead of bsrli). 3a and 3b appear identical in the asm.Cedar
GCC auto-vectorized the extract() ^ extract() into shuffle / vpxor / vmovq. It picked a slightly less efficient shuffle (vpsrldq instead of vpunpckhqdq) - it's one byte longer but IIRC runs the same on CPUs which support AVX (which this function requires). uops.info.Rue
Note that ^ on __m128i types is a GNU extension, not portable to MSVC. You're using that in most of your functions.Rue
