As stated in the comments, the fastest code very likely uses scalar operations, doing everything in the integer registers. All you need to do is extract the four packed 64-bit integers, then you have three XOR
instructions, and you're done. This can be done pretty efficiently, and it leaves the result in an integer register, which is what your sample code suggests that you would want.
MSVC already generates pretty good code for the scalar function that you show as an example in the question:
inline uint64_t HorizontalXor(__m256i t) {
    return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
Assuming that t is in ymm1, the resulting disassembly will be something like this:
vextractf128 xmm0, ymm1, 1
vpextrq rax, xmm0, 1
vmovq rcx, xmm1
xor rax, rcx
vpextrq rcx, xmm1, 1
vextractf128 xmm0, ymm1, 1
xor rax, rcx
vmovq rcx, xmm0
xor rax, rcx
…with the result left in RAX. If this accurately reflects what you need (a scalar uint64_t result), then this code would be sufficient.
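Note that the m256i_u64 member is a Microsoft-specific extension, so the function above will not compile on GCC or Clang. If you need something portable, one minimal sketch (my own, not code from the question) is to spill the vector to a temporary array and XOR the elements; an optimizing compiler will typically turn the store into much the same sequence of extracts and moves:
#include <cstdint>
#include <immintrin.h>

// Portable fallback: store the vector to an aligned temporary and XOR the lanes.
inline uint64_t HorizontalXorPortable(__m256i t)
{
    alignas(32) uint64_t lanes[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), t);
    return lanes[0] ^ lanes[1] ^ lanes[2] ^ lanes[3];
}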
You can slightly improve on the original version by using intrinsics (which also avoids the MSVC-specific member access):
inline uint64_t _mm256_hxor_epu64(__m256i x)
{
    const __m128i temp = _mm256_extracti128_si256(x, 1);
    return (uint64_t&)x
         ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
         ^ (uint64_t&)(temp)
         ^ (uint64_t)(_mm_extract_epi64(temp, 1));
}
Then you'll get the following disassembly (again, assuming that x is in ymm1):
vextracti128 xmm2, ymm1, 1
vpextrq rcx, xmm2, 1
vpextrq rax, xmm1, 1
xor rax, rcx
vmovq rcx, xmm1
xor rax, rcx
vmovq rcx, xmm2
xor rax, rcx
Notice that we were able to elide one extraction instruction, and that we've ensured VEXTRACTI128 was used instead of VEXTRACTF128 (although this choice probably does not matter).
You'll see similar output on other compilers. For example, here's GCC 7.1 (with x assumed to be in ymm0):
vextracti128 xmm2, ymm0, 0x1
vpextrq rax, xmm0, 1
vmovq rdx, xmm2
vpextrq rcx, xmm2, 1
xor rax, rdx
vmovq rdx, xmm0
xor rax, rdx
xor rax, rcx
The same instructions are there, but they have been slightly reordered; the intrinsics leave the compiler's scheduler free to order them as it deems best. Clang 4.0 schedules them differently still:
vmovq rax, xmm0
vpextrq rcx, xmm0, 1
xor rcx, rax
vextracti128 xmm0, ymm0, 1
vmovq rdx, xmm0
xor rdx, rcx
vpextrq rax, xmm0, 1
xor rax, rdx
And, of course, this ordering is always subject to change when the code gets inlined.
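For completeness, here is one way the scalar helper might be exercised; this is a quick sanity check of my own, using made-up values rather than anything from the question:
#include <cstdint>
#include <cstdio>
#include <immintrin.h>

// Assumes _mm256_hxor_epu64 from above is in scope.
int main()
{
    // _mm256_set_epi64x takes its arguments from the highest lane down to the lowest.
    const __m256i v = _mm256_set_epi64x(0x4444444444444444LL,
                                        0x3333333333333333LL,
                                        0x2222222222222222LL,
                                        0x1111111111111111LL);
    const uint64_t result = _mm256_hxor_epu64(v);
    // 0x11.. ^ 0x22.. ^ 0x33.. ^ 0x44.. == 0x4444444444444444
    std::printf("%016llx\n", static_cast<unsigned long long>(result));
}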
On the other hand, if you want the result to be in an AVX register, then you first need to decide how you want it to be stored. I guess you would just broadcast the single 64-bit result across the register, something like:
inline __m256i _mm256_hxor(__m256i x)
{
    const __m128i temp = _mm256_extracti128_si256(x, 1);
    return _mm256_set1_epi64x((uint64_t&)x
                            ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
                            ^ (uint64_t&)(temp)
                            ^ (uint64_t)(_mm_extract_epi64(temp, 1)));
}
But now you're doing a lot of data shuffling, negating any performance boost that you would possibly see from vectorizing the code.
Speaking of which, I'm not really sure how you got yourself into a situation where you need to do horizontal operations like this in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. In particular, you should be generating the 4 integer values in 4 different AVX registers, rather than packing them all into one.
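For example, a common vertical pattern (a sketch of my own, not code from the question) is to XOR-accumulate an entire array lane-wise in a single vector register, so that the horizontal reduction is paid only once, after the loop:
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// XOR-reduce an array of 64-bit integers. The loop is purely vertical:
// each lane of 'acc' accumulates every fourth element independently, and
// the horizontal XOR happens exactly once, after the loop.
// Assumes 'count' is a multiple of 4, to keep the sketch short.
inline uint64_t XorReduce(const uint64_t* data, size_t count)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < count; i += 4)
    {
        const __m256i v = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(data + i));
        acc = _mm256_xor_si256(acc, v);
    }
    return _mm256_hxor_epu64(acc);   // helper defined earlier in this answer
}
This keeps the shuffles and extracts out of the hot loop entirely.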
If you actually want 4 copies of the result packed into an AVX register, then you could do something like this:
inline __m256i _mm256_hxor(__m256i x)
{
    const __m256i temp = _mm256_xor_si256(x,
                             _mm256_permute2f128_si256(x, x, 1));
    return _mm256_xor_si256(temp,
               _mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}
This still exploits a bit of parallelism by doing two XORs at once, meaning that only two XOR operations are required in all, instead of three.
If it helps to visualize it, this basically does:
   A         B         C         D        ⟵ input
  XOR       XOR       XOR       XOR
   C         D         A         B        ⟵ permuted input
=========================================
  A^C       B^D       A^C       B^D       ⟵ intermediate result
  XOR       XOR       XOR       XOR
  B^D       A^C       B^D       A^C       ⟵ shuffled intermediate result
=========================================
A^C^B^D   A^C^B^D   A^C^B^D   A^C^B^D     ⟵ final result
On practically all compilers, these intrinsics will produce the following assembly code:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
(I had come up with this on my way to bed after first posting this answer, and planned to come back and update the answer, but I see that wim beat me to the punch on posting it. Oh well, it's still a better approach than what I first had, so it still merits being included here.)
And, of course, if you wanted this in an integer register, you would just need a simple VMOVQ:
vperm2f128 ymm0, ymm1, ymm1, 1 ; input is in YMM1
vpxor ymm2, ymm0, ymm1
vpshufd ymm1, ymm2, 78
vpxor ymm0, ymm1, ymm2
vmovq rax, xmm0
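In intrinsics, that final move is just a cast down to the low 128-bit lane followed by _mm_cvtsi128_si64. A small sketch of my own, layered on the _mm256_hxor helper above:
// Extract the (broadcast) result into a scalar; the cast is free, the cvt is a VMOVQ.
inline uint64_t _mm256_hxor_scalar(__m256i x)
{
    const __m256i packed = _mm256_hxor(x);   // every 64-bit lane holds the result
    return (uint64_t)_mm_cvtsi128_si64(_mm256_castsi256_si128(packed));
}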
The question is whether this would be faster than the scalar code above, and the answer is: yes, probably. Although you are doing the XORs in the AVX execution units rather than in the completely separate integer execution units, there are fewer AVX shuffles/permutes/extracts that need to be done, which means less overhead. So I might have to eat my words about scalar code being the fastest implementation. But it really depends on what you're doing and how the instructions can be scheduled/interleaved.