I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics.
Much to my disappointment, I discovered that the shift intrinsics _mm256_slli_si256 and _mm256_srli_si256 operate only on the two 128-bit halves of an AVX register separately, so zeroes are introduced in the middle of the register. (This is in contrast to _mm_slli_si128 and _mm_srli_si128, which handle whole SSE registers.)
Can you recommend a short substitute?
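To illustrate, a minimal sketch of the per-lane behaviour (the byte values are chosen arbitrarily):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t in[32], out[32];
    for (int i = 0; i < 32; ++i)
        in[i] = (uint8_t)i;                       /* bytes 0..31 */

    __m256i v = _mm256_loadu_si256((const __m256i *)in);

    /* Shift right by 4 bytes: each 128-bit lane is shifted independently,
       so zeroes show up in the middle of the register, not just at the top. */
    __m256i r = _mm256_srli_si256(v, 4);
    _mm256_storeu_si256((__m256i *)out, r);

    for (int i = 0; i < 32; ++i)
        printf("%d ", out[i]);
    printf("\n");
    /* Prints 4..15 0 0 0 0 20..31 0 0 0 0, whereas a whole-register shift
       would print 4..31 0 0 0 0. */
    return 0;
}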
UPDATE:
_mm256_slli_si256 by N bytes (0 < N < 16) is efficiently achieved with

_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), 16 - N)

and, for shifts of 16 <= N < 32 bytes, with

_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N - 16)
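Wrapped up as self-contained helpers (the macro names are my own; N must be a compile-time constant because it ends up as an immediate operand), this looks roughly like:

#include <immintrin.h>

/* Whole-register left shift by N bytes, for 0 < N < 16. */
#define MM256_SLLI_BYTES_SMALL(A, N)                                     \
    _mm256_alignr_epi8((A),                                              \
        _mm256_permute2x128_si256((A), (A), _MM_SHUFFLE(0, 0, 3, 0)),    \
        16 - (N))

/* Whole-register left shift by N bytes, for 16 <= N < 32. */
#define MM256_SLLI_BYTES_LARGE(A, N)                                     \
    _mm256_slli_si256(                                                   \
        _mm256_permute2x128_si256((A), (A), _MM_SHUFFLE(0, 0, 3, 0)),    \
        (N) - 16)

The permute copies the low 128-bit lane of A into the high lane and zeroes the low lane, so the alignr (or the per-lane slli) then shifts real data across the lane boundary and zeroes in at the bottom.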
But the question remains open for _mm256_srli_si256.
COMMENTS:

There is the _mm256_alignr_epi8 instruction. Unfortunately, there is no _mm256_alignl_epi8 counterpart. – Capua

An _mm256_alignl_epi8 would be redundant (which is why there is no instruction or intrinsic for this) - _mm256_alignr_epi8 works for both left and right shift cases (just switch the arguments and adjust the shift value). – Poul
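Following that comment, a sketch of the right-shift counterpart (the macro names are my own, and the permute immediate is my choice for bringing A's high lane down with zeroes above it):

#include <immintrin.h>

/* Whole-register right shift by N bytes, for 0 < N < 16. */
#define MM256_SRLI_BYTES_SMALL(A, N)                                     \
    _mm256_alignr_epi8(                                                  \
        _mm256_permute2x128_si256((A), (A), _MM_SHUFFLE(2, 0, 0, 1)),    \
        (A), (N))

/* Whole-register right shift by N bytes, for 16 <= N < 32. */
#define MM256_SRLI_BYTES_LARGE(A, N)                                     \
    _mm256_srli_si256(                                                   \
        _mm256_permute2x128_si256((A), (A), _MM_SHUFFLE(2, 0, 0, 1)),    \
        (N) - 16)

Here the permute places the high 128-bit lane of A in the low lane and zeroes the high lane, so the alignr pulls the upper bytes down across the lane boundary and zeroes in at the top.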