I have a large in-memory array, given as a pointer uint64_t * arr plus a size, which represents plain bits. I need to shift these bits to the right, as efficiently (performantly) as possible, by some amount from 0 to 63.
By shifting the whole array I mean not shifting each element independently (like a[i] <<= Shift), but shifting it as a single large bit vector. In other words, for each intermediate position i (except the first and last elements) I can do the following in a loop:
dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);
where w is a temporary variable holding the right-shifted value of the previous array element.
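For reference, here is a minimal self-contained scalar sketch of that carry loop (the function name is just illustrative), assuming Shift stays in 1..63 so that the 64 - Shift count is in range:

#include <cstdint>
#include <cstddef>

// Shift the whole bit vector left by Shift bits (1 <= Shift <= 63), carrying
// bits between adjacent 64-bit words. src and dst do not overlap and both
// hold n words.
void ShiftAllBits(const uint64_t *src, uint64_t *dst, size_t n, unsigned Shift) {
    uint64_t w = 0;                       // carry from the previous word
    for (size_t i = 0; i < n; ++i) {
        dst[i] = w | (src[i] << Shift);   // for the first element w == 0
        w = src[i] >> (64 - Shift);       // bits shifted out of this word
    }
    // after the loop, w holds the bits shifted out of the top word
}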
The solution above is simple and obvious, but I need something more efficient, as I have gigabytes of data.
Ideally I would use SIMD instructions for that, so I'm looking for SIMD suggestions from experts. I need to implement the shifting code for all four popular instruction-set families: SSE-SSE4.2 / AVX / AVX2 / AVX-512.
But as far as I know, for SSE2, for example, there exists only the _mm_slli_si128() intrinsic/instruction, which shifts only by a multiple of 8 bits (in other words, byte shifting), while I need shifting by an arbitrary bit count, not only whole bytes.
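The usual SSE2 workaround is to combine the per-lane 64-bit shifts with a byte shift that moves the carry bits across the lane boundary. A sketch of that idea (my own naming; assumes Shift is in 1..63 and the vector is processed two 64-bit words at a time):

#include <emmintrin.h>
#include <cstdint>

// cur holds words src[i], src[i+1]; prev holds src[i-2], src[i-1].
// Returns words dst[i], dst[i+1] of the left-shifted bit vector.
static inline __m128i ShiftLeft2Words(__m128i cur, __m128i prev, int Shift) {
    // "below" = the words one position lower: { src[i-1], src[i] }
    __m128i below = _mm_or_si128(_mm_slli_si128(cur, 8), _mm_srli_si128(prev, 8));
    return _mm_or_si128(_mm_slli_epi64(cur, Shift),           // src[j] << Shift
                        _mm_srli_epi64(below, 64 - Shift));    // src[j-1] >> (64 - Shift)
}

The first 128-bit block needs prev = 0 (or scalar handling). An AVX2/AVX-512 version can follow the same pattern, except that the cross-lane step there needs shuffles such as vpermq/valignq instead of _mm_slli_si128, since the 256-bit byte shifts only work within 128-bit lanes.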
Without SIMD I can also shift by 128 bits at once using the shld reg, reg, cl instruction, which shifts bits from one 64-bit register into another, effectively a 128-bit shift. It is exposed as the __shiftleft128() intrinsic in MSVC, and the assembly it produces can be seen here.
BTW, I need solutions for all of MSVC/GCC/Clang.
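For MSVC vs GCC/Clang, a small portable wrapper over that 128-bit shift could look like this (a sketch; on GCC/Clang it goes through unsigned __int128, which the compilers lower to shld/shl):

#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Returns the upper 64 bits of ((hi:lo) << Shift), with 0 < Shift < 64.
inline uint64_t ShiftLeft128High(uint64_t lo, uint64_t hi, unsigned char Shift) {
#ifdef _MSC_VER
    return __shiftleft128(lo, hi, Shift);
#else
    unsigned __int128 v = ((unsigned __int128)hi << 64) | (unsigned __int128)lo;
    return (uint64_t)((v << Shift) >> 64);
#endif
}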
Also, inside a single loop iteration I can shift 4 or 8 words with independent sequential operations; this exploits CPU pipelining and out-of-order execution to run several instructions in parallel.
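For instance, one way to help out-of-order execution is to re-read the carry bits directly from src[i - 1] instead of propagating them through w, so the statements inside an unrolled iteration are independent of each other. A sketch (assumes n >= 1 and Shift in 1..63):

#include <cstdint>
#include <cstddef>

// 4-way unrolled variant: each dst word depends only on two adjacent src
// words, so there is no loop-carried dependency through a carry variable.
void ShiftAllBitsUnrolled(const uint64_t *src, uint64_t *dst, size_t n, unsigned Shift) {
    dst[0] = src[0] << Shift;
    size_t i = 1;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = (src[i + 0] << Shift) | (src[i - 1] >> (64 - Shift));
        dst[i + 1] = (src[i + 1] << Shift) | (src[i + 0] >> (64 - Shift));
        dst[i + 2] = (src[i + 2] << Shift) | (src[i + 1] >> (64 - Shift));
        dst[i + 3] = (src[i + 3] << Shift) | (src[i + 2] >> (64 - Shift));
    }
    for (; i < n; ++i)                      // tail
        dst[i] = (src[i] << Shift) | (src[i - 1] >> (64 - Shift));
}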
If needed, my bit vector can be aligned to any number of bytes in memory, if that helps, for example, to improve SIMD speed through aligned reads/writes. Also, the source and destination bit vectors occupy different (non-overlapping) memory.
In other words, I'm looking for any suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.
Note, to clarify: I actually have to do several shift-ors, not just a single shift. I have a large bit vector X and several hundred shift sizes s0, s1, ..., sN, where each shift size is different and can also be large (for example, a shift by 100K bits), and I want to compute the resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN). I just simplified my question for StackOverflow to shifting a single vector, but this detail about the original task is probably very important.
As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made toy minimal reproducible example of what I want to do: computing shift-ors of the input bit vector src to produce the or-ed final dst bit vector. This example is not optimized at all, just a straightforward simple variant of how my task can be solved. For simplicity this example uses a small input vector, not gigabytes as in my case. It is a toy example; I didn't check whether it solves the task correctly, and it may contain minor bugs:
#include <cstdint>
#include <cstddef>
#include <algorithm>
#include <vector>
#include <random>

#define bit_sizeof(x) (sizeof(x) * 8)

using u64 = uint64_t;
using T = u64;

int main() {
    std::mt19937_64 rng{123};

    // Randomly generate the source bit vector
    std::vector<T> src(100'000);
    for (size_t i = 0; i < src.size(); ++i)
        src[i] = rng();
    size_t const src_bitsize = src.size() * bit_sizeof(T);

    // Destination bit vector, for example twice bigger in size
    std::vector<T> dst(src.size() * 2);

    // Randomly generate shifts
    std::vector<u64> shifts(200);
    for (size_t i = 0; i < shifts.size(); ++i)
        shifts[i] = rng() % src_bitsize;

    // Right-shift that handles the out-of-range (>= 64) count
    auto Shr = [](auto x, size_t s) {
        return s >= bit_sizeof(x) ? 0 : (x >> s);
    };

    // Do actual Shift-Ors
    for (auto orig_shift: shifts) {
        size_t const
            word_off = orig_shift / bit_sizeof(T),
            bit_off  = orig_shift % bit_sizeof(T);
        if (word_off >= dst.size())
            continue;
        size_t const
            lim = std::min(src.size(), dst.size() - word_off);
        T w = 0;
        for (size_t i = 0; i < lim; ++i) {
            dst[word_off + i] |= w | (src[i] << bit_off);
            w = Shr(src[i], bit_sizeof(T) - bit_off);
        }
        // Special-case handling of the last word: or in the final carry
        if (word_off + lim < dst.size())
            dst[word_off + lim] |= w;
    }
}
My real project's current code is different from the toy example above. The project already solves a real-world task correctly; I just need to do extra optimizations. Some optimizations I have already done, like using OpenMP to parallelize the shift-or operations across all cores. Also, as said in the comments, I created specialized templated functions for each shift size, 64 functions in total, and choose one of the 64 functions to do the actual shift-or. Each C++ function has a compile-time value of the shift size, hence the compiler does extra optimizations taking the compile-time value into account.
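A hedged sketch of that dispatch idea (illustrative names, plain function pointers instead of the std::function array I actually use; needs C++17 for if constexpr):

#include <array>
#include <cstdint>
#include <cstddef>
#include <utility>

// One instantiation per in-word bit offset, so the compiler sees the shift
// count as a compile-time constant.
template <unsigned BitOff>
void ShiftOrFixed(const uint64_t *src, uint64_t *dst, size_t n) {
    uint64_t w = 0;
    for (size_t i = 0; i < n; ++i) {
        dst[i] |= w | (src[i] << BitOff);
        if constexpr (BitOff != 0)
            w = src[i] >> (64 - BitOff);   // carry into the next word
    }
}

using ShiftOrFn = void (*)(const uint64_t *, uint64_t *, size_t);

template <size_t... I>
constexpr std::array<ShiftOrFn, sizeof...(I)> MakeShiftOrTable(std::index_sequence<I...>) {
    return {{ &ShiftOrFixed<static_cast<unsigned>(I)>... }};
}

// shift_or_table[shift % 64] picks the specialization with a constant count.
static const auto shift_or_table = MakeShiftOrTable(std::make_index_sequence<64>{});

The inner loop of the toy example then becomes something like shift_or_table[bit_off](src.data(), dst.data() + word_off, lim) (plus or-ing the final carry word, which this sketch does not return).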
Comments:

…X, and several shift sizes s0, s1, ..., sN, then I want to compute Y = (X << s0) | (X << s1) | ... | (X << sN). – Durno

mpn lshift / rshift functions: gmplib.org/repo/gmp/file/tip/mpn/x86_64/fastsse/… - you could use GMP (GPL license) or maybe translate their loop into intrinsics. – Pellucid

…constexpr value inside each function, it is a template parameter. Then I created an array of 64 instances of std::function (in other words, function pointers) and just call one of these 64 functions depending on the shift size. This way I can have the shift size as a compile-time constexpr value, which lets the compiler optimize the instructions, and I can also make specialized variants of the functions for shift sizes that are multiples of 8. – Durno

…Y = (X << s0) | (X << s1) | ... | (X << sN), where X is a large input bit vector, Y is a large output bit vector, and s0, ..., sN are a hundred or so shift sizes from 0 to 1M. I do this computation in two loops: the outer loop goes through all sK, and the inner loop does the actual shift-or the way I showed in the code at the beginning of my question. Nothing else. This is the non-optimized solution, hence I'm looking for all suggestions regarding what can be done to optimize it. – Durno

…or, right? – Rogerson

shld is more efficient with constant shift counts than with a count in cl on Intel CPUs, and also on AMD, although shld in general is not very efficient on AMD; there it's more of a win to use SIMD. You may want to only use 128-bit vectors, especially if AVX-512 isn't available. AVX2 has higher latency for lane-crossing shifts, and the available lane-crossing shifts are more limited. – Pellucid

vpsllvq xmm, xmm, xmm (with a per-element count) is faster than vpsllq xmm, xmm, xmm (using the low element of the last operand as the count), presumably implemented as broadcast + SIMD-variable-shift uops. On Zen, either way is efficient. On all Intel, shift vector by vector is still 2 uops, but per-element variable shift is even worse at 3 instead of 1 on Haswell (the first CPU with that instruction). agner.org/optimize, uops.info. But Haswell / Broadwell are pretty old by now, so if you require AVX2 at all, you might broadcast shift counts for vpsllvq. – Pellucid

…OpenMP. – Durno

…src[i] >> (64 - Shift), I think. (If you special-case mod-8 shift counts such as zero, you don't have to deal with the 64 - 0 = 64 out-of-range shift count undefined behaviour on uint64_t elements.) – Pellucid

…the 64 - Shift right-shift is already in my project, I just forgot it here when writing this post. My project already works and solves the real-world task correctly without bugs; I just need optimizations. – Durno

…vpalignr / vperm2i128) – Pellucid

…vpalignr is pretty much useless for this ...) This usually only works if the input comes directly from memory, of course. – Diagram

…__always_aligned__, like uint64_t __always_aligned(64)__ * ptr = get_ptr();, that says that my function get_ptr() will always return a pointer whose address % 64 == 0. Is there such an attribute? Because my std::vector is always 64-byte aligned (I wrote a special aligned allocator for it), so there is no point in unaligned SIMD loads/stores in GCC's auto-vectorization. – Durno

__builtin_assume_aligned for GCC – Ickes
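Regarding the alignment question in the comments above, a small sketch of what I could use (get_ptr() is the hypothetical accessor from the comment; __builtin_assume_aligned is GCC/Clang-only, while std::assume_aligned is the portable C++20 equivalent):

#include <cstdint>
#include <memory>   // std::assume_aligned (C++20)

uint64_t *get_ptr();  // hypothetical: returns a 64-byte-aligned pointer

void use_aligned() {
#if defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin: promise the compiler the pointer is 64-byte aligned
    uint64_t *p1 = static_cast<uint64_t *>(__builtin_assume_aligned(get_ptr(), 64));
    (void)p1;
#endif
    // C++20 portable form (MSVC/GCC/Clang)
    uint64_t *p2 = std::assume_aligned<64>(get_ptr());
    (void)p2;
}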