Efficiently shift-or large bit vector

I have a large in-memory array given as a pointer uint64_t * arr (plus a size), which represents plain bits. I need to shift these bits to the right, as efficiently (performantly/fast) as possible, by some amount from 0 to 63.

By shifting the whole array I mean not shifting each element on its own (like a[i] <<= Shift), but shifting it as a single large bit vector. In other words, for each intermediate position i (except for the first and last elements) I can do the following in a loop:

dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);

where w is a temporary variable holding the right-shifted value of the previous array element.
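
For completeness, here is a minimal self-contained sketch of that whole-array shift (the function name is just for illustration; it assumes Shift is in the range 1..63 and that dst and src do not overlap):

#include <cstdint>
#include <cstddef>

void shift_all(uint64_t* dst, const uint64_t* src, size_t n, unsigned Shift)
{
    uint64_t w = 0;                          // bits carried over from the previous word
    for (size_t i = 0; i < n; ++i) {
        dst[i] = w | (src[i] << Shift);
        w = src[i] >> (64 - Shift);          // becomes the low bits of the next word
    }
}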

The solution above is simple and obvious. But I need something more efficient, as I have gigabytes of data.

Ideally I would use some SIMD instructions for this, so I'm looking for SIMD suggestions from experts. I need to implement the shifting code for all four popular instruction-set families - SSE-SSE4.2 / AVX / AVX2 / AVX-512.

But as far as I know, SSE2 for example has only the _mm_slli_si128() intrinsic/instruction, which shifts only by amounts that are a multiple of 8 bits (in other words, byte shifting). And I need shifting by an arbitrary bit count, not only byte shifts.
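
For reference, an arbitrary bit shift of a single 128-bit register can still be emulated on plain SSE2 by combining the per-64-bit-lane shifts with that byte shift for the carry bits - a sketch for a left shift by a compile-time Shift in 1..63:

#include <emmintrin.h>   // SSE2
#include <cstdint>

template<int Shift>      // Shift in 1..63
__m128i shift_left_128(__m128i v)
{
    __m128i hi    = _mm_slli_epi64(v, Shift);                          // shift each 64-bit lane
    __m128i carry = _mm_srli_epi64(_mm_slli_si128(v, 8), 64 - Shift);  // bits crossing the lane boundary
    return _mm_or_si128(hi, carry);
}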

Without SIMD I can also shift 128 bits at a time by using the shld reg, reg, cl instruction, which does a double-precision (128-bit) shift. It is available as the intrinsic __shiftleft128() in MSVC, and the assembler code it produces can be seen here.
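
For reference, a rough MSVC-only sketch of that shld-based loop (the intrinsic is declared in <intrin.h> and is x64-only; on GCC/Clang the plain two-shift expression or unsigned __int128 can be used instead):

#include <intrin.h>      // __shiftleft128 (MSVC, x64 only)
#include <cstdint>
#include <cstddef>

void shift_all_shld(uint64_t* dst, const uint64_t* src, size_t n, unsigned char Shift)
{
    dst[0] = src[0] << Shift;
    for (size_t i = 1; i < n; ++i)
        // high half of the 128-bit value (src[i]:src[i-1]) shifted left by Shift,
        // i.e. (src[i] << Shift) | (src[i-1] >> (64 - Shift)) for Shift in 1..63
        dst[i] = __shiftleft128(src[i - 1], src[i], Shift);
}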

BTW, I need solutions for all of MSVC/GCC/Clang.

Also, inside a single loop iteration I can shift 4 or 8 words with independent operations; the CPU can then pipeline them and speed things up by executing several instructions in parallel, out of order.

If needed, my bit vector can be aligned to any number of bytes in memory, if that helps, for example, to improve SIMD speed by doing aligned reads/writes. Also, the source and destination bit-vector memory regions are different (non-overlapping).

In other words, I'm looking for any suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.

Note, to clarify: I actually have to do several shift-ors, not just a single shift. I have a large bit vector X and several hundred shift sizes s0, s1, ..., sN, where each shift size is different and can also be large (for example a shift by 100K bits), and I want to compute the resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN). I just simplified my question for StackOverflow to shifting a single vector. But this detail about the original task is probably very important.

As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made toy minimal reproducible example of what I want to do, computing shift-ors of the input bit vector src to produce the OR-ed final dst bit vector. This example is not optimized at all, just a straightforward, simple variant of how my task can be solved. For simplicity this example has a small input vector, not gigabytes as in my case. It is a toy example; I didn't check whether it solves the task correctly, and it may contain minor bugs:

Try it online!

#include <cstdint>
#include <vector>
#include <random>
#include <algorithm>   // for std::min

#define bit_sizeof(x) (sizeof(x) * 8)

using u64 = uint64_t;
using T = u64;

int main() {
    std::mt19937_64 rng{123};

    // Random generate source bit vector
    std::vector<T> src(100'000);
    for (size_t i = 0; i < src.size(); ++i)
        src[i] = rng();

    size_t const src_bitsize = src.size() * bit_sizeof(T);

    // Destination bit vector, for example twice bigger in size
    std::vector<T> dst(src.size() * 2);

    // Random generate shifts
    std::vector<u64> shifts(200);
    for (size_t i = 0; i < shifts.size(); ++i)
        shifts[i] = rng() % src_bitsize;

    // Right-shift that handles overflow
    auto Shr = [](auto x, size_t s) {
        return s >= bit_sizeof(x) ? 0 : (x >> s);
    };

    // Do actual Shift-Ors
    for (auto orig_shift: shifts) {
        size_t const
            word_off = orig_shift / bit_sizeof(T),
            bit_off = orig_shift % bit_sizeof(T);

        if (word_off >= dst.size())
            continue;
        
        size_t const
            lim = std::min(src.size(), dst.size() - word_off);

        T w = 0;
        
        for (size_t i = 0; i < lim; ++i) {
            dst[word_off + i] |= w | (src[i] << bit_off);
            w = Shr(src[i], bit_sizeof(T) - bit_off);
        }

        // Special case of handling for last word
        if (word_off + lim < dst.size())
            dst[word_off + lim] |= w;
    }
}

My real project's current code is different from the toy example above. The project already solves a real-world task correctly; I just need to do extra optimizations. Some optimizations I already did, like using OpenMP to parallelize the shift-or operations across all cores. Also, as said in the comments, I created specialized templated functions for each shift size, 64 functions in total, and choose one of the 64 to do the actual shift-or. Each C++ function has a compile-time value of the shift size, hence the compiler does extra optimizations taking the compile-time values into account.
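
Roughly, that dispatch looks like the following sketch (simplified, with illustrative names; the real project differs in its details):

#include <cstdint>
#include <cstddef>
#include <array>
#include <utility>

template<unsigned BitOff>   // compile-time bit offset 0..63
void shift_or_words(uint64_t* dst, const uint64_t* src, size_t n)
{
    uint64_t w = 0;
    for (size_t i = 0; i < n; ++i) {
        dst[i] |= w | (src[i] << BitOff);
        w = (src[i] >> (63 - BitOff)) >> 1;   // carry == src[i] >> (64 - BitOff), split so BitOff == 0 stays well-defined
    }
}

using ShiftOrFn = void (*)(uint64_t*, const uint64_t*, size_t);

template<size_t... I>
constexpr std::array<ShiftOrFn, 64> make_shift_or_table(std::index_sequence<I...>)
{
    return { &shift_or_words<I>... };
}

constexpr auto shift_or_table = make_shift_or_table(std::make_index_sequence<64>{});

// usage (bounds handling omitted):
//   shift_or_table[shift % 64](dst.data() + shift / 64, src.data(), src.size());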

Durno answered 20/11, 2021 at 7:1 Comment(32)
If you have gigabytes of data, aren't you memory-bound anyway? And shouldn't you be considering merging this operation with the previous or next one, unless both of those are somehow restricted by hardware or so?Endorsee
@DavisHerring If I'm memory bound with a single shift of the vector, then at least maybe I'm not memory bound if I have to shift-or several vectors? For example if I have a large bit vector X, and several shift sizes s0, s1, ..., sN, then I want to compute Y = (X << s0) | (X << s1) | ... | (X << sN).Durno
Have a look at GMP's low-level mpn lshift / rshift functions: gmplib.org/repo/gmp/file/tip/mpn/x86_64/fastsse/… - you could use GMP (GPL license) or maybe translate their loop into intrinsics.Pellucid
Obviously if the shift-count is a multiple of 8, this becomes memmove + memset which can be very fast on modern CPUs that handle unaligned SIMD loads/stores efficiently.Pellucid
@PeterCordes Unfortunately, I have arbitrary shift sizes. Of course I can have an if-branch for the case of multiple-of-8 shifts. Please see my Note, I just added it to the end of my question; it describes the original, non-simplified task.Durno
Ok, yeah, you'll want to shift on the fly as you read inputs in groups of maybe 4 to 8, not actually store each bit-aligned temporary array back to memory (unless you're going to reuse the same shift-count soon).Pellucid
And yeah you'll probably want to incorporate a way to handle the special %8 case for some inputs, maybe by grouping together the pointers that go with the no-shift case? Or no, you want to overlap ALU and memory, so maybe have a version of a 4 or 5-input function that allows one of them to be unshifted so you just offset the pointer.Pellucid
Hmm, maybe also look for two shift counts the same modulo 8, and OR them before shifting. (Preferably still on the fly, although this would start to balloon the number of versions of the function you need even with sorting, without JITing.) So again, maybe a version of the function where the first 2 inputs share a shift count (but allow different lengths so you can offset their pointers). Or if you sort before batching up into groups of 4 input streams, maybe it's likely that you'll have more than 2 shift counts the same. Updating one of them in-place un-shifted will keep it hot in cachePellucid
@PeterCordes Yes, exactly, I did it like this already. I created a C++ templated function specialization for all 64 shift sizes, so that the shift size is a constexpr value inside each function, it is a template parameter. Then I created an array of 64 instances of std::function (in other words function pointers) and just call one of these 64 functions depending on the shift size. This way I can have the shift size as a compile-time constexpr value, which lets the compiler optimize the instructions, and I can also do specialized variants of functions for shift sizes that are a multiple of 8.Durno
You don't need AVX for this task. Find the LCM of shift and 64, and do multi-threading. The function will be bandwidth limited even at two threads.Rogerson
@PeterCordes Nice idea with grouping two shift sizes that are equal modulo 8... BTW, all the multiple-of-8 optimizations will save me just 1/8 of time (12.5%), because I have random uniformly distributed shift sizes.Durno
@Jake'Alquimista'LEE Actually if you see my new Note in the Question, you can notice that I need several Shift-Or operations on the same input vector. Probably in this case, if I have many shift sizes, for example hundreds, then I will still be CPU bound, not memory bound. Because each final word in the resulting vector has to be computed based on the OR of hundreds of words, which is a CPU-bound computation.Durno
@Durno Why don't you post the whole algorithm then?Rogerson
@Jake'Alquimista'LEE Right now the C++ algorithm is simple, I just compute Y = (X << s0) | (X << s1) | ... | (X << sN), where X is a large input bit vector, Y is a large output bit vector and s0, ..., sN are a hundred shift sizes from 0 to 1M. And I do this computation in two loops, the outer loop goes through all sK, and the inner loop does the actual shift-or the way I showed in the code at the beginning of my question. Nothing else. This is the non-optimized solution. Hence I'm looking for all suggestions regarding what can be done to optimize it.Durno
@Jake'Alquimista'LEE I don't post the algorithm itself, because it is part of a bigger program, and it does different unrelated stuff in the middle; in other words I don't have a ready-made example that can be shared. If I posted my whole code it would be thousands of lines. Otherwise I would have to implement from scratch a non-optimal clean minimal reproducible example here specially for my Question, which I haven't done.Durno
@Durno In other words, you need to do the OR 63 times, right?Rogerson
You're going to need runtime-variable shift counts if you support groups of multiple vectors, otherwise the combinatorial explosion of template versions would be too much. For just one, yeah scalar shld is more efficient with constant shift counts than cl on Intel CPUs, and also AMD although shld in general is not very efficient on AMD, more of a win to use SIMD. You may want to only use 128-bit vectors, especially if AVX-512 isn't available. AVX2 has higher latency for lane-crossing shifts, and the available lane-crossing shifts are more limited.Pellucid
On Skylake, vpsllvq xmm, xmm, xmm (with a per-element count) is faster than vpsllq xmm, xmm, xmm (using the low element of the last operand as the count), presumably implemented as broadcast + SIMD-variable-shift uops. On Zen, either way is efficient. On all Intel, shift vector by vector is still 2 uops, but per-element variable shift is even worse at 3 instead of 1 on Haswell (the first with that instruction). agner.org/optimize uops.info. But Haswell / Broadwell are pretty old by now, so if you require AVX2 at all, you might broadcast shift counts for vpsllvq.Pellucid
@Jake'Alquimista'LEE No, not 64 ORs. See my Note in the question. I have hundreds of shift sizes, each from 0 to 1M. It means I have to do hundreds of ORs. But because each large shift can be represented as a 64-bit-word offset plus a shift of 0 to 63, I only need shifts of 64 distinct sizes, but still hundreds of them plus the ORs.Durno
@Durno then post a working loop, not in words!!! Post exactly what you need to do.Rogerson
@Jake'Alquimista'LEE OK, I'll implement a toy example and post it.Durno
@Jake'Alquimista'LEE Just added to the very end of my Question a simple toy example algorithm, it is not optimized at all. It is close (but not identical) to what I already have in my project. In my real project I already did some optimizations, for example I created an array of 64 functions, representing the inner of the two loops, each specialized for a compile-time value of the shift size. Also I already did parallelization through OpenMP.Durno
Your right-shift count actually needs to be src[i] >> (64-Shift) I think. (If you special-case mod8 shift counts such as zero, you don't have to deal with 64-0 = 64 out-of-range shift count undefined behaviour on uint64_t elements.)Pellucid
@PeterCordes, thanks, corrected my Question. I actually do the 64-Shift right-shift already in my project, I just forgot it here when I was writing this post. My project already works and solves the real-world task correctly without bugs. I just need optimizations.Durno
GCC has no problem auto-vectorizing this for compile-time shift counts: godbolt.org/z/PKbhqjdM4Diagram
@chtz: Ah, interesting strategy from GCC, unaligned loads offset from each other by 8 bytes, allowing efficient use of 256-bit vectors. (For this shift of less than 64 bits). (vs clang doing it the hard way, with vpalignr / vperm2i128)Pellucid
@PeterCordes with AVX2 I always try to do interlane-shifts by offset loads (vpalignr is pretty much useless for this ...) This usually only works if the input comes directly from memory, of course.Diagram
@PeterCordes Do you think that writing manual SIMD intrinsics code here is worth doing? Or is GCC's auto-vectorization with loop unrolling quite enough, and about the same speed as the most advanced manual intrinsics code?Durno
@Arty: If you're only ever going to compile with recent GCC versions, that's probably fine, as long as you get similarly good asm once you start writing versions that use the same shift count for multiple vectors. (If you have hundreds of inputs, many of them will have the same shift count %8 or even %16, so you should group those together).Pellucid
@PeterCordes One more question - is it possible somehow to mark a certain pointer as guaranteed to be aligned? So that the compiler may create aligned load/store instructions for SIMD. For example some imaginary attribute __always_aligned__ like uint64_t __always_aligned(64)__ * ptr = get_ptr();, saying that my function get_ptr() will always return a pointer such that its address % 64 == 0. Is there such an attribute? Because my std::vector is always 64-byte aligned, I wrote a special aligned allocator for it. So there is no point in unaligned SIMD load/store in GCC's auto-vectorization.Durno
@Durno you can use __builtin_assume_aligned for GCCIckes
@MarcStevens If you're interested, I created a separate question regarding this alignment task.Durno

You can, and possibly you don't even need to use SIMD instructions explicitly. The target compilers GCC, Clang and MSVC, and other compilers like ICC, all support auto-vectorization. While hand-optimized assembly can outperform compiler-generated vectorized instructions, it's generally harder to achieve and you may need several versions for different architectures. Generic code that leads to efficient auto-vectorized instructions is a solution that may be portable across many platforms.

For instance a simple shiftvec version

void shiftvec(uint64_t* dst, uint64_t* src, int size, int shift)
{
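    // note: reads src[size] on the last iteration, so src must hold size+1 elements; shift must be in 1..63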
    for (int i = 0; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }
}

compiled with a recent GCC (Clang works as well) and -O3 -std=c++11 -mavx2 leads to SIMD instructions in the core loop of the assembly:

.L5:
  vmovdqu ymm4, YMMWORD PTR [rsi+rax]
  vmovdqu ymm5, YMMWORD PTR [rsi+8+rax]
  vpsllq ymm0, ymm4, xmm2
  vpsrlq ymm1, ymm5, xmm3
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, rdx
  jne .L5

See on godbolt.org: https://godbolt.org/z/5TxhqMhnK

This also generalizes if you want to combine multiple shifts into dst:

void shiftvec2(uint64_t* dst, uint64_t* src1, uint64_t* src2, int size1, int size2, int shift1, int shift2)
{
    int size = size1<size2 ? size1 : size2;
    for (int i = 0; i < size; ++i,++src1,++src2,++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
        *dst |= ((*src2)<<shift2) | (*(src2+1)>>(64-shift2)); 
    }
    for (int i = size; i < size1; ++i,++src1,++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));        
    }
    for (int i = size; i < size2; ++i,++src2,++dst)
    {
        *dst = ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
    }
}

compiles to a core-loop:

.L38:
  vmovdqu ymm7, YMMWORD PTR [rsi+rcx]
  vpsllq ymm1, ymm7, xmm4
  vmovdqu ymm7, YMMWORD PTR [rsi+8+rcx]
  vpsrlq ymm0, ymm7, xmm6
  vpor ymm1, ymm1, ymm0
  vmovdqu YMMWORD PTR [rax+rcx], ymm1
  vmovdqu ymm7, YMMWORD PTR [rdx+rcx]
  vpsllq ymm0, ymm7, xmm3
  vmovdqu ymm7, YMMWORD PTR [rdx+8+rcx]
  vpsrlq ymm2, ymm7, xmm5
  vpor ymm0, ymm0, ymm2
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rax+rcx], ymm0
  add rcx, 32
  cmp r10, rcx
  jne .L38

Combining multiple sources in one loop will reduce the total amount of memory bandwidth spent on loading/writing the destination. How many sources you can combine is of course limited by the available registers. Note that xmm2 and xmm3 in shiftvec contain the shift values, so having different versions for compile-time-known shift values may free those registers.

Additionally using __restrict (supported by GCC,CLANG,MSVC) for each of the pointers will tell the compiler that the ranges are not overlapping.
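
For example, a sketch combining both points - the shift as a compile-time template parameter (so it no longer occupies a vector register) and __restrict-qualified pointers - with the same precondition as shiftvec that src holds size+1 readable elements and shift is in 1..63:

#include <cstdint>

template<int shift>   // compile-time shift in 1..63
void shiftvec_ct(uint64_t* __restrict dst, const uint64_t* __restrict src, int size)
{
    for (int i = 0; i < size; ++i)
        dst[i] = (src[i] << shift) | (src[i + 1] >> (64 - shift));
}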

I initially had problems getting MSVC to emit properly auto-vectorized code, but it seems that adding a more SIMD-like structure makes it work for all three desired compilers GCC, Clang and MSVC:

void shiftvec(uint64_t* __restrict dst, const uint64_t* __restrict src, int size, int shift)
{
    int i = 0;
    // MSVC: use steps of 2 for SSE, 4 for AVX2, 8 for AVX512
    for (; i+4 < size; i+=4,dst+=4,src+=4)
    {
        for (int j = 0; j < 4; ++j)
            *(dst+j) = (*(src+j))<<shift;
        for (int j = 0; j < 4; ++j)
            *(dst+j) |= (*(src+j+1)>>(64-shift));
    }
    for (; i < size; ++i,++src,++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }    
}
Ickes answered 20/11, 2021 at 8:34 Comment(5)
Unfortunately whereas MSVC does auto-vectorizing, can be told that pointers don't alias, and can even generate check for aliasing and generate two loop versions (aliased and vectorized), the vectorization capabilities themselves are very limited, it cannot vectorize this shift. (An option could be to compile this function with clang-cl and take advantage of binary compatibility of clang-cl and cl).Venable
Yeah, I've tried to get MSVC to auto vectorize the code above. Weirdly godbolt shows for me that it works for msvc x86, but not for msvc x64. In the x64 case using /Qvec-report:2 gives auto vectorization failure 1200.Ickes
Interesting. It also fails to vectorize on x86 for 32-bit type godbolt.org/z/fTo3oT5n1 . I've reported missing optimization, maybe they'll do something with this someday developercommunity.visualstudio.com/t/…Venable
I can confirm that even MSVC can autovectorize it properly using somewhat more SIMD-like structured code: ``` void shiftvec(uint64_t* __restrict dst, const uint64_t* __restrict src, int size, int shift) { int i = 0; for (; i+4 < size; i+=4,dst+=4,src+=4) { for (int j = 0; j < 4; ++j) *(dst+j) = (*(src+j))<<shift; for (int j = 0; j < 4; ++j) *(dst+j) |= (*(src+1)>>(64-shift)); } for (; i < size; ++i,++src,++dst) { *dst = ((*src)<<shift) | (*(src+1)>>(64-shift)); } } ```Ickes
@AlexGuteniev: If you want good asm across compilers, take a good vectorization strategy from GCC (in this case), or clang, and convert that back to intrinsics.Pellucid

I would attempt to rely on the x64 ability to read from unaligned addresses, which comes with almost no visible penalty when the stars are properly (un)aligned. One would then only need to handle the remaining (shift % 8) or (shift % 16) cases -- all doable with the SSE2 instruction set -- filling the remainder with zeros, applying an unaligned byte offset to the data vector, and addressing the aliasing UB with memcpy.
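
For the fully byte-aligned part (shift % 8 == 0), as also noted in the comments under the question, the operation degenerates to a byte copy at an offset plus zero filling - a sketch, assuming a little-endian word layout, byte_shift <= nbytes, and separate dst/src buffers of nbytes bytes each:

#include <cstring>
#include <cstdint>
#include <cstddef>

void shift_left_whole_bytes(uint8_t* dst, const uint8_t* src, size_t nbytes, size_t byte_shift)
{
    std::memcpy(dst + byte_shift, src, nbytes - byte_shift);  // move every byte up by byte_shift
    std::memset(dst, 0, byte_shift);                          // zero-fill the vacated low bytes
}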

That said, the inner loop would look like:

uint16_t *ptr;
// c is the bit count within a 16-bit lane, c16 = 16 - c; both are passed as __m128i
// counts (e.g. via _mm_cvtsi32_si128), since _mm_srl/_mm_sll take the count in a register
auto a = _mm_loadu_si128((__m128i*)ptr);
auto b = _mm_loadu_si128((__m128i*)(ptr - 1));
a = _mm_srl_epi16(a, c);
b = _mm_sll_epi16(b, c16);
_mm_storeu_si128((__m128i*)ptr, _mm_or_si128(a, b));
ptr += 8;

Unrolling this loop a few times, one might be able to use _mm_alignr_epi8 on SSSE3+ to relax memory bandwidth (and those pipeline stages that need to combine results from unaligned memory accesses):

auto a0 = w; 
auto a1 = _mm_load_si128(m128ptr + 1);
auto a2 = _mm_load_si128(m128ptr + 2);
auto a3 = _mm_load_si128(m128ptr + 3);
auto a4 = _mm_load_si128(m128ptr + 4);
auto b0 = _mm_alignr_epi8(a1, a0, 2);
auto b1 = _mm_alignr_epi8(a2, a1, 2);
auto b2 = _mm_alignr_epi8(a3, a2, 2);
auto b3 = _mm_alignr_epi8(a4, a3, 2);
// ... do the computation as above ...
w = a4;   // rotate the context
Inunction answered 20/11, 2021 at 8:41 Comment(6)
Surely one can also use _mm_sll_epi64 to get exactly what you need; my assumption would anyway be that it's faster to read from unaligned addresses when shift % 8 == 0 than to run this algorithm in the first place.Inunction
Is it faster to do the SSE bit-shift instructions with 8-, 16-, 32- or 64-bit elements? Or are they all the same speed?Durno
@Arty: They're all the same in speed on all CPUs, see uops.info and agner.org/optimizePellucid
As commented elsewhere too, Intel does not have 8-bit shifts at all; thus if needed, they are about 2x as slow, as they need to be emulated. But in this case any granularity will work exactly the same.Inunction
I think you have a small typo - instead of b = _mm_sll_epi16(a, 16 - c); should be b = _mm_sll_epi16(b, 16 - c); (a replaced with b). Also b should be ptr - 1 if you're doing right shift, or swap _mm_srl_epi16 for a and _mm_sll_epi16 for b to do left shift.Durno
Yes, you are right. fixed.Inunction

In other words I'm looking for all the suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.

The key to efficiency is to be lazy. The key to being lazy is to lie - pretend you shifted without actually doing any shifting.

For an initial example (to illustrate the concept only), consider:

struct Thingy {
    int ignored_bits;
    uint64_t data[];
};

void shift_right(struct Thingy * thing, int count) {
    thing->ignored_bits += count;
}

void shift_left(struct Thingy * thing, int count) {
    thing->ignored_bits -= count;
}

int get_bit(struct Thingy * thing, int bit_number) {
    bit_number += thing->ignored_bits;
    return !!(thing->data[bit_number / 64] & (1ULL << (bit_number % 64)));
}

For practical code you'll need to care about various details - you'll probably want to start with spare bits at the start of the array (and non-zero ignored_bits) so that you can pretend to shift right; for each small shift you'll probably want to clear "shifted in" bits (otherwise it'll behave like floating point - e.g. ((5.0 << 8) >> 8) == 5.0); if/when ignored_bits goes outside a certain range you'll probably want a large memcpy(); etc.
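
A rough sketch of the "large memcpy when ignored_bits leaves the comfortable range" step (capacity_words is an assumed extra parameter, not part of the struct above; it also assumes ignored_bits is currently non-negative):

#include <string.h>
#include <stdint.h>

void normalize(struct Thingy * thing, size_t capacity_words) {
    size_t whole_words = thing->ignored_bits / 64;
    if (whole_words == 0)
        return;
    /* turn whole 64-bit words of pretended shifting into a real move... */
    memmove(thing->data, thing->data + whole_words,
            (capacity_words - whole_words) * sizeof(uint64_t));
    /* ...zero what is now past the moved data, and keep only the sub-word part */
    memset(thing->data + (capacity_words - whole_words), 0,
           whole_words * sizeof(uint64_t));
    thing->ignored_bits %= 64;
}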

For more fun, abuse low-level memory management - use VirtualAlloc() (Windows) or mmap() (Linux) to reserve a huge space, then put your array in the middle of the space, then allocate/free pages at the start/end of the array as needed; so that you only need to memcpy() after the original bits have been "shifted" many billions of bits to the left/right.
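
A Linux-only sketch of that reservation trick (error handling omitted; the Windows counterpart would be VirtualAlloc with MEM_RESERVE followed by MEM_COMMIT per range):

#include <sys/mman.h>
#include <stddef.h>

void* reserve_address_space(size_t reserve_bytes) {
    /* reserve virtual address space only; no physical pages are committed yet */
    return mmap(NULL, reserve_bytes, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

void commit_range(void* page_aligned_addr, size_t bytes) {
    /* make a page-aligned sub-range usable; pages get backed on first touch */
    mprotect(page_aligned_addr, bytes, PROT_READ | PROT_WRITE);
}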

Of course the consequence is that it's going to complicate other parts of your code - e.g. to OR 2 bitfields together you'll have to do a tricky "fetch A; shift A to match B; result = A OR B" adjustment. This isn't a deal breaker for performance.

Garland answered 20/11, 2021 at 13:7 Comment(1)
Thanks for the nice idea! Upvoted. I like the idea of pretending to be shifted. It can actually help me partially (not fully) in my task, because I have several cumulative Shift-Or stages, and your lazy idea may help at the first two or last two stages to save some memory.Durno
#include <cstdint>
#include <immintrin.h>

template<unsigned Shift>
void foo(uint64_t* __restrict pDst, const uint64_t* __restrict pSrc, intptr_t size)
{
    const uint64_t* pSrc0, * pSrc1, * pSrc2, * pSrc3;
    uint64_t* pDst0, * pDst1, * pDst2, * pDst3;
    __m256i prev, current;
    intptr_t i, stride;

    stride = size >> 2;
    i = stride;

    pSrc0 = pSrc;
    pSrc1 = pSrc + stride;
    pSrc2 = pSrc + 2 * stride;
    pSrc3 = pSrc + 3 * stride;

    pDst0 = pDst;
    pDst1 = pDst + stride;
    pDst2 = pDst + 2 * stride;
    pDst3 = pDst + 3 * stride;

    prev = _mm256_set_epi64x(0, pSrc1[-1], pSrc2[-1], pSrc3[-1]);

    while (i--)
    {
        current = _mm256_set_epi64x(*pSrc0++, *pSrc1++, *pSrc2++, *pSrc3++);
        prev = _mm256_srli_epi64(prev, 64 - Shift);
        prev = _mm256_or_si256(prev, _mm256_slli_epi64(current, Shift));
        *pDst0++ = _mm256_extract_epi64(prev, 3);
        *pDst1++ = _mm256_extract_epi64(prev, 2);
        *pDst2++ = _mm256_extract_epi64(prev, 1);
        *pDst3++ = _mm256_extract_epi64(prev, 0);

        prev = current;
    }
}

You can do the operation on up to four 64-bit elements at once with AVX2 (up to eight with AVX-512).

If size isn't a multiple of four, there will be up to 3 remaining ones to deal with.
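
If you target AVX-512, a contiguous-load variant of the same per-element idea (using loads offset by one 64-bit word, as discussed in the comments, rather than the strided layout above) could look roughly like this, assuming Shift is a compile-time constant in 1..63:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

template<unsigned Shift>   // 1..63
void shift_avx512(uint64_t* __restrict pDst, const uint64_t* __restrict pSrc, size_t size)
{
    pDst[0] = pSrc[0] << Shift;                           // first word has no predecessor
    size_t i = 1;
    for (; i + 8 <= size; i += 8) {
        __m512i cur  = _mm512_loadu_si512(pSrc + i);
        __m512i prev = _mm512_loadu_si512(pSrc + i - 1);  // overlapping unaligned load
        __m512i hi   = _mm512_slli_epi64(cur, Shift);
        __m512i lo   = _mm512_srli_epi64(prev, 64 - Shift);
        _mm512_storeu_si512(pDst + i, _mm512_or_si512(hi, lo));
    }
    for (; i < size; ++i)                                 // scalar remainder (up to 7 words)
        pDst[i] = (pSrc[i] << Shift) | (pSrc[i - 1] >> (64 - Shift));
}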

PS: Auto vectorization is never a proper solution.

Tenet answered 20/11, 2021 at 9:35 Comment(8)
Yuck, why would you want to gather 1 element from each of 4 positions strided through the array, instead of turning GCC's much better auto-vectorization strategy back into intrinsics like _mm256_loadu_si256? Memory-destination vpextrq mem, xmm, imm is 2 fused-domain uops on Skylake, p5 shuffle plus a p237+p4 micro-fused store, so it competes with the shuffles needed for _mm256_set_epi64x on non-constant data. Also, vpextrq from the upper half of a YMM requires a vextracti128 first, because vpextrq only works on an XMM source.Pellucid
If you want to give HW prefetchers multiple pages to look at, you should be doing 2 or 4x 256-bit vectors at a time in an unrolled loop, probably still using GCC's strategy. I haven't benchmarked, but on CPUs where 4k-splits aren't extremely expensive, I expect this to be significantly slower. Maybe worse even on CPUs before Skylake were a misaligned load that crosses a 4k boundary costs over 100 cycles. (Avoiding unaligned loads seems the only advantage here, but L1d cache absorbs those well.)Pellucid
@PeterCordes I don't want the same memory to be read twice. It consumes power (You know my ARM background) And since size isn't fixed, there is no guarantee that stride will be a multiple of four. And most of all, I don't know Intel architecture very well: That above is almost exactly what I would do on NEON. (except for the sri sli instead of or)Rogerson
Executing all those shuffle uops very likely consumes more power than the load execution units would reading L1d with two unaligned loads instead of 1. You're not touching DRAM multiple times; the CPU has cache. Also, finishing the work and returning to a deep sleep state sooner saves much more power. x86 doesn't have strided SIMD loads in the first place, unlike NEON; I wasn't suggesting that.Pellucid
@PeterCordes I already told you that I'm VERY disappointed in AVX2 a few years ago. I completely agree with Linus Torvalds on AVX. It heavily lacks in integer operations. I was shocked finding out that AVX even has no barrel shifter for 8bit in addition to permutations being so clumsy and inconvenient. OpenCV would be much better with NEON than AVX.Rogerson
Linus has most notably complained about AVX512 as a "power virus". I don't recall comments from him on AVX2. AVX-512 has much better shuffles, but still no 8-bit shifts :/ Definitely still some warts, but really unfortunate that AVX-512's nice features like masking and better shifts are still mostly limited to servers because they only come with wide vectors, and Alder Lake won't be bringing them to the desktop for the most part, not on CPUs with E-cores enabled. Anyway, not liking an extension might explain writing sub-optimal code for it, but doesn't mean other folks can't do better.Pellucid
@PeterCordes AVX-512 is much better. My image processing routine runs more than ten times faster on AVX-512 than on AVX2. I was very glad AVX-512 was becoming "mandatory" on Intel chips. Then Intel let me down with Alder Lake and its stupid E cores. I was considering building a 12600KF machine, but settled with an 11400F last week. BTW I'll let you know about this new algorithm of mine when it goes public early next year (after applying for a patent). I'd be more than happy if you reviewed it, and would really appreciate any suggestion on optimizing it further.Rogerson
Let us continue this discussion in chat.Rogerson

No, you can't

Both NEON and AVX(512) support barrel shift operations up to 64bit elements.

You can however "shift" the whole 128bit vector by n-bytes (8bits) with the instruction ext on NEON and alignr on AVX.

And you should avoid using the vector class for performance since it's nothing else than linked list which is bad for the performance.

Tenet answered 20/11, 2021 at 7:4 Comment(11)
So if SIMD can't do this, then at least I'm looking for a solution for how to speed up this algorithm by any improvements that can be made.Durno
@Durno SIMD CAN do this, especially NEON with its sri and sli instructions. But ditch the vector class first.Rogerson
Regarding std::vector - I just used it as an example, in reality I have just two plain memory pointers uint64_t *src, *dst which point to two memory locations, and the source memory should be shifted into the destination. BTW, in C++ std::vector is not a linked list but a regular contiguous array, laid out in memory the same way as uint64_t arr[SIZE];.Durno
I just removed the std::vector reference in my question. You may assume that I just have a plain pointer to a memory region that should be shifted.Durno
std::vector is not a linked list. Idk where you heard this but it's not true. A vector in c++ is a contiguous area of memory that is reallocated upon expansion.Portugal
@Qix-MONICAWASMISTREATED reallocation, that's the worst thing for performance, a common trait of linked list and vector class. They all look the same to me.Rogerson
If one new/memcpy looks the same as a linked-list to you, you need new glasses. The copying isn't great, but with careful use of std::vector you can avoid wasteful copying. A linked list is fundamentally different and horrible, traversing it involves load-use latency; no idea what you're talking about lumping std::vector in with LLs like std::forward_list. Also, there is room to gain some performance vs. scalar shld by using 128-bit SSE or AVX even without the OR on the fly part; that's what GMP's mpn_lshift does on modern x86-64. Especially if data's hot in L2 cache.Pellucid
@PeterCordes How did you know that I'm actually looking for new glasses? :-) "with careful use" is the problem. I've seen so many who aren't careful. Especially the Java generation. It must be even worse with the Python generation. And you know me: Anything that eats up performance is evil.Rogerson
Sure, but avoiding performance waste requires understanding. Calling it a linked list doesn't help anyone; the things it's slow at are basically the opposite of the things a linked list is slow at.Pellucid
@Jake NEON / SIMD has nothing to do with collection structures. I don't know what your point is. Your prior comments indicate you are indeed talking about std::vector.Portugal
@Qix-MONICAWASMISTREATED Sorry, it was a misunderstanding. I thought I didn't post it, and I just deleted the commentRogerson
