How do you handle indivisible vector lengths with SIMD intrinsics, array not a multiple of vector width?
I am currently learning how to work with SIMD intrinsics. I know that an AVX 256-bit vector can contain four doubles, eight floats, or eight 32-bit integers. How do we use AVX to process arrays whose length isn't a multiple of these numbers?

For example, how would you add two std::vectors of 53 integers each? Would we process as many elements as fit into full SIMD vectors and just handle the remainder manually? Is there a better way to do this?

Meingoldas answered 16/9, 2022 at 3:18 Comment(4)
I would usually try to just use one of the (parallel) std algorithms and let the compiler worry about it.Atrocious
@JesperJuhl: It would be nice if compilers were better at this, and used better tricks like an unaligned final vector that ends at the end of the input arrays, possibly overlapping with earlier work for problems that are idempotent or where you can read that final vector before a store. That works for non-reductions where it doesn't matter if you process the same element twice.Equal
Related: Utilize memory past the end of a std::vector using a custom overallocating allocator. Also related: an example of what GCC does when auto-vectorizing: Why does p1007r0 std::assume_aligned remove the need for epilogue? (especially older GCC which liked to use a prologue to reach an alignment boundary.)Equal
Near duplicate of Jump back some iterations for vectorized remainder loop and/or Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. Leaving this open for now, as both of those are about specific strategies. My answer on the first one mentions other strategies. Also Handling elements that are odd number using neon intrinsics has an interesting implementation.Equal
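
To make the overlapping-final-vector trick from the comments above concrete, here is a minimal sketch (my own addition, with assumed names; it requires count >= 8, a destination that does not alias the inputs, and an operation where reprocessing an element is harmless, as elementwise add is):

#include <immintrin.h>

// assumes count >= 8 and that r does not alias a or b
void add_overlap(int* r, const int* a, const int* b, unsigned count) {
    unsigned i = 0;
    // full blocks of 8
    for (; i + 8 <= count; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));
        _mm256_storeu_si256((__m256i*)(r + i), _mm256_add_epi32(va, vb));
    }
    if (i < count) {
        // one final unaligned vector that ends exactly at count; it may
        // re-process up to 7 elements already handled above, which is fine
        // because each r[j] = a[j] + b[j] is recomputed to the same value
        const unsigned j = count - 8;
        __m256i va = _mm256_loadu_si256((const __m256i*)(a + j));
        __m256i vb = _mm256_loadu_si256((const __m256i*)(b + j));
        _mm256_storeu_si256((__m256i*)(r + j), _mm256_add_epi32(va, vb));
    }
}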
Would we process as many elements as fit into full SIMD vectors and just handle the remainder manually? Is there a better way to do this?

Pretty much this. Below is a basic example that processes the numbers in batches of 8 and uses maskload/maskstore to handle the remainder.

#include <immintrin.h>

void add(int* const r, const int* const a, const int* const b, const unsigned count) {

    // how many elements fit in whole blocks of 8, and how many are left over
    const unsigned c8 = count & ~0x7U;
    const unsigned cr = count & 0x7U;

    // process blocks of 8
    for(unsigned i = 0; i < c8; i += 8) {
        __m256i _a = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i _b = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i _c = _mm256_add_epi32(_a, _b);
        _mm256_storeu_si256((__m256i*)(r + i), _c);
    }

    // per-lane masks enabling 0..4 elements of a 128-bit half
    const __m128i temp[5] = {
        _mm_setr_epi32(0, 0, 0, 0),
        _mm_setr_epi32(-1, 0, 0, 0),
        _mm_setr_epi32(-1, -1, 0, 0),
        _mm_setr_epi32(-1, -1, -1, 0),
        _mm_setr_epi32(-1, -1, -1, -1)
    };

    // I'm using maskload / maskstore for the remainder here.
    // (this is not the only approach)
    __m256i mask;
    if(cr >= 4) {
        // low half fully enabled, high half enables cr - 4 elements
        mask = _mm256_set_m128i(temp[cr & 3], temp[4]);
    } else {
        // high half disabled, low half enables cr elements
        mask = _mm256_set_m128i(temp[0], temp[cr]);
    }
    __m256i _a = _mm256_maskload_epi32(a + c8, mask);
    __m256i _b = _mm256_maskload_epi32(b + c8, mask);
    __m256i _c = _mm256_add_epi32(_a, _b);
    _mm256_maskstore_epi32(r + c8, mask, _c);
}
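
For instance (my usage sketch, not part of the original answer), calling it on the 53-element vectors from the question runs six full 8-wide iterations and masks off the 5-element tail:

#include <vector>

void demo() {
    std::vector<int> a(53, 1), b(53, 2), r(53);
    add(r.data(), a.data(), b.data(), 53);  // 48 elements in six blocks of 8, then a 5-element masked tail
}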

Of course, if you happen to use your own containers (or provide your own allocators), then you can avoid most of this mess by simply ensuring all container allocations occur in multiples of 256 bits.

#include <cstring>  // memcpy

// yes, this class is missing a lot...
class MyIntArray {
public:

   MyIntArray(unsigned count, const int* data) {
      // bump capacity to the next multiple of 8 ints (256 bits)
      unsigned cap = count & 7;
      if(cap) cap = 8 - cap;
      capacity = cap + count;
      // the allocation is padded, so full 8-wide loads/stores
      // never run past the end of the buffer
      alloc = new int[capacity];
      size = count;
      memcpy(alloc, data, sizeof(int) * size);
   }

   MyIntArray(unsigned count) {
      // bump capacity to the next multiple of 8 ints (256 bits)
      unsigned cap = count & 7;
      if(cap) cap = 8 - cap;
      capacity = cap + count;
      // the allocation is padded, so full 8-wide loads/stores
      // never run past the end of the buffer
      alloc = new int[capacity];
      size = count;
   }

   unsigned capacity;
   unsigned size;
   int* alloc;

   int* begin() { return alloc; }
   int* end() { return alloc + size; }
   const int* begin() const { return alloc; }
   const int* end() const { return alloc + size; }
};

void add(MyIntArray& r, const MyIntArray& a, const MyIntArray& b) {

    // process blocks of 8.
    // the last iteration may store past the end of the array, but never past
    // the end of the capacity allocation....
    // (probably also want to check to see if the sizes match!)
    for(unsigned i = 0; i < r.size; i += 8) {
        __m256i _a = _mm256_loadu_si256((const __m256i*)(a.alloc + i));
        __m256i _b = _mm256_loadu_si256((const __m256i*)(b.alloc + i));
        __m256i _c = _mm256_add_epi32(_a, _b);
        _mm256_storeu_si256((__m256i*)(r.alloc + i), _c);
    }
}
Muth answered 16/9, 2022 at 4:6 Comment(1)
You don't need an array of 5 __m128i constants, you only need an array of alignas(32) int mask[8] = {-1,-1,-1,-1, 0,0,0,0}, and load a sliding window into it from 4-count or something. As in Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all or Left shift a vector by runtime variable number of bytes. Pack it more densely by loading with vpmovsxbd to sign-extend bytes to dwords. Aligning it makes any window into it not split a cache line.Equal
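
A sketch of that sliding-window mask idea (my reading of the comment, widened to a 16-int table so the window covers a full 8-int __m256i; the names are mine):

#include <immintrin.h>

// 8 set dwords followed by 8 zero dwords; an 8-element window starting at
// index (8 - cr) contains exactly cr leading -1s. alignas(64) keeps the whole
// 64-byte table inside one cache line, so no window splits a line.
alignas(64) static const int mask_src[16] = {
    -1, -1, -1, -1, -1, -1, -1, -1,
     0,  0,  0,  0,  0,  0,  0,  0
};

void add_tail(int* r, const int* a, const int* b, unsigned count) {
    const unsigned c8 = count & ~0x7U;  // elements handled by the main 8-wide loop (not shown)
    const unsigned cr = count & 0x7U;   // 0..7 leftover elements
    if (cr == 0) return;
    __m256i mask = _mm256_loadu_si256((const __m256i*)(mask_src + (8 - cr)));
    __m256i va = _mm256_maskload_epi32(a + c8, mask);
    __m256i vb = _mm256_maskload_epi32(b + c8, mask);
    _mm256_maskstore_epi32(r + c8, mask, _mm256_add_epi32(va, vb));
}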
