How to vectorise int8 multiplication in C (AVX2)

How do I vectorize this C function with AVX2?

static void propogate_neuron(const short a, const int8_t *b, int *c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}
Arhat answered 4/11, 2021 at 23:5 Comment(2)
Let the compiler do it? godbolt.org/z/nGeP46zMc Adding restrict helps. – Ecthyma
While this can be done (as shown), can you rearrange the code so that you have all neurons at once, and vectorize across them in a different way? I'm just guessing what you're doing with this, but it may be possible to reduce the number of loads and stores to c by reordering the overall computation. – Dislodge

GCC already auto-vectorizes that with a check for overlap. Promising that there's no overlap by using int *restrict c lets GCC remove that check, and gets clang to decide to auto-vectorize.

However, clang widens to 32-bit and uses vpmulld, which is 2 uops on Haswell and later. (It is fully efficient on Zen, though.) GCC uses vpmullw and vpmulhw to get the low and high halves of full 16x16 => 32-bit multiplies, and shuffles those halves together. (Godbolt) This is a pretty clunky strategy, especially with -march=znver2 where vpmulld is a single uop.

GCC's strategy does use only four single-uop multiply instructions, but it spends a lot of shuffle uops to get there. We can do better:


Since we only need 8x16 => 32-bit multiplies, we can instead use vpmaddwd which is single-uop on Haswell/Skylake as well as Zen. https://uops.info/table.html

Unfortunately we can't take advantage of the add part since we need to add to a full 32-bit value. We need zeros in the high half of every pair of 16-bit elements to use it as just a 16x16 => 32-bit multiply within each 32-bit element.

#include <immintrin.h>

void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c) {
    __m256i va = _mm256_set1_epi32( (uint16_t)a );    // [..., 0, a, 0, a] 16-bit elements

    for (int i = 0; i < 32; i += 8) {
        // sign-extend 8 int8_t elements to 32-bit lanes
        __m256i vb = _mm256_cvtepi8_epi32( _mm_loadl_epi64((const __m128i*)&b[i]) );
        __m256i prod = _mm256_madd_epi16(va, vb);      // low pair: a*b, high pair: 0*ext, so a*b per 32-bit element
        __m256i sum = _mm256_add_epi32(prod, _mm256_loadu_si256((const __m256i*)&c[i]));
        _mm256_storeu_si256((__m256i*)&c[i], sum);
    }
}

Godbolt:

# clang13.0 -O3 -march=haswell
        movzx   eax, di
        vmovd   xmm0, eax                     # 0:a  16-bit halves
        vpbroadcastd    ymm0, xmm0            # repeated to every element

        vpmovsxbd       ymm1, qword ptr [rsi]  # xx:b 16-bit halves
        vpmaddwd        ymm1, ymm0, ymm1       # 0 + a*b in each 32-bit element
        vpaddd  ymm1, ymm1, ymmword ptr [rdx]
        vmovdqu ymmword ptr [rdx], ymm1

... repeated 3 more times, 8 elements per vector

        vpmovsxbd       ymm1, qword ptr [rsi + 8]
        vpmaddwd        ymm1, ymm0, ymm1
        vpaddd  ymm1, ymm1, ymmword ptr [rdx + 32]
        vmovdqu ymmword ptr [rdx + 32], ymm1
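Not from the original answer: a quick way to sanity-check the intrinsics version against the scalar loop is a throwaway harness along these lines. It assumes propogate_neuron_avx2 above is compiled into the same program; propogate_neuron_ref is just a local copy of the question's loop.

// Hypothetical sanity-check harness; assumes propogate_neuron_avx2() above
// is compiled/linked into the same program.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c);

// local copy of the scalar reference loop from the question
static void propogate_neuron_ref(const short a, const int8_t *b, int *c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}

int main(void) {
    int8_t b[32];
    int c_ref[32], c_simd[32];
    for (int i = 0; i < 32; ++i) {
        b[i] = (int8_t)(i * 7 - 100);              // arbitrary signed bytes
        c_ref[i] = c_simd[i] = i * 1000 - 16000;   // arbitrary starting accumulators
    }
    propogate_neuron_ref(-1234, b, c_ref);
    propogate_neuron_avx2(-1234, b, c_simd);
    printf("%s\n", memcmp(c_ref, c_simd, sizeof c_ref) ? "MISMATCH" : "match");
    return 0;
}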

If saving a uop per vector multiply makes a measurable performance difference, it might be worth the trouble of manually vectorizing in the source.
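If you do want to measure it, a crude harness along these lines (hypothetical, not part of the answer) is one starting point. It assumes non-static copies of the restrict-qualified auto-vectorized version and the intrinsics version are compiled in a separate file without LTO, so the calls aren't optimized away; since everything stays in L1 cache and the calls are back-to-back, it mostly measures the loop bodies and call overhead rather than realistic use.

// Crude throughput comparison (hypothetical). Assumes both versions are
// linked in from another translation unit with no LTO.
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

void propogate_neuron(const short a, const int8_t *b, int *restrict c);   // auto-vectorized (restrict) version
void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c);

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    static int8_t b[32];   // all zero, so c never changes and can't overflow
    static int c[32];
    const long iters = 100000000;

    double t0 = seconds();
    for (long r = 0; r < iters; ++r) propogate_neuron(3, b, c);
    double t1 = seconds();
    for (long r = 0; r < iters; ++r) propogate_neuron_avx2(3, b, c);
    double t2 = seconds();

    printf("auto-vectorized: %.3f s   intrinsics: %.3f s   (c[0]=%d)\n",
           t1 - t0, t2 - t1, c[0]);
    return 0;
}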

It's a missed optimization that GCC / clang don't do this in the first place when auto-vectorizing your pure C code.

If anyone wants to report this as a missed-optimization bug, leave a comment here. Otherwise I might get around to it. IDK if patterns like this are frequent enough for GCC / LLVM's optimizers to want to look for them, especially since clang already makes a reasonable choice that's only sub-optimal because of CPU quirks (32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2x 16x16 => 32-bit with horizontal add).

Incommunicado answered 5/11, 2021 at 2:27 Comment(0)

You need to add the restrict qualifier to c to tell the compiler that it cannot alias with b.

The issue is that int8_t is most likely a typedef for signed char, which can alias any other type under the strict-aliasing rule. Therefore the compiler cannot be sure that storing to c[i] will not modify b[i]. This forces it to re-fetch b[i] on every iteration.
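To see why the compiler must be this pessimistic, consider a contrived, hypothetical caller (treating propogate_neuron as an externally visible function). If int8_t really is a character type, this is legal without restrict, and every store to c[i] changes bytes that b points at:

#include <stdint.h>

void propogate_neuron(const short a, const int8_t *b, int *c);

void contrived_caller(int *c) {
    // A character-typed pointer may legally view any object, so b can
    // alias (the first 32 bytes of) c. The compiler must allow for this.
    const int8_t *b = (const int8_t *)c;
    propogate_neuron(3, b, c);   // each store to c[i] modifies what b points at
}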

The presence of const does not help here, because it only stops the programmer from modifying data through the pointer b; it does not promise that the data cannot change through some other pointer.

After changing the prototype to:

void propogate_neuron(const short a, const int8_t *b, int * restrict c)

the code gets vectorized. See godbolt
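Put together, that is (the loop body is unchanged from the question, only the signature differs):

#include <stdint.h>

void propogate_neuron(const short a, const int8_t *b, int *restrict c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}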

Pyrethrum answered 4/11, 2021 at 23:23 Comment(3)
A more aggressive compiler could have done an overlap check, and made two versions of the loop (one for the overlap case, one for the non-overlap case). The size is known ahead of the loop. For example, GCC does that without restrict: godbolt.org/z/vfx7r5Tfv. But yes, sometimes compilers just give up unless you use restrict (or C++ __restrict), and if you know it won't overlap then it helps avoid a wasted length calc + compare/branch. Also note that you only enabled -mavx, not AVX2 + tuning with -march=haswell or znver2; you get fewer insns: godbolt.org/z/hcKszerbK – Incommunicado
Hmm, can we avoid widening to 32-bit for vpmulld (2 uops on Haswell and later)? Maybe still widen b to 32-bit elements with vpmovsx loads, but use vpmaddwd (single uop) to do a 16x16-bit widening signed multiply with a vector of [0, a, 0, a, ...], instead of sign-extending a to 32-bit. We still get 32-bit full-multiply results, and add with 0. That's a missed optimization in the auto-vectorizer, though. – Incommunicado
Added that as an answer. – Incommunicado
