How do I vectorize this C function with AVX2?
static void propogate_neuron(const short a, const int8_t *b, int *c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}
GCC already auto-vectorizes that with a check for overlap. Promising that there's no overlap by using int *restrict c
lets GCC remove that check, and gets clang to decide to auto-vectorize.
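That is, keeping the question's loop as-is and only qualifying c with restrict, along the lines of:

static void propogate_neuron(const short a, const int8_t *b, int *restrict c) {
    // restrict promises the stores through c never overlap b, so the
    // compiler doesn't need a runtime overlap check before vectorizing.
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}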
However, clang widens to 32-bit and uses vpmulld, which is 2 uops on Haswell and later (although it's fully efficient on Zen).
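In intrinsics, one 8-element step of that widening strategy looks roughly like the sketch below. This is my reconstruction of the shape of clang's output, not its literal code, and the helper name widen_mul_step is made up:

#include <immintrin.h>
#include <stdint.h>

// One 8-element step: sign-extend 8 int8_t to 32-bit elements, then do a
// full 32x32 => 32-bit multiply (vpmulld) before adding into c.
static inline void widen_mul_step(short a, const int8_t *b, int *c) {
    __m256i va   = _mm256_set1_epi32(a);                                      // a sign-extended into every 32-bit lane
    __m256i vb   = _mm256_cvtepi8_epi32(_mm_loadl_epi64((const __m128i*)b));  // vpmovsxbd load of 8 bytes
    __m256i prod = _mm256_mullo_epi32(va, vb);                                // vpmulld: 2 uops on Haswell and later
    __m256i sum  = _mm256_add_epi32(prod, _mm256_loadu_si256((const __m256i*)c));
    _mm256_storeu_si256((__m256i*)c, sum);
}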
GCC instead uses vpmullw and vpmulhw to get the low and high halves of 16-bit full multiplies, and shuffles those together (Godbolt). This is a pretty clunky strategy, especially with -march=znver2, where vpmulld is a single uop. GCC's version does use only four single-uop multiply instructions, but it costs a lot of shuffles to get there. We can do better:
Since we only need 8x16 => 32-bit multiplies, we can instead use vpmaddwd, which is single-uop on Haswell/Skylake as well as Zen (https://uops.info/table.html). Unfortunately we can't take advantage of the add part, since we need to add to a full 32-bit value. We need zeros in the high half of every pair of 16-bit elements to use it as just a 16x16 => 32-bit multiply within each 32-bit element.
#include <immintrin.h>
#include <stdint.h>

void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c) {
    __m256i va = _mm256_set1_epi32( (uint16_t)a );   // [..., 0, a, 0, a] 16-bit elements
    for (int i = 0; i < 32; i += 8) {
        // sign-extend 8 int8_t elements to 32-bit (vpmovsxbd load)
        __m256i vb = _mm256_cvtepi8_epi32( _mm_loadl_epi64((__m128i*)&b[i]) );
        // 16x16 => 32-bit multiply; va's high 16-bit halves are 0, so madd's horizontal add contributes nothing
        __m256i prod = _mm256_madd_epi16(va, vb);
        __m256i sum = _mm256_add_epi32(prod, _mm256_loadu_si256((const __m256i*)&c[i]));
        _mm256_storeu_si256((__m256i*)&c[i], sum);
    }
}
# clang13.0 -O3 -march=haswell
movzx eax, di
vmovd xmm0, eax # 0:a 16-bit halves
vpbroadcastd ymm0, xmm0 # repeated to every element
vpmovsxbd ymm1, qword ptr [rsi] # xx:b 16-bit halves
vpmaddwd ymm1, ymm0, ymm1 # 0 + a*b in each 32-bit element
vpaddd ymm1, ymm1, ymmword ptr [rdx]
vmovdqu ymmword ptr [rdx], ymm1
... repeated 3 more times, 8 elements per vector
vpmovsxbd ymm1, qword ptr [rsi + 8]
vpmaddwd ymm1, ymm0, ymm1
vpaddd ymm1, ymm1, ymmword ptr [rdx + 32]
vmovdqu ymmword ptr [rdx + 32], ymm1
If saving a uop per vector multiply makes a measurable performance difference, it might be worth the trouble of manually vectorizing in the source.
It's a missed optimization that GCC / clang don't do this in the first place when auto-vectorizing your pure C code.
If anyone wants to report this, leave a comment here. Otherwise I might get around to it. IDK if patterns like this are frequent enough for GCC / LLVM's optimizers to want to look for them, especially since clang already makes a reasonable choice that's only sub-optimal because of CPU quirks (32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2x 16x16 => 32-bit with horizontal add).
You need to add the restrict qualifier to c to mark that it cannot alias with b.

The issue is that int8_t is likely a typedef for signed char, which may alias any other type under the strict-aliasing rules. Therefore the compiler cannot be sure that setting c[i] will not modify b[i], and that forces it to re-fetch the data on every iteration.

The presence of const does not help, because it only stops the programmer from modifying data through the pointer b; it does not promise that the data cannot change through some other pointer.
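For illustration, here is a hypothetical caller (not from the question) that the compiler must assume is possible without restrict, since the same storage is reachable through both pointers:

int buf[32] = {0};
// Legal call with the original prototype: b and c alias the same bytes,
// so a store through c[i] can change what a later b[i] load reads.
propogate_neuron(3, (const int8_t *)buf, buf);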
After changing the prototype to:

void propogate_neuron(const short a, const int8_t *b, int * restrict c)

the code gets vectorized. See Godbolt.
restrict: godbolt.org/z/vfx7r5Tfv. But yes, sometimes compilers just give up unless you use restrict (or C++ __restrict), and if you know it won't overlap then it helps avoid a wasted length calc + compare/branch. Also note that you only enabled -mavx, not AVX2 + tuning with -march=haswell or znver2; you get fewer insns: godbolt.org/z/hcKszerbK – Incommunicado

vpmulld (2 uops on Haswell and later)? Maybe still widen b to 32-bit elements with vpmovsx loads, but use vpmaddwd (single uop) to do 16x16-bit widening signed multiply with a vector of [0, a, 0, a, ...], instead of sign-extending a to 32-bit. We still get 32-bit full-multiply results, and add with 0. That's a missed optimization in the auto-vectorizer, though. – Incommunicado
restrict helps. – Ecthymac