I'm trying to multiply vectors of uint32_t, producing the full 64-bit result in a uint64_t vector, in gcc. I expect gcc to emit a single VPMULUDQ instruction. Instead, gcc shuffles the individual uint32_t elements of the source vectors around and then does a full 64*64=64 multiplication. Here is what I've tried:
#include <stdint.h>
typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));
v4llu mul(v8lu x, v8lu y) {
    x[1] = 0; x[3] = 0; x[5] = 0; x[7] = 0;
    y[1] = 0; y[3] = 0; y[5] = 0; y[7] = 0;
    return (v4llu)x * (v4llu)y;
}
The first version masks out the unwanted parts of the uint32_t vector, in the hope that gcc would optimize away the unneeded parts of the 64*64=64 multiplication and then see that the masking is pointless as well. No such luck.
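For comparison, the same masking idea can be written as a single vector AND rather than four element stores per operand. This is only a sketch: the name mul_masked and the pointer-based signature are hypothetical, it assumes a little-endian target (so x[0], x[2], ... are the low halves of the 64-bit lanes), and whether a given gcc version turns it into VPMULUDQ is not something the sketch guarantees.

```c
#include <stdint.h>

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));

/* Hypothetical variant of mul: clear the upper 32 bits of every 64-bit
   lane with one vector AND, then multiply the 64-bit lanes.
   Assumes little-endian lane layout. Vectors are passed by pointer so the
   code also compiles cleanly when AVX is not enabled. */
void mul_masked(const v8lu *x, const v8lu *y, v4llu *out) {
    const v4llu low32 = {0xffffffffu, 0xffffffffu, 0xffffffffu, 0xffffffffu};
    v4llu a = (v4llu)(*x) & low32;   /* keep x[0], x[2], x[4], x[6] */
    v4llu b = (v4llu)(*y) & low32;   /* keep y[0], y[2], y[4], y[6] */
    *out = a * b;                    /* 64*64=64, upper input halves are 0 */
}
```

The result is the same as for mul; only the way the masking is expressed to the compiler differs.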
v4llu mul2(v8lu x, v8lu y) {
    v4llu tx = {x[0], x[2], x[4], x[6]};
    v4llu ty = {y[0], y[2], y[4], y[6]};
    return tx * ty;
}
Here I try to create a uint64_t vector from scratch with only the used parts set. Again, gcc should see that the top 32 bits of each uint64_t are 0 and not do a full 64*64=64 multiply. Instead, the values are extracted and put back one by one, followed by a 64*64=64 multiply.
v4llu mul3(v8lu x, v8lu y) {
    v4llu t = {x[0] * (uint64_t)y[0], x[2] * (uint64_t)y[2],
               x[4] * (uint64_t)y[4], x[6] * (uint64_t)y[6]};
    return t;
}
Let's build the result vector by multiplying the parts. Maybe gcc sees that it can use VPMULUDQ to achieve exactly that. No luck, it falls back to 4 IMUL opcodes.
Is there a way to tell gcc what I want it to do (32*32=64 multiplication with everything perfectly placed)?
Note: Inline asm or the intrinsic isn't the answer. Writing the opcode by hand obviously works, but then I would have to write different versions of the code for many target architectures and feature sets. I want gcc to understand the problem and produce the right solution from a single source.
v4di __builtin_ia32_pmuludq256 (v8si,v8si) – Conciliar
mul and mul2 are optimized fine with clang: godbolt.org/z/d3MAay, mul3 is not equivalent, since it needs to truncate the results to 32 bits. I guess your options are: a) Use clang, b) use intrinsics, c) provide a patch to gcc which properly optimizes this (or file a bug and hope someone else fixes it). – Swordsman
_mm256_mul_epu32, defined by immintrin.h – Populace
Test __AVX2__ with #if and use the intrinsic if __AVX2__ is non-zero and other code if it is not. – Upu
__MMX__ and __SSE__ and __SSE2__ and __SSE3__ and __SSE4__ and __NEON__ and __NEON2__ and some 30 others. As said, that is not what I want. – Decretive
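The suggestion of testing __AVX2__ with #if could look roughly like the following. It is a minimal sketch, not a recommendation: the name mul_even and the pointer-based signature are hypothetical, the fallback is a plain scalar loop, and the objection above still applies, since every additional feature set would need its own branch.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));

/* Hypothetical helper: out[i] = (uint64_t)x[2*i] * y[2*i] for i = 0..3.
   Vectors are passed by pointer so the fallback path also compiles
   cleanly when AVX is not enabled. */
void mul_even(const v8lu *x, const v8lu *y, v4llu *out) {
#if defined(__AVX2__)
    /* _mm256_mul_epu32 multiplies the low 32 bits of each 64-bit lane
       and produces full 64-bit products (VPMULUDQ). */
    *out = (v4llu)_mm256_mul_epu32((__m256i)(*x), (__m256i)(*y));
#else
    /* Fallback for targets without AVX2: widen one operand to 64 bits
       so each product is computed as 32*32=64. */
    for (int i = 0; i < 4; i++)
        (*out)[i] = (uint64_t)(*x)[2 * i] * (*y)[2 * i];
#endif
}
```

Both branches compute the same values; only the AVX2 branch names the instruction explicitly.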