uint32_t * uint32_t = uint64_t vector multiplication with gcc
I'm trying to multiply vectors of uint32_t, producing the full 64-bit result in a uint64_t vector, in gcc. The result I expect is for gcc to emit a single VPMULUDQ instruction. But the code gcc emits instead is a horrible shuffling around of the individual uint32_t lanes of the source vectors, followed by a full 64*64=64 multiplication. Here is what I've tried:

#include <stdint.h>

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));

v4llu mul(v8lu x, v8lu y) {
    x[1] = 0; x[3] = 0; x[5] = 0; x[7] = 0;
    y[1] = 0; y[3] = 0; y[5] = 0; y[7] = 0;
    return (v4llu)x * (v4llu)y;
}

The first version masks out the unwanted lanes of the uint32_t vectors in the hope that gcc would optimize away the unneeded parts of the 64*64=64 multiplication and then see that the masking is pointless as well. No such luck.

v4llu mul2(v8lu x, v8lu y) {
    v4llu tx = {x[0], x[2], x[4], x[6]};
    v4llu ty = {y[0], y[2], y[4], y[6]};
    return tx * ty;
}

Here I try to create a uint64_t vector from scratch with only the used parts set. Again gcc should see that the top 32 bits of each uint64_t are 0 and not do a full 64*64=64 multiply. Instead, gcc emits a lot of element extraction and reinsertion, followed by a full 64*64=64 multiply.

v4llu mul3(v8lu x, v8lu y) {
    v4llu t = {x[0] * (uint64_t)y[0], x[2] * (uint64_t)y[2], x[4] * (uint64_t)y[4], x[6] * (uint64_t)y[6]};
    return t;
}

Let's build the result vector by multiplying the parts. Maybe gcc sees that it can use VPMULUDQ to achieve exactly that. No luck, it falls back to 4 IMUL opcodes.

Is there a way to tell gcc what I want it to do (32*32=64 multiplication with everything perfectly placed)?

Note: Inline asm or the intrinsic isn't the answer. Writing the opcode by hand obviously works. But then I would have to write different versions of the code for many target architectures and feature sets. I want gcc to understand the problem and produce the right solution from a single source code.

Decretive answered 13/11, 2019 at 13:9
Are you looking for v4di __builtin_ia32_pmuludq256 (v8si, v8si)? — Conciliar
@JL2210: The type promotion rules are not pertinent. The question does not ask for a standard C way to do this; it asks for GCC features. — Upu
@ben: "the intrinsic isn't the answer" — Decretive
If you just want to know how to make GCC do what you want, why not use the intrinsic that @Conciliar proposed? It seems fragile to rely on some pattern of code that the version of GCC you happen to be using recognizes and compiles into the code you want. If you want to know it will work, use the intrinsic function that explicitly specifies your intent. — Uzial
@GoswinvonBrederlow: Why is the intrinsic not the answer? If it does what you want, why not use it? — Upu
@EricPostpischil: As I understand it, with this extension GCC (very reasonably) follows the C practice that the result of an arithmetic operation has the (possibly promoted) operand type. If you want 32x32->64 you'd need to promote one operand to 64 bits before the operation and rely on it being optimized correctly. — Algol
mul and mul2 are optimized fine with clang: godbolt.org/z/d3MAay. mul3 is not equivalent, since it needs to truncate the results to 32 bits. I guess your options are: a) use clang, b) use intrinsics, c) provide a patch to gcc which properly optimizes this (or file a bug and hope someone else fixes it). — Swordsman
@Ben: the standard portable intrinsic is _mm256_mul_epu32, defined by immintrin.h — Populace
@Swordsman: Added the missing cast to 64 bits to mul3. It doesn't make gcc use pmuludq. — Decretive
@EricPostpischil: Because if I wanted to use the intrinsic I would have done so. The goal is to get the compiler to produce the right opcode for the -m<arch> specified. The intrinsic will fail to compile if -mavx2 isn't used. — Decretive
@GoswinvonBrederlow: You can test __AVX2__ with #if and use the intrinsic if __AVX2__ is non-zero and other code if it is not. — Upu
@EricPostpischil: And __MMX__ and __SSE__ and __SSE2__ and __SSE3__ and __SSE4__ and __NEON__ and __NEON2__ and some 30 others. As said, that is not what I want. — Decretive
@GoswinvonBrederlow: "Not what I want" and "if I wanted to use the intrinsic I would have done so" are not justifiable reasons. "Because we need to support many different target architectures and writing individual code for each is too costly" is. Edit the question to state your full requirements, based on actual project requirements, not on "wants." — Upu
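One more pattern worth noting, following the promotion idea raised in the comments: GCC (since version 9) and clang provide __builtin_convertvector to widen the even lanes explicitly before multiplying, so only a 32x32=64 product is semantically required. Whether a given gcc version turns this into a single VPMULUDQ is not guaranteed; this is just a sketch of the approach:

```c
#include <stdint.h>

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));
typedef uint32_t v4lu __attribute__ ((vector_size (16)));

v4llu mul4(v8lu x, v8lu y) {
    /* Gather the even 32-bit lanes, then widen each to 64 bits
       before multiplying, so only a 32x32=64 product is needed. */
    v4lu xe = {x[0], x[2], x[4], x[6]};
    v4lu ye = {y[0], y[2], y[4], y[6]};
    return __builtin_convertvector(xe, v4llu) * __builtin_convertvector(ye, v4llu);
}
```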
As noted in the comments by chtz, both mul and mul2 are optimized correctly by clang. Code similar to mul3 but using a for loop gets optimized too (though not as well).

So it looks to me like the syntax correctly expresses what the code should do, and gcc so far simply lacks the smarts to optimize it properly.
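The loop variant mentioned above would look roughly like this (same semantics as mul3; how well it vectorizes depends on the compiler and version):

```c
#include <stdint.h>

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));

v4llu mul_loop(v8lu x, v8lu y) {
    v4llu t;
    /* Multiply the even 32-bit lanes, widening each product to 64 bits. */
    for (int i = 0; i < 4; i++)
        t[i] = x[2 * i] * (uint64_t)y[2 * i];
    return t;
}
```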

Decretive answered 14/11, 2019 at 14:12
