Constexpr and SSE intrinsics

Most C++ compilers support SIMD (SSE/AVX) instructions via intrinsics like

_mm_cmpeq_epi32

My problem with this is that the function is not marked constexpr, although "semantically" there is no reason for it not to be constexpr, since it is a pure function.

Is there any way I could write my own version of (for example) _mm_cmpeq_epi32 that is constexpr?

Obviously I would like the function to use the proper asm at runtime; I know I could reimplement any SIMD function as a slow constexpr function.

In case you wonder why I care about the constexpr-ness of SIMD functions: non-constexpr-ness is contagious, meaning that any function of mine that uses those SIMD functions cannot be constexpr either.
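
A minimal sketch of what I mean (the wrapper name is just for illustration; brace-initializing __m128i is a GNU extension):

#include <immintrin.h>

// Fails in a constant expression because _mm_cmpeq_epi32 is not declared constexpr.
constexpr __m128i equal_mask(__m128i a, __m128i b) {
    return _mm_cmpeq_epi32(a, b);
}

// error: 'equal_mask(...)' is not a constant expression
constexpr __m128i m = equal_mask(__m128i{1, 1}, __m128i{1, 1});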

Favourable answered 16/8, 2018 at 14:59 Comment(17)
I think there were similar questions here, as this is the usual problem that functions cannot be overloaded depending on constexpr-ness.Exeter
Sorry, you are out of luck. Inline assembly can't be used in constexpr functions, so you can't write your own.Clothier
@SergeyA: that's not inline assembly.Exeter
Not possible in standard C++, but for instance, GCC defines the __builtin_constant_p extension which allows tricks like #define FOO(x) (__builtin_constant_p(x) ? foo_constexpr(x) : foo_asm(x)) - if x can be evaluated as a constant by the compiler then the pure C++ implementation will be used, allowing further inlining and compile-time optimizations.Ulick
Possible duplicate of constexpr overloadingUlick
@geza: In GNU C, _mm_cmpeq_epi32 is just a thin wrapper around __builtin_ia32_pcmpeqd128, not inline asm. That's why the compiler can optimize intrinsics when operands are compile-time constants. (Or just generally optimize, especially clang has a good shuffle optimizer.) But those builtins aren't constexpr either, I don't think.E
@PeterCordes: yes, that's what I was trying to say to SergeyA, too :) Or maybe I misunderstand something here?Exeter
Possible duplicate of Branching on constexpr evaluation / overloading on constexprExeter
@geza: yup, I meant to reply to SergeyA, not you. Oops.E
Ping @SergeyA, see my earlier comment.E
The compiler must be able to evaluate a constexpr function at compile time. If your compiler does not know how to evaluate some/any/all SIMD builtins, functions using those cannot be constexpr. Notice that evaluating a function at compile time is quite different from compiling a function; you could be cross-compiling for another platform so the compiler might not even be able to run the function after compilation to get its value. Hence, there would be special emulation code needed for the compiler to emulate the function in 'plain C++' which is apparently not there.Luffa
@PeterCordes this is not important, since the said builtin is not constexpr. And the builtin can't be constexpr.Clothier
@SergeyA: It could be, just like the + or == operator on integers, but gcc unfortunately chose not to define it that way. I'm not sure about GNU C native vector syntax; that might work as constexpr, like constexpr __m128 foo(__m128 a, __m128 b){ return a+b; }.E
@JimmyB: gcc does know how to evaluate all the __builtin_ia32 functions at compile time; it can do constant-propagation through them, just like the + operator for scalar int for example. It's purely an unfortunate issue of C++ syntax and how things are declared. (The only weird thing is that static const __m128 foo = _mm_set1_ps(2.0f); doesn't turn into a constant initializer; it copies from .rodata at runtime, and thus is terrible.)E
@PeterCordes "gcc does know how to evaluate all the __builtin_ia32 functions at compile time" I strongly doubt that. It may for the simpler ones (at a stage way too late to be usable in constexpr), but several are completely opaque to the compiler.Unimposing
@MarcGlisse: Maybe not strictly all, but constant-propagation works through return _pdep_u32(0x1234, 0xaaaa); (a pretty good example of a complex operation), and basically all SIMD functions including _mm_shuffle_epi8 (pshufb), IIRC. The "later stage" issue might be what's going on for static foo = _mm_set_epi32() though? Instead of a constant in read-only memory, you get a runtime copy from a .LC0 vector constant to the named static storage. /facepalm.E
@PeterCordes are you sure about _mm_shuffle_epi8? I just tried giving it 2 null vectors, and it didn't optimize anything. The read-only constant thing is a well-known limitation, currently it has to be done in the front-end, and there is nothing in case the compiler realizes later that it was actually a constant. I really hope this will change some day, but I wouldn't hold my breath. Relevant: gcc.gnu.org/bugzilla/show_bug.cgi?id=65197 (and 55894, 80517).Unimposing

It's unfortunate that Intel's intrinsics aren't defined as constexpr.

There's no reason they couldn't be; compilers can and do evaluate them at compile time for constant-propagation and other optimizations. (This is one major reason why builtin functions / intrinsics are better than inline asm wrappers for single instructions.)


Solution for GCC. (Doesn't work for clang or MSVC).

ICC compiles it but chokes when you try to use it as part of an initializer for a constexpr __m128i.

#include <immintrin.h>

typedef int v4si __attribute__((vector_size(16)));  // GNU C native vector of 4 ints

constexpr
__m128i pcmpeqd(__m128i a, __m128i b) {
    return (v4si)a == (v4si)b;      // fine with gcc and ICC

    //return (__m128i)__builtin_ia32_pcmpeqd128((v4si)a, (v4si)b); // bad with ICC
    //return _mm_cmpeq_epi32(a,b);  // not constexpr-compatible
}

See it on the Godbolt compiler explorer, with two test callers (one with variables, one with constexpr __m128i v1 {0x100000000, 0x300000002}; inputs). Interestingly, ICC doesn't do constant-propagation through pcmpeqd or _mm_cmpeq_epi32; it loads two constants and uses an actual pcmpeqd, even with optimization enabled. The same thing happens with or without constexpr; I think ICC normally does optimize intrinsics.

gcc does accept constexpr __m128i vector_const { pcmpeqd(__m128i{0,0}, __m128i{-1,-1}) };
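
A sketch of what such test callers can look like (names here are illustrative, not taken from the Godbolt link):

// Runtime inputs: gcc should compile this to a single pcmpeqd instruction.
__m128i test_runtime(__m128i a, __m128i b) {
    return pcmpeqd(a, b);
}

// Compile-time inputs: the whole comparison is folded into a constant.
constexpr __m128i v1 {0x100000000, 0x300000002};
constexpr __m128i v2 {0x100000000, 0x300000002};
constexpr __m128i folded = pcmpeqd(v1, v2);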


GCC (but not clang) treats __builtin_ia32 functions as constexpr-compatible. The documentation for GNU C x86 built-in functions doesn't mention this, but probably only because it's C documentation, not C++.

GNU C native vector syntax is also constexpr-compatible; that's a second option that's again only viable if you don't care about MSVC.

GNU C defines __m128i as a vector of two long long elements. So for integer SIMD, you need to define other element types yourself (or use the internal types like __v4si defined by gcc/clang/ICC's immintrin.h).
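
A minimal sketch of that native-vector option (GNU C only; the function name is illustrative):

typedef int v4si __attribute__((vector_size(16)));   // as above: 4 x int, 16 bytes

constexpr v4si add4(v4si a, v4si b) {
    return a + b;                   // element-wise add, usable in constant expressions
}

constexpr v4si a {1, 2, 3, 4};
constexpr v4si b {10, 20, 30, 40};
constexpr v4si sum = add4(a, b);    // {11, 22, 33, 44}, computed at compile time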


(The only weird thing is that static const __m128i foo = _mm_set1_epi32(2); doesn't turn into a constant initializer; it copies from .rodata at runtime, which is terrible: it uses a guard variable that is checked on every function call to see whether the variable still needs to be initialized.)
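
For illustration, the pattern being described looks like this (a sketch; the function is hypothetical):

#include <immintrin.h>

__m128i get_twos() {
    // Not treated as a constant initializer: gcc emits a guard variable and
    // copies the vector from .rodata the first time this function runs.
    static const __m128i twos = _mm_set1_epi32(2);
    return twos;
}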


GCC's xmmintrin.h and emmintrin.h define Intel intrinsics in terms of native vector operators (like *) or __builtin_ia32 functions. It looks like they prefer using operators when possible, instead of (__m128i)__builtin_ia32_pcmpeqd128((v4si)a, (v4si)b);

gcc does require explicit casts between different vector types.

From gcc7.3's emmintrin.h (SSE2):

extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_epi32 (__m128i __A, __m128i __B)
{
  return (__m128i) ((__v4si)__A == (__v4si)__B);
}

#ifdef __OPTIMIZE__
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_shuffle_epi32 (__m128i __A, const int __mask)
{
  return (__m128i)__builtin_ia32_pshufd ((__v4si)__A, __mask);
}
#else
#define _mm_shuffle_epi32(A, N) \
  ((__m128i)__builtin_ia32_pshufd ((__v4si)(__m128i)(A), (int)(N)))
#endif

Interesting: gcc's header avoids an inline function in some cases if compiling with optimization disabled. I guess this leads to better debug symbols, so you don't single-step into the definition of the inline function (which does happen when using stepi in GDB in optimized code with a TUI source window showing.)

E answered 18/8, 2018 at 18:28 Comment(1)
The unoptimized macro path is because some instructions require an immediate constant argument, which would be problematic to obtain at -O0 otherwise (need to inline the function, then propagate the value).Unimposing

There is now a cross-platform solution in C++20: std::is_constant_evaluated allows us to do exactly this.

#include <immintrin.h>
#include <type_traits>  // std::is_constant_evaluated (C++20)
#include <utility>      // std::forward

template<typename T>
constexpr auto add(T&& l, T&& r) noexcept
{
    if (std::is_constant_evaluated())
        return slow_add(std::forward<T>(l), std::forward<T>(r));  // plain-C++ constexpr fallback
    else
        return _mm_add_pd(l.value, r.value);                      // SIMD path at runtime
}

Note the use of a normal if statement here. It is tempting to use if constexpr, but then the condition itself is constant-evaluated, so std::is_constant_evaluated() always returns true and the runtime branch is discarded. Don't worry about the plain if: the branch is always optimized out, since the value of std::is_constant_evaluated() is known at compile time in each context (even when it returns false).
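
A sketch of that pitfall, using the same placeholder names as above:

template<typename T>
constexpr auto bad_add(T&& l, T&& r) noexcept
{
    // The condition of `if constexpr` is manifestly constant-evaluated, so
    // std::is_constant_evaluated() returns true here and the else branch is discarded.
    if constexpr (std::is_constant_evaluated())
        return slow_add(std::forward<T>(l), std::forward<T>(r));
    else
        return _mm_add_pd(l.value, r.value);   // never selected, even at runtime
}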

Wester answered 22/6, 2021 at 1:55 Comment(4)
You still need a portable constexpr-compatible way to implement slow_add, which may require #ifdef to get at the elements in an MSVC way or a GNU C native-vector way. Without any non-portable stuff, __m128i is opaque, and all the intrinsics that would let you get at its elements (including _mm_store_si128 and _mm_load_si128) aren't declared constexpr (hence the original problem).E
Or did you mean use a union { __m128i value; int32_t i32[4]; }; or similar?E
related: How to combine constexpr and vectorized code? has basically the same answer.E
At first, I was thinking you could use std::bit_cast, but I see that MSVC implements __m128i and company as a union, making this impossible on MSVC. reinterpret_cast obviously doesn't work in constexpr. It's possible that to make this cross-platform, you need to check for constant evaluation at a higher level, before performing the _mm_load call.Wester
