It's unfortunate that Intel's intrinsics aren't defined as constexpr
.
There's no reason they couldn't be; compilers can and do evaluate them at compile time for constant-propagation and other optimizations. (This is one major reason why builtin functions / intrinsics are better than inline asm wrappers for single instructions.)
Solution for GCC. (Doesn't work for clang or MSVC).
ICC compiles it but chokes when you try to use it as part of an initializer for a constexpr __m128i
.
constexpr
__m128i pcmpeqd(__m128i a, __m128i b) {
return (v4si)a == (v4si)b; // fine with gcc and ICC
//return (__m128i)__builtin_ia32_pcmpeqd128((v4si)a, (v4si)b); // bad with ICC
//return _mm_cmpeq_epi32(a,b); // not constexpr-compatible
}
See it on the Godbolt compiler explorer, with two test callers (one with variables, one with
constexpr __m128i v1 {0x100000000, 0x300000002};
inputs). Interestingly, ICC doesn't do constant-propagation through pcmpeqd
or _mm_cmpeq_epi32
; it loads two constants and uses and actual pcmpeqd
, even with optimization enabled. The same thing happens with/without constexpr.I think it normally optimizes
gcc does accept
constexpr __m128i vector_const { pcmpeqd(__m128i{0,0}, __m128i{-1,-1}) };
GCC (but not clang) treats __builtin_ia32
functions as constexpr
-compatible. The documentation for GNU C x86 built-in functions doesn't mention this, but probably only because it's C documentation, not C++.
GNU C native vector syntax is also constexpr
-compatible; that's a second option that's again only viable if you don't care about MSVC.
GNU C defines __m128i
as a vector of two long long
elements. So for integer SIMD, you need to define other types (or use the types defined by gcc/clang/ICC's immintrin.h
(The only weird thing is that static const __m128i foo = _mm_set1_epi32(2);
doesn't turn into a constant initializer; it copies from .rodata
at runtime, and thus is terrible, using a guard variable which is checked on every function call to see if the variable needs to be statically initialized.)
GCC's xmmintrin.h
and emmintrin.h
define Intel intrinsics in terms of native vector operators (like *
) or __builtin_ia32
functions. It looks like they prefer using operators when possible, instead of (__m128i)__builtin_ia32_pcmpeqd128((v4si)a, (v4si)b);
gcc does require explicit casts between different vector types.
From gcc7.3's emmintrin.h
(SSE2):
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_cmpeq_epi32 (__m128i __A, __m128i __B)
{
return (__m128i) ((__v4si)__A == (__v4si)__B);
}
#ifdef __OPTIMIZE__
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_shuffle_epi32 (__m128i __A, const int __mask)
{
return (__m128i)__builtin_ia32_pshufd ((__v4si)__A, __mask);
}
#else
#define _mm_shuffle_epi32(A, N) \
((__m128i)__builtin_ia32_pshufd ((__v4si)(__m128i)(A), (int)(N)))
#endif
Interesting: gcc's header avoids an inline function in some cases if compiling with optimization disabled. I guess this leads to better debug symbols, so you don't single-step into the definition of the inline function (which does happen when using stepi
in GDB in optimized code with a TUI source window showing.)
__builtin_constant_p
extension which allows to use tricks like#define FOO(x) (__builtin_constant_p(x) ? foo_constexpr(x) : foo_asm(x))
- ifx
is can be evaluated as a constant by the compiler then pure C++ implementation will be used allowing further inlining and compile-time optimizations. – Ulick_mm_cmpeq_epi32
is just a thin wrapper around__builtin_ia32_pcmpeqd128
, not inline asm. That's why the compiler can optimize intrinsics when operands are compile-time constants. (Or just generally optimize, especially clang has a good shuffle optimizer.) But those builtins aren't constexpr either, I don't think. – Econstexpr
function at compile time. If your compiler does not know how to evaluate some/any/all SIMD builtins, functions using those cannot beconstexpr
. Notice that evaluating a function at compile time is quite different from compiling a function; you could be cross-compiling for another platform so the compiler might not even be able to run the function after compilation to get its value. Hence, there would be special emulation code needed for the compiler to emulate the function in 'plain C++' which is apparently not there. – Luffa+
or==
operator on integers, but gcc unfortunately chose not to define it that way. I'm not sure about GNU C native vector syntax; that might work as constexpr. likeconstexpr __m128 foo(__m128 a, __m128 b){ return a+b; }
. – E__builtin_ia32
functions at compile time; it can do constant-propagation through them, just like the+
operator for scalarint
for example. It's purely an unfortunate issue of C++ syntax and how things are declared. (The only weird thing is thatstatic const __m128 foo = _mm_set1_ps(2.0f);
doesn't turn into a constant initializer; it copies from.rodata
at runtime, and thus is terrible.) – Ereturn _pdep_u32(0x1234, 0xaaaa);
(a pretty good example of a complex operation), and basically all SIMD functions including_mm_shuffle_epi8
(pshufb), IIRC. The "later stage" issue might be what's going on forstatic foo = _mm_set_epi32()
though? Instead of a constant in read-only memory, you get a runtime copy from a.LC0
vector constant to the named static storage. /facepalm. – E