Rename your function to not clash with intrinsics
Like lzcnt_epi32_sse2
or just lzcnt_epi32
. The epi32
is already enough to remind you it's related to Intel intrinsics like taking an __m128i
arg, but the lack of _mm
in the name lets you know it's just a function, and not one of Intel's SVML functions or something.
If you want to mix vector widths and need to distinguish that in your helper functions (since C doesn't allow overloading), perhaps __m1128i lzcntd_m128i( __m128i v );
. I've also seen names like mm_lzcnt_epi32
without the leading _
, but it would be very easy to miss that when reading.
static inline
__m128i lzcnt_epi32(__m128i v) {
#ifdef __AVX512VL__ // and __AVX512CD__ but that's effectively baseline
return _mm_lzcnt_epi32(v); // use the HW version if build options allow
#else
// Based on https://mcmap.net/q/1157338/-count-leading-zero-bits-for-each-element-in-avx2-vector-emulate-_mm256_lzcnt_epi32
// prevent value from being rounded up to the next power of two
v = _mm_andnot_si128(_mm_srli_epi32(v, 8), v); // keep 8 MSB
v = _mm_castps_si128(_mm_cvtepi32_ps(v)); // convert signed integer to float ??
v = _mm_srli_epi32(v, 23); // shift down the exponent
v = _mm_subs_epu16(_mm_set1_epi32(158), v); // undo bias
v = _mm_min_epi16(v, _mm_set1_epi32(32)); // clamp at 32
return v;
#endif
}
Don't define your own functions with names that start with _
, those are reserved for use by the implementation. That reserved part of the namespace is a reasonable place for non-portable extensions that won't clash with any existing code, which is probably why Intel chose it for their intrinsics. (What are the rules about using an underscore in a C++ identifier? - C has pretty much the same rules as C++ for this, IIRC. Since your definition isn't static
, it's in the global namespace where _anything
is reserved.
Not that I'd recommend static inline
with clashing names.)
Don't follow their naming scheme for your own functions that take __m128i
args, and definitely never define your own version of an intrinsic. Those do get defined even without -mavx512vl
enabled globally, so they're usable inside functions that use __attribute__((target("avx512vl")))
, and unfortunately you end up with silent use of ISA extensions you didn't want, with no good way for GCC to detect a potential problem to even warn about it, I think.
The intrinsic's definition
_mm_lzcnt_epi32
is a real intrinsic for an AVX-512 instruction. It's declared and defined in a GCC header as an extern inline
wrapper function (around a GNU C __builtin
) inside a #pragma GCC target("avx512vl,avx512cd")
block, with __attribute__((always_inline))
. (If avx512vl
wasn't enabled globally, it will #pragma GCC pop_options
afterwards so it's only enabled for that block of definitions.)
Apparently the target-attribute part of the declaration sticks, but not the always-inline attribute which normally makes inlining fail with a compile-time error. This part may be a GCC bug. And somehow it's not an error to redefine the function, because of the gnu_inline
attribute in the header's definition1. It is an error with clang which uses different headers.
So you get a call _mm_lzcnt_epi32
in main
to a non-inline function that uses AVX-512 instructions. (Yes, GCC9.4 uses EVEX vmovdqa64 xmm1, xmm0
as well as VEX vpsrld xmm0, xmm0, 8
, as you show in your code block. This is a missed-optimization bug that was fixed in GCC10: vmovdqa xmm1, xmm0
is fewer bytes of machine code. Although I think the whole copy is avoidable by shifting into a separate destination so there is still a missed optimization, but GCC10 makes asm that will run on Godbolt's Zen 3 AWS instances, not just its SKX / Ice Lake instances.)
This is what's supposed to happen with arg = _mm_lzcnt_epi32(arg);
if you haven't defined your own version of it - a "target-specific options mismatch" error:
/opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/avx512vlintrin.h:8376:1: error: inlining failed in call to always_inline '_mm_lzcnt_epi32': target specific option mismatch
8376 | _mm_lzcnt_epi32 (__m128i __A)
| ^~~~~~~~~~~~~~~
<source>:28:9: note: called from here
28 | arg = _mm_lzcnt_epi32(arg);
| ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/immintrin.h:63,
from <source>:5:
/opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/avx512vlintrin.h:8376:1: error: inlining failed in call to always_inline '_mm_lzcnt_epi32': target specific option mismatch
8376 | _mm_lzcnt_epi32 (__m128i __A)
| ^~~~~~~~~~~~~~~
<source>:28:9: note: called from here
28 | arg = _mm_lzcnt_epi32(arg);
| ^~~~~~~~~~~~~~~~~~~~
Compiler returned: 1
Or if you use the raw builtin manually:
<source>:29:18: error: '__builtin_ia32_vplzcntd_128_mask' needs isa option -mavx512vl -mavx512cd
29 | arg = (__m128i)__builtin_ia32_vplzcntd_128_mask((__v4si)arg, (__v4si)_mm_setzero_si128(), -1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Note that -msse2
is baseline for x86-64. You only need to enable it if targeting -m32
with a GCC config that doesn't do that by default. It doesn't do any harm for x86-64, but it also doesn't override AVX enabled by any earlier options like -march=x86-64-v3
or -mavx
. For that you want -mno-avx
. But that just sets the baseline for all code: pragma and per-function __attribute__
can still enable use of later ISA extensions for specific functions. gcc -msse2 -mno-avx
is equivalent to the default and won't help work around this bug of naming a function that clashes with an intrinsic.
Some Linux distros are planning to ship versions that are built with -march=x86-64-v3
(Haswell baseline: AVX2+FMA+BMI2, wikipedia) although IDK if they're planning to configure GCC with that higher baseline as a no-options default the way many do for SSE2 with gcc -m32
. But your GCC 9.4.0-1ubuntu1~20.04.1 is definitely not configured that way, and what I can see on Godbolt matches what you report your GCC doing.
Which CPUs is this relevant for?
You could argue that it isn't worth worrying about this issue specifically because nobody will be using such old hardware for anything serious,
First of all, your code uses AVX-512 instructions (vmovdqa64
) and will crash on Intel's latest desktop / laptop CPUs because they removed AVX-512 before defining a way (AVX10.1) to expose 128 and 256-bit EVEX instructions with all the great new features like masking, better shuffles, vpternlogd
, and niche instructions like vplzcntd
. They'll run fine on Zen 4, though.
Secondly, low-power Intel CPUs based on Tremont and earlier lack AVX/BMI, so there are recent low-power servers and low-end netbooks without AVX.
Also, Intel Pentium and Celeron before Ice Lake had AVX+BMI disabled. (BMI perhaps a victim of disabling decode of VEX prefixes as a way to disable AVX+FMA?) This was pretty bad, not helping the x86 ecosystem get closer to making BMI (or AVX) baseline. BMI1/BMI2 are most useful if used everywhere for stuff like more efficient variable-count shifts, not just in a couple hot loops like SIMD.
(Ice Lake Pentium/Celeron are still half-width, but that means 256-bit so x86-64-v3 without AVX-512. Low-end / low-power Alder Lake N has all Gracemont E-cores but that's the same x86-64-v3 feature level as their P-cores, thanks to Intel crippling the AVX-512 on the P-cores even in systems with no E-cores, while enhancing their E-cores to add x86-64-v3 features.)
Footnote 1: No redefinition error?
It seems that __attribute__((__gnu_inline__))
is responsible for allowing a second definition. GCC compiles this without complaint:
__attribute__ ((__gnu_inline__))
extern __inline int foo (int x) {
return x+1;
}
int foo(int x) { return x + 2; }
(__gnu_inline__
is a version of gnu_inline
that doesn't pollute the namespace, for use in -std=gnu11
mode, like __asm__
vs. asm
. Most GNU keywords have an __x__
version so headers don't break even if user code did a #define
on any non-reserved part of the namespace.)
From the GCC manual: function attributes:
gnu_inline
This attribute should be used with a function that is also declared with the inline keyword. It directs GCC to treat the function as if it were defined in gnu90 mode even when compiling in C99 or gnu99 mode.
If the function is declared extern, then this definition of the function is used only for inlining. In no case is the function compiled as a standalone function, not even if you take its address explicitly. Such an address becomes an external reference, as if you had only declared the function, and had not defined it. This has almost the effect of a macro. The way to use this is to put a function definition in a header file with this attribute, and put another copy of the function, without extern, in a library file. The definition in the header file causes most calls to the function to be inlined. If any uses of the function remain, they refer to the single copy in the library. Note that the two definitions of the functions need not be precisely the same, although if they do not have the same effect your program may behave oddly.
In C, if the function is neither extern nor static, then the function is compiled as a standalone function, as well as being inlined where possible.
This is how GCC traditionally handled functions declared inline. Since ISO C99 specifies a different semantics for inline, this function attribute is provided as a transition measure and as a useful feature in its own right. This attribute is available in GCC 4.1.3 and later. It is available if either of the preprocessor macros __GNUC_GNU_INLINE__
or __GNUC_STDC_INLINE__
are defined. See An Inline Function is As Fast As a Macro.
In C++, this attribute does not depend on extern in any way, but it still requires the inline keyword to enable its special behavior.
So I guess the version in the header wasn't a candidate for inlining because of mismatching target options, but providing a non-inline
definition let GCC call it anyway. So this might not be a GCC bug. And it's probably not something GCC should even warn about since most .c
files that provide the non-inline definition (if there is one; not the case for intrinsics I assume) will include the header that defines the extern inline
version.
Even if it were or is a bug that GCC didn't error or warn about this, don't define your own functions in a reserved part of the namespace in the first place. The most we could hope for is GCC being more helpful like erroring at compile-time instead of silently making a binary you didn't intend.
The behaviour is undefined in this case (defining functions with reserved names). Perhaps GCC could warn if it differentiated based on path, knowing which headers were "part of the implementation" vs. 3rd-party libraries. But I think glibc also uses plenty of __
names in headers in /usr/include
, so I don't think that's viable.
At first I thought GCC was allowing it because different target attributes on definitions for the same name is how GCC does function multiversioning. But this is different. If it was doing multiversioning, it would be using a non-AVX512 version because main
was compiled with just SSE2 in effect. The test-case above compiles with just gnu_inline
, no target-attribute stuff required.
#ifdef __AVX__ #error "avx active" #endif
(in separate lines of course) block at several points in your source? – Seventeen-march=native
by chance? – Sclerophyll_
or names containing__
anywhere are actually reserved for the compiler). – Seventeen-msse2
doesn't override other options, it just adds SSE2 and earlier to the set of already enabled options. It doesn't imply-mno-avx
. If you want-mno-avx
, use-mno-avx
, or change your other options to not enable it in the first place, like-march=x86-64-v2
(SSE4.2) instead of-march=x86-64-v3
(AVX2+FMA+BMI2). Or even baseline-march=x86-64
with just SSE2. (These-march
options also don't override-mavx
, e.g.-mavx -march=nehalem
still enables AVX plus everything Nehalem has, and sets-mtune=nehalem
) – Dilantin-march=x86-64-v3
since most new CPUs have had that feature level since about 2013. (But not all, e.g. pre Ice Lake, Pentium and Celeron models omitted AVX+BMI. And until Gracemont, low-power Intel cores were also just SSE4.2.) – Dilantin-mavx
, such as from-march=native
. – Dilantin-msse2
doesn't enable AVX, but it won't disable it if it was already enabled (e.g. as part of how GCC was configured, or by an earlier arg). – Dilantin