C program compiled with gcc -msse2 contains AVX1 instructions
Asked Answered
B

1

6

I adapted a function I found on SO for SSE2 and included it in my program. The function uses SSE2 intrinsics to calculate the leading zero count of each of the 8 x 16bit integers in the vector. When I compiled the program, which produced no warnings, and ran it on my old laptop which I often use for development, it crashed with the message 'Illegal instruction (core dumped)'. On inspecting the assembly, I noticed my function appeared to have AVX1 'VEX' encoded SSE2 instructions as shown below.

    .globl  _mm_lzcnt_epi32
    .type   _mm_lzcnt_epi32, @function
_mm_lzcnt_epi32:
.LFB5318:
    .cfi_startproc
    endbr64
    vmovdqa64   %xmm0, %xmm1
    vpsrld  $8, %xmm0, %xmm0
    vpandn  %xmm1, %xmm0, %xmm0
    vmovdqa64   .LC0(%rip), %xmm1
    vcvtdq2ps   %xmm0, %xmm0
    vpsrld  $23, %xmm0, %xmm0
    vpsubusw    %xmm0, %xmm1, %xmm0
    vpminsw .LC1(%rip), %xmm0, %xmm0
    ret
    .cfi_endproc

If I change the header immintrin.h to emmintrin.h, it compiles my code properly to SSE2 instructions

    .globl  _mm_lzcnt_epi32
    .type   _mm_lzcnt_epi32, @function
_mm_lzcnt_epi32:
.LFB567:
    .cfi_startproc
    endbr64
    movdqa  %xmm0, %xmm1
    psrld   $8, %xmm0
    pandn   %xmm1, %xmm0
    cvtdq2ps    %xmm0, %xmm1
    movdqa  .LC0(%rip), %xmm0
    psrld   $23, %xmm1
    psubusw %xmm1, %xmm0
    pminsw  .LC1(%rip), %xmm0
    ret
    .cfi_endproc

and runs successfully. My program is as follows.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <immintrin.h>

// gcc ssebug.c -o ssebug.bin -O3 -msse2 -Wall

__m128i _mm_lzcnt_epi32(__m128i v) {
    // Based on https://mcmap.net/q/1157338/-count-leading-zero-bits-for-each-element-in-avx2-vector-emulate-_mm256_lzcnt_epi32
    // prevent value from being rounded up to the next power of two
    v = _mm_andnot_si128(_mm_srli_epi32(v, 8), v); // keep 8 MSB
    v = _mm_castps_si128(_mm_cvtepi32_ps(v)); // convert signed integer to float ??
    v = _mm_srli_epi32(v, 23); // shift down the exponent
    v = _mm_subs_epu16(_mm_set1_epi32(158), v); // undo bias
    v = _mm_min_epi16(v, _mm_set1_epi32(32)); // clamp at 32
    return v;
}

int main(int argc, char **argv) {
  uint32_t i, a[4];  
  __m128i arg;
  uint32_t argval = 123;
  if (argc >= 2) argval = atoi(argv[1]);
  arg = _mm_set1_epi32(argval);
  arg = _mm_lzcnt_epi32(arg);
  _mm_storeu_si128((void*)a, arg);
  for(i=0; i<4; i++) {
    printf("%u ", a[i]);
  }
  printf("\n");
}

This explanation, Header files for x86 SIMD intrinsics, appears to suggest that for gcc at least, it is safe to just use immintrin.h for everything, which appears to be false. My questions are as follows.

  1. Is it supposed to be safe to use immintrin.h for everything, or does using it tell the compiler to assume at least AVX1?

  2. Isn't it the compiler's responsibility to produce ONLY instructions which are valid for the target architecture? If not, why not?

  3. Why does it work (produce only SSE2) if I use immintrin.h but make my function static inline?

  4. Is there a way to scan an assembly file to reveal what CPU extensions it requires?

  5. Who should I contact about such issues in future?

I think this is potentially quite a serious issue as it isn't always feasible to check the assembler contains only valid instructions for the target architecture. I only found this because my program crashed, and I was using an old machine which doesn't support AVX1. If the function was in some hardly ever executed branch, I might have missed it. You could argue that it isn't worth worrying about this issue specifically because nobody will be using such old hardware for anything serious, but the issues it raises could well apply to newer architectures too. Thanks for your time. I am using gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0.

Boyar answered 31/12, 2023 at 12:40 Comment(15)
C and C++ are not the same language. Please don't tag both languages.Resht
@JosephLarson I thought it was relevant to the c++ community too.Boyar
@SimonGoater Only tag C++ if you program in C++, not because "it's C nd that is close enough."Sclerophyll
Could you provide the exact commands you used to compile your code?Sclerophyll
@Sclerophyll I used the command shown in the comment in the program.Boyar
Can you put a #ifdef __AVX__ #error "avx active" #endif (in separate lines of course) block at several points in your source?Seventeen
@Seventeen It displays the error if I do that. I didn't know but now I think the problem is related to the fact that _mm_lzcnt_epi32 is an AVX512 intrinsic included in avx512vlintrin.h.Boyar
@SimonGoater This indicates that AVX is enabled by default in your version of gcc. What operating system are you programming on an where did you obtain gcc from? Do you supply -march=native by chance?Sclerophyll
Apologies, I think I misunderstood what chtz was asking of me. I tried putting the macro before the function, in the function, after the function, in main, and after main and I don't get the error message when I compile with -msse2, but I still get AVX instructions. With -mavx I get the message. Have you not been able to re-create the issue?Boyar
I can indeed reproduce your problem on godbolt: godbolt.org/z/ro1GP3M64 Clang actually fails to compile due to a redefinition. The solution is easy: Rename your function (starting names with _ or names containing __ anywhere are actually reserved for the compiler).Seventeen
-msse2 doesn't override other options, it just adds SSE2 and earlier to the set of already enabled options. It doesn't imply -mno-avx. If you want -mno-avx, use -mno-avx, or change your other options to not enable it in the first place, like -march=x86-64-v2 (SSE4.2) instead of -march=x86-64-v3 (AVX2+FMA+BMI2). Or even baseline -march=x86-64 with just SSE2. (These -march options also don't override -mavx, e.g. -mavx -march=nehalem still enables AVX plus everything Nehalem has, and sets -mtune=nehalem)Dilantin
I've read that a few distros are looking at shipping a version with -march=x86-64-v3 since most new CPUs have had that feature level since about 2013. (But not all, e.g. pre Ice Lake, Pentium and Celeron models omitted AVX+BMI. And until Gracemont, low-power Intel cores were also just SSE4.2.)Dilantin
Your GCC version is 9.4.0-1ubuntu1~20.04.1 so it's definitely not enabling AVX2 on its own. It must be running with options that include -mavx, such as from -march=native.Dilantin
@PeterCordes Surely -msse2 alone does imply -mno-avx? It would be shocking to me otherwise as compilers would be creating binaries containing instructions that are invalid for the target architectures all the time. Incidentally, compiling my program with -mno-avx still produces illegal instructions. Did you mean it won't reduce the baseline set for the compiler if that it higher than sse2?Boyar
Your last sentence is what I meant. -msse2 doesn't enable AVX, but it won't disable it if it was already enabled (e.g. as part of how GCC was configured, or by an earlier arg).Dilantin
D
7

Rename your function to not clash with intrinsics

Like lzcnt_epi32_sse2 or just lzcnt_epi32. The epi32 is already enough to remind you it's related to Intel intrinsics like taking an __m128i arg, but the lack of _mm in the name lets you know it's just a function, and not one of Intel's SVML functions or something.

If you want to mix vector widths and need to distinguish that in your helper functions (since C doesn't allow overloading), perhaps __m1128i lzcntd_m128i( __m128i v );. I've also seen names like mm_lzcnt_epi32 without the leading _, but it would be very easy to miss that when reading.

static inline
__m128i lzcnt_epi32(__m128i v) {
#ifdef __AVX512VL__  // and __AVX512CD__ but that's effectively baseline
    return _mm_lzcnt_epi32(v);  // use the HW version if build options allow
#else
    // Based on https://mcmap.net/q/1157338/-count-leading-zero-bits-for-each-element-in-avx2-vector-emulate-_mm256_lzcnt_epi32
    // prevent value from being rounded up to the next power of two
    v = _mm_andnot_si128(_mm_srli_epi32(v, 8), v); // keep 8 MSB
    v = _mm_castps_si128(_mm_cvtepi32_ps(v)); // convert signed integer to float ??
    v = _mm_srli_epi32(v, 23); // shift down the exponent
    v = _mm_subs_epu16(_mm_set1_epi32(158), v); // undo bias
    v = _mm_min_epi16(v, _mm_set1_epi32(32)); // clamp at 32
    return v;
#endif
}

Don't define your own functions with names that start with _, those are reserved for use by the implementation. That reserved part of the namespace is a reasonable place for non-portable extensions that won't clash with any existing code, which is probably why Intel chose it for their intrinsics. (What are the rules about using an underscore in a C++ identifier? - C has pretty much the same rules as C++ for this, IIRC. Since your definition isn't static, it's in the global namespace where _anything is reserved. Not that I'd recommend static inline with clashing names.)

Don't follow their naming scheme for your own functions that take __m128i args, and definitely never define your own version of an intrinsic. Those do get defined even without -mavx512vl enabled globally, so they're usable inside functions that use __attribute__((target("avx512vl"))), and unfortunately you end up with silent use of ISA extensions you didn't want, with no good way for GCC to detect a potential problem to even warn about it, I think.


The intrinsic's definition

_mm_lzcnt_epi32 is a real intrinsic for an AVX-512 instruction. It's declared and defined in a GCC header as an extern inline wrapper function (around a GNU C __builtin) inside a #pragma GCC target("avx512vl,avx512cd") block, with __attribute__((always_inline)). (If avx512vl wasn't enabled globally, it will #pragma GCC pop_options afterwards so it's only enabled for that block of definitions.)

Apparently the target-attribute part of the declaration sticks, but not the always-inline attribute which normally makes inlining fail with a compile-time error. This part may be a GCC bug. And somehow it's not an error to redefine the function, because of the gnu_inline attribute in the header's definition1. It is an error with clang which uses different headers.

So you get a call _mm_lzcnt_epi32 in main to a non-inline function that uses AVX-512 instructions. (Yes, GCC9.4 uses EVEX vmovdqa64 xmm1, xmm0 as well as VEX vpsrld xmm0, xmm0, 8, as you show in your code block. This is a missed-optimization bug that was fixed in GCC10: vmovdqa xmm1, xmm0 is fewer bytes of machine code. Although I think the whole copy is avoidable by shifting into a separate destination so there is still a missed optimization, but GCC10 makes asm that will run on Godbolt's Zen 3 AWS instances, not just its SKX / Ice Lake instances.)


This is what's supposed to happen with arg = _mm_lzcnt_epi32(arg); if you haven't defined your own version of it - a "target-specific options mismatch" error:

/opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/avx512vlintrin.h:8376:1: error: inlining failed in call to always_inline '_mm_lzcnt_epi32': target specific option mismatch
 8376 | _mm_lzcnt_epi32 (__m128i __A)
      | ^~~~~~~~~~~~~~~
<source>:28:9: note: called from here
   28 |   arg = _mm_lzcnt_epi32(arg);
      |         ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/immintrin.h:63,
                 from <source>:5:
/opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/avx512vlintrin.h:8376:1: error: inlining failed in call to always_inline '_mm_lzcnt_epi32': target specific option mismatch
 8376 | _mm_lzcnt_epi32 (__m128i __A)
      | ^~~~~~~~~~~~~~~
<source>:28:9: note: called from here
   28 |   arg = _mm_lzcnt_epi32(arg);
      |         ^~~~~~~~~~~~~~~~~~~~
Compiler returned: 1

Or if you use the raw builtin manually:

<source>:29:18: error: '__builtin_ia32_vplzcntd_128_mask' needs isa option -mavx512vl -mavx512cd
   29 |   arg = (__m128i)__builtin_ia32_vplzcntd_128_mask((__v4si)arg, (__v4si)_mm_setzero_si128(), -1);
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note that -msse2 is baseline for x86-64. You only need to enable it if targeting -m32 with a GCC config that doesn't do that by default. It doesn't do any harm for x86-64, but it also doesn't override AVX enabled by any earlier options like -march=x86-64-v3 or -mavx. For that you want -mno-avx. But that just sets the baseline for all code: pragma and per-function __attribute__ can still enable use of later ISA extensions for specific functions. gcc -msse2 -mno-avx is equivalent to the default and won't help work around this bug of naming a function that clashes with an intrinsic.

Some Linux distros are planning to ship versions that are built with -march=x86-64-v3 (Haswell baseline: AVX2+FMA+BMI2, wikipedia) although IDK if they're planning to configure GCC with that higher baseline as a no-options default the way many do for SSE2 with gcc -m32. But your GCC 9.4.0-1ubuntu1~20.04.1 is definitely not configured that way, and what I can see on Godbolt matches what you report your GCC doing.

Which CPUs is this relevant for?

You could argue that it isn't worth worrying about this issue specifically because nobody will be using such old hardware for anything serious,

First of all, your code uses AVX-512 instructions (vmovdqa64) and will crash on Intel's latest desktop / laptop CPUs because they removed AVX-512 before defining a way (AVX10.1) to expose 128 and 256-bit EVEX instructions with all the great new features like masking, better shuffles, vpternlogd, and niche instructions like vplzcntd. They'll run fine on Zen 4, though.

Secondly, low-power Intel CPUs based on Tremont and earlier lack AVX/BMI, so there are recent low-power servers and low-end netbooks without AVX.

Also, Intel Pentium and Celeron before Ice Lake had AVX+BMI disabled. (BMI perhaps a victim of disabling decode of VEX prefixes as a way to disable AVX+FMA?) This was pretty bad, not helping the x86 ecosystem get closer to making BMI (or AVX) baseline. BMI1/BMI2 are most useful if used everywhere for stuff like more efficient variable-count shifts, not just in a couple hot loops like SIMD.

(Ice Lake Pentium/Celeron are still half-width, but that means 256-bit so x86-64-v3 without AVX-512. Low-end / low-power Alder Lake N has all Gracemont E-cores but that's the same x86-64-v3 feature level as their P-cores, thanks to Intel crippling the AVX-512 on the P-cores even in systems with no E-cores, while enhancing their E-cores to add x86-64-v3 features.)


Footnote 1: No redefinition error?

It seems that __attribute__((__gnu_inline__)) is responsible for allowing a second definition. GCC compiles this without complaint:

__attribute__ ((__gnu_inline__))
extern __inline int foo (int x) {
  return x+1;
}

int foo(int x) { return x + 2; }

(__gnu_inline__ is a version of gnu_inline that doesn't pollute the namespace, for use in -std=gnu11 mode, like __asm__ vs. asm. Most GNU keywords have an __x__ version so headers don't break even if user code did a #define on any non-reserved part of the namespace.)

From the GCC manual: function attributes:

gnu_inline

This attribute should be used with a function that is also declared with the inline keyword. It directs GCC to treat the function as if it were defined in gnu90 mode even when compiling in C99 or gnu99 mode.

If the function is declared extern, then this definition of the function is used only for inlining. In no case is the function compiled as a standalone function, not even if you take its address explicitly. Such an address becomes an external reference, as if you had only declared the function, and had not defined it. This has almost the effect of a macro. The way to use this is to put a function definition in a header file with this attribute, and put another copy of the function, without extern, in a library file. The definition in the header file causes most calls to the function to be inlined. If any uses of the function remain, they refer to the single copy in the library. Note that the two definitions of the functions need not be precisely the same, although if they do not have the same effect your program may behave oddly.

In C, if the function is neither extern nor static, then the function is compiled as a standalone function, as well as being inlined where possible.

This is how GCC traditionally handled functions declared inline. Since ISO C99 specifies a different semantics for inline, this function attribute is provided as a transition measure and as a useful feature in its own right. This attribute is available in GCC 4.1.3 and later. It is available if either of the preprocessor macros __GNUC_GNU_INLINE__ or __GNUC_STDC_INLINE__ are defined. See An Inline Function is As Fast As a Macro.

In C++, this attribute does not depend on extern in any way, but it still requires the inline keyword to enable its special behavior.

So I guess the version in the header wasn't a candidate for inlining because of mismatching target options, but providing a non-inline definition let GCC call it anyway. So this might not be a GCC bug. And it's probably not something GCC should even warn about since most .c files that provide the non-inline definition (if there is one; not the case for intrinsics I assume) will include the header that defines the extern inline version.

Even if it were or is a bug that GCC didn't error or warn about this, don't define your own functions in a reserved part of the namespace in the first place. The most we could hope for is GCC being more helpful like erroring at compile-time instead of silently making a binary you didn't intend.

The behaviour is undefined in this case (defining functions with reserved names). Perhaps GCC could warn if it differentiated based on path, knowing which headers were "part of the implementation" vs. 3rd-party libraries. But I think glibc also uses plenty of __ names in headers in /usr/include, so I don't think that's viable.


At first I thought GCC was allowing it because different target attributes on definitions for the same name is how GCC does function multiversioning. But this is different. If it was doing multiversioning, it would be using a non-AVX512 version because main was compiled with just SSE2 in effect. The test-case above compiles with just gnu_inline, no target-attribute stuff required.

Dilantin answered 1/1 at 5:17 Comment(11)
Note that only function names starting with __ or _ followed by an uppercase letter are reserved. Names starting with _ followed by a lowercase letter are not reserved per se.Sclerophyll
Thanks for taking the time to answer. I already found numerous ways of getting it to compile correctly, including making the function static inline as I mentioned in my question. The mm prefix tells you that it operates on at most 128 bit vectors which is not obvious from a name like lzcnt_epi32. Now I know it's an AVX512 intrinsic, I will include the macro so the AVX512 instruction gets used if it is available. I assume that means I have to rename the function so as to avoid an infinite loop in the preprocessor!Boyar
I guess it would be best to include the vector width and the extension required in the function name, something like _mm128_lzcnt_epi32_sse2. It seems conceivable to me however, that there could be code in the wild that defines functions with names of real intrinsics as it is not unusual to adopt the same naming convention as the Intel intrinsics. The SVML intrinsics are an example. What if I'd written my program before AVX512 was even a thing? The point is, the compiler should not do what it does, or there's something wrong with the way the header files are defined.Boyar
Great job finding the cause of the redefinition issue. I guess the question is now, is this a gcc bug or isn't it?Boyar
@fuz: The rule you quoted applies at any scope, including for local vars. The function in the question is a non-static function at global scope, where all names beginning with even a single _ are reserved. en.cppreference.com/w/c/language/identifier include the rule: All external identifiers that begin with an underscore.Dilantin
@SimonGoater: As I concluded in my answer, I think it's not a GCC bug that it doesn't warn, just a consequence of how gnu_inline works. Since it's probably normal for a .c with the non-inline definition to have included the .h with the gnu_inline def.Dilantin
@SimonGoater: What if I'd written my program before AVX512 was even a thing? - Then you still shouldn't have defined your own function with the same naming convention as Intel's intrinsics, to avoid future clashes, like my answer says. Especially a non-static one since even ISO C reserves that part of the namespace. I've thought that was a bad idea even before I (just now) found out how bad the symptoms were with GCC. If you want the naming to include something for 128-bit vector width (instead of that being the default), perhaps lzcntd_m128i - the asm mnemonic and the C operand type.Dilantin
@SimonGoater: If you like the mm convention, you could perhaps use mm_lzcnt_epi32 (no leading underscore), but that's hard to visually distinguish from the actual _mm_lzcnt_epi32 intrinsic.Dilantin
@SimonGoater: It might be a GCC bug that the #pragma target applies to the second definition where that's not in effect; perhaps that's worth a bug report if there isn't a duplicate. Even a warning about conflicting target options on the inline vs. non-extern non-inline definitions could be useful, so if you report it, make sure to give context of what you were doing and the problem you want to help future users avoid.Dilantin
@PeterCordes I accept that using the prefix _mm for a globally defined function is wrong and should be avoided. In my humble opinion though, the compiler should throw a compilation warning at the very least when I try to compile my program above, and preferably not compile at all. I don't think it's right to put the onus on the programmer to name functions a certain way any more than it is right to assume a user will only enter valid data into a program. Your answer is thorough and well-informed as usual and I'm happy to accept it despite those minor reservations.Boyar
@SimonGoater: Yeah, 100% agreed that the failure mode is bad enough that we'd like at least a warning. But I'm not sure we can get one because of the way gnu_inline needs to work, without creating false positives for normal use of it. Maybe based on the target options stuff: that could be a GCC bug, or at least something that wouldn't happen in normal use-cases of gnu_inline, so is something GCC could detect and warn about. I think it is worth opening an issue on GCC's bugzilla about it. (gcc.gnu.org/bugzilla)Dilantin

© 2022 - 2024 — McMap. All rights reserved.