C program compiled with gcc -msse2 contains AVX1 instructions

I adapted a function I found on SO for SSE2 and included it in my program. The function uses SSE2 intrinsics to calculate the leading zero count of each of the 8 x 16bit integers in the vector. When I compiled the program, which produced no warnings, and ran it on my old laptop which I often use for development, it crashed with the message 'Illegal instruction (core dumped)'. On inspecting the assembly, I noticed my function appeared to have AVX1 'VEX' encoded SSE2 instructions as shown below.

    .globl  _mm_lzcnt_epi32
    .type   _mm_lzcnt_epi32, @function
_mm_lzcnt_epi32:
.LFB5318:
    .cfi_startproc
    endbr64
    vmovdqa64   %xmm0, %xmm1
    vpsrld  $8, %xmm0, %xmm0
    vpandn  %xmm1, %xmm0, %xmm0
    vmovdqa64   .LC0(%rip), %xmm1
    vcvtdq2ps   %xmm0, %xmm0
    vpsrld  $23, %xmm0, %xmm0
    vpsubusw    %xmm0, %xmm1, %xmm0
    vpminsw .LC1(%rip), %xmm0, %xmm0
    ret
    .cfi_endproc

If I change the header immintrin.h to emmintrin.h, it compiles my code properly to SSE2 instructions

    .globl  _mm_lzcnt_epi32
    .type   _mm_lzcnt_epi32, @function
_mm_lzcnt_epi32:
.LFB567:
    .cfi_startproc
    endbr64
    movdqa  %xmm0, %xmm1
    psrld   $8, %xmm0
    pandn   %xmm1, %xmm0
    cvtdq2ps    %xmm0, %xmm1
    movdqa  .LC0(%rip), %xmm0
    psrld   $23, %xmm1
    psubusw %xmm1, %xmm0
    pminsw  .LC1(%rip), %xmm0
    ret
    .cfi_endproc

and runs successfully. My program is as follows.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <stdint.h>
#include <immintrin.h>

// gcc ssebug.c -o ssebug.bin -O3 -msse2 -Wall

__m128i _mm_lzcnt_epi32(__m128i v) {
    // Based on https://stackoverflow.com/questions/58823140/count-leading-zero-bits-for-each-element-in-avx2-vector-emulate-mm256-lzcnt-ep
    // prevent value from being rounded up to the next power of two
    v = _mm_andnot_si128(_mm_srli_epi32(v, 8), v); // keep 8 MSB
    v = _mm_castps_si128(_mm_cvtepi32_ps(v)); // convert signed integer to float ??
    v = _mm_srli_epi32(v, 23); // shift down the exponent
    v = _mm_subs_epu16(_mm_set1_epi32(158), v); // undo bias
    v = _mm_min_epi16(v, _mm_set1_epi32(32)); // clamp at 32
    return v;
}

int main(int argc, char **argv) {
  uint32_t i, a[4];  
  __m128i arg;
  uint32_t argval = 123;
  if (argc >= 2) argval = atoi(argv[1]);
  arg = _mm_set1_epi32(argval);
  arg = _mm_lzcnt_epi32(arg);
  _mm_storeu_si128((void*)a, arg);
  for(i=0; i<4; i++) {
    printf("%u ", a[i]);
  }
  printf("\n");
}

This explanation, Header files for x86 SIMD intrinsics, appears to suggest that for gcc at least, it is safe to just use immintrin.h for everything, which appears to be false. My questions are as follows.

Is it supposed to be safe to use immintrin.h for everything, or does using it tell the compiler to assume at least AVX1?
Isn't it the compiler's responsibility to produce ONLY instructions which are valid for the target architecture? If not, why not?
Why does it work (produce only SSE2) if I use immintrin.h but make my function static inline?
Is there a way to scan an assembly file to reveal what CPU extensions it requires?
Who should I contact about such issues in future?

I think this is potentially quite a serious issue as it isn't always feasible to check the assembler contains only valid instructions for the target architecture. I only found this because my program crashed, and I was using an old machine which doesn't support AVX1. If the function was in some hardly ever executed branch, I might have missed it. You could argue that it isn't worth worrying about this issue specifically because nobody will be using such old hardware for anything serious, but the issues it raises could well apply to newer architectures too. Thanks for your time. I am using gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0.

Solution

Rename your function to not clash with intrinsics

Like lzcnt_epi32_sse2 or just lzcnt_epi32. The epi32 is already enough to remind you it's related to Intel intrinsics like taking an __m128i arg, but the lack of _mm in the name lets you know it's just a function, and not one of Intel's SVML functions or something.

If you want to mix vector widths and need to distinguish that in your helper functions (since C doesn't allow overloading), perhaps __m1128i lzcntd_m128i( __m128i v );. I've also seen names like mm_lzcnt_epi32 without the leading _, but it would be very easy to miss that when reading.

static inline
__m128i lzcnt_epi32(__m128i v) {
#ifdef __AVX512VL__  // and __AVX512CD__ but that's effectively baseline
    return _mm_lzcnt_epi32(v);  // use the HW version if build options allow
#else
    // Based on https://stackoverflow.com/questions/58823140/count-leading-zero-bits-for-each-element-in-avx2-vector-emulate-mm256-lzcnt-ep
    // prevent value from being rounded up to the next power of two
    v = _mm_andnot_si128(_mm_srli_epi32(v, 8), v); // keep 8 MSB
    v = _mm_castps_si128(_mm_cvtepi32_ps(v)); // convert signed integer to float ??
    v = _mm_srli_epi32(v, 23); // shift down the exponent
    v = _mm_subs_epu16(_mm_set1_epi32(158), v); // undo bias
    v = _mm_min_epi16(v, _mm_set1_epi32(32)); // clamp at 32
    return v;
#endif
}

Don't define your own functions with names that start with _, those are reserved for use by the implementation. That reserved part of the namespace is a reasonable place for non-portable extensions that won't clash with any existing code, which is probably why Intel chose it for their intrinsics. (What are the rules about using an underscore in a C++ identifier? - C has pretty much the same rules as C++ for this, IIRC. Since your definition isn't static, it's in the global namespace where _anything is reserved. Not that I'd recommend static inline with clashing names.)

Don't follow their naming scheme for your own functions that take __m128i args, and definitely never define your own version of an intrinsic. Those do get defined even without -mavx512vl enabled globally, so they're usable inside functions that use __attribute__((target("avx512vl"))), and unfortunately you end up with silent use of ISA extensions you didn't want, with no good way for GCC to detect a potential problem to even warn about it, I think.

The intrinsic's definition

_mm_lzcnt_epi32 is a real intrinsic for an AVX-512 instruction. It's declared and defined in a GCC header as an extern inline wrapper function (around a GNU C __builtin) inside a #pragma GCC target("avx512vl,avx512cd") block, with __attribute__((always_inline)). (If avx512vl wasn't enabled globally, it will #pragma GCC pop_options afterwards so it's only enabled for that block of definitions.)

Apparently the target-attribute part of the declaration sticks, but not the always-inline attribute which normally makes inlining fail with a compile-time error. This part may be a GCC bug. And somehow it's not an error to redefine the function, because of the gnu_inline attribute in the header's definition¹. It is an error with clang which uses different headers.

So you get a call _mm_lzcnt_epi32 in main to a non-inline function that uses AVX-512 instructions. (Yes, GCC9.4 uses EVEX vmovdqa64 xmm1, xmm0 as well as VEX vpsrld xmm0, xmm0, 8, as you show in your code block. This is a missed-optimization bug that was fixed in GCC10: vmovdqa xmm1, xmm0 is fewer bytes of machine code. Although I think the whole copy is avoidable by shifting into a separate destination so there is still a missed optimization, but GCC10 makes asm that will run on Godbolt's Zen 3 AWS instances, not just its SKX / Ice Lake instances.)

This is what's supposed to happen with arg = _mm_lzcnt_epi32(arg); if you haven't defined your own version of it - a "target-specific options mismatch" error:

/opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/avx512vlintrin.h:8376:1: error: inlining failed in call to always_inline '_mm_lzcnt_epi32': target specific option mismatch
 8376 | _mm_lzcnt_epi32 (__m128i __A)
      | ^~~~~~~~~~~~~~~
<source>:28:9: note: called from here
   28 |   arg = _mm_lzcnt_epi32(arg);
      |         ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/immintrin.h:63,
                 from <source>:5:
/opt/compiler-explorer/gcc-9.4.0/lib/gcc/x86_64-linux-gnu/9.4.0/include/avx512vlintrin.h:8376:1: error: inlining failed in call to always_inline '_mm_lzcnt_epi32': target specific option mismatch
 8376 | _mm_lzcnt_epi32 (__m128i __A)
      | ^~~~~~~~~~~~~~~
<source>:28:9: note: called from here
   28 |   arg = _mm_lzcnt_epi32(arg);
      |         ^~~~~~~~~~~~~~~~~~~~
Compiler returned: 1

Or if you use the raw builtin manually:

<source>:29:18: error: '__builtin_ia32_vplzcntd_128_mask' needs isa option -mavx512vl -mavx512cd
   29 |   arg = (__m128i)__builtin_ia32_vplzcntd_128_mask((__v4si)arg, (__v4si)_mm_setzero_si128(), -1);
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note that -msse2 is baseline for x86-64. You only need to enable it if targeting -m32 with a GCC config that doesn't do that by default. It doesn't do any harm for x86-64, but it also doesn't override AVX enabled by any earlier options like -march=x86-64-v3 or -mavx. For that you want -mno-avx. But that just sets the baseline for all code: pragma and per-function __attribute__ can still enable use of later ISA extensions for specific functions. gcc -msse2 -mno-avx is equivalent to the default and won't help work around this bug of naming a function that clashes with an intrinsic.

Some Linux distros are planning to ship versions that are built with -march=x86-64-v3 (Haswell baseline: AVX2+FMA+BMI2, wikipedia) although IDK if they're planning to configure GCC with that higher baseline as a no-options default the way many do for SSE2 with gcc -m32. But your GCC 9.4.0-1ubuntu1~20.04.1 is definitely not configured that way, and what I can see on Godbolt matches what you report your GCC doing.

Which CPUs is this relevant for?

You could argue that it isn't worth worrying about this issue specifically because nobody will be using such old hardware for anything serious,

First of all, your code uses AVX-512 instructions (vmovdqa64) and will crash on Intel's latest desktop / laptop CPUs because they removed AVX-512 before defining a way (AVX10.1) to expose 128 and 256-bit EVEX instructions with all the great new features like masking, better shuffles, vpternlogd, and niche instructions like vplzcntd. They'll run fine on Zen 4, though.

Secondly, low-power Intel CPUs based on Tremont and earlier lack AVX/BMI, so there are recent low-power servers and low-end netbooks without AVX.

Also, Intel Pentium and Celeron before Ice Lake had AVX+BMI disabled. (BMI perhaps a victim of disabling decode of VEX prefixes as a way to disable AVX+FMA?) This was pretty bad, not helping the x86 ecosystem get closer to making BMI (or AVX) baseline. BMI1/BMI2 are most useful if used everywhere for stuff like more efficient variable-count shifts, not just in a couple hot loops like SIMD.

(Ice Lake Pentium/Celeron are still half-width, but that means 256-bit so x86-64-v3 without AVX-512. Low-end / low-power Alder Lake N has all Gracemont E-cores but that's the same x86-64-v3 feature level as their P-cores, thanks to Intel crippling the AVX-512 on the P-cores even in systems with no E-cores, while enhancing their E-cores to add x86-64-v3 features.)

Footnote 1: No redefinition error?

It seems that __attribute__((__gnu_inline__)) is responsible for allowing a second definition. GCC compiles this without complaint:

__attribute__ ((__gnu_inline__))
extern __inline int foo (int x) {
  return x+1;
}

int foo(int x) { return x + 2; }

(__gnu_inline__ is a version of gnu_inline that doesn't pollute the namespace, for use in -std=gnu11 mode, like __asm__ vs. asm. Most GNU keywords have an __x__ version so headers don't break even if user code did a #define on any non-reserved part of the namespace.)

From the GCC manual: function attributes:

gnu_inline

This attribute should be used with a function that is also declared with the inline keyword. It directs GCC to treat the function as if it were defined in gnu90 mode even when compiling in C99 or gnu99 mode.

If the function is declared extern, then this definition of the function is used only for inlining. In no case is the function compiled as a standalone function, not even if you take its address explicitly. Such an address becomes an external reference, as if you had only declared the function, and had not defined it. This has almost the effect of a macro. The way to use this is to put a function definition in a header file with this attribute, and put another copy of the function, without extern, in a library file. The definition in the header file causes most calls to the function to be inlined. If any uses of the function remain, they refer to the single copy in the library. Note that the two definitions of the functions need not be precisely the same, although if they do not have the same effect your program may behave oddly.

In C, if the function is neither extern nor static, then the function is compiled as a standalone function, as well as being inlined where possible.

This is how GCC traditionally handled functions declared inline. Since ISO C99 specifies a different semantics for inline, this function attribute is provided as a transition measure and as a useful feature in its own right. This attribute is available in GCC 4.1.3 and later. It is available if either of the preprocessor macros __GNUC_GNU_INLINE__ or __GNUC_STDC_INLINE__ are defined. See An Inline Function is As Fast As a Macro.

In C++, this attribute does not depend on extern in any way, but it still requires the inline keyword to enable its special behavior.

So I guess the version in the header wasn't a candidate for inlining because of mismatching target options, but providing a non-inline definition let GCC call it anyway. So this might not be a GCC bug. And it's probably not something GCC should even warn about since most .c files that provide the non-inline definition (if there is one; not the case for intrinsics I assume) will include the header that defines the extern inline version.

Even if it were or is a bug that GCC didn't error or warn about this, don't define your own functions in a reserved part of the namespace in the first place. The most we could hope for is GCC being more helpful like erroring at compile-time instead of silently making a binary you didn't intend.

The behaviour is undefined in this case (defining functions with reserved names). Perhaps GCC could warn if it differentiated based on path, knowing which headers were "part of the implementation" vs. 3rd-party libraries. But I think glibc also uses plenty of __ names in headers in /usr/include, so I don't think that's viable.

At first I thought GCC was allowing it because different target attributes on definitions for the same name is how GCC does function multiversioning. But this is different. If it was doing multiversioning, it would be using a non-AVX512 version because main was compiled with just SSE2 in effect. The test-case above compiles with just gnu_inline, no target-attribute stuff required.