c++ x86 language-lawyer undefined-behavior intrinsics

Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?


Is it legal to reinterpret_cast a float* to a __m256* and access float objects through a different pointer type?

constexpr size_t _m256_float_step_sz = sizeof(__m256) / sizeof(float);
alignas(__m256) float stack_store[100 * _m256_float_step_sz ]{};
__m256& hwvec1 = *reinterpret_cast<__m256*>(&stack_store[0 * _m256_float_step_sz]);

using arr_t = float[_m256_float_step_sz];
arr_t& arr1 = *reinterpret_cast<float(*)[_m256_float_step_sz]>(&hwvec1);

Do hwvec1 and arr1 depend on undefined behaviors?

Do they violate strict aliasing rules? [basic.lval]/11

Or is there only one defined way, using the load/store intrinsics:

__m256 hwvec2 = _mm256_load_ps(&stack_store[0 * _m256_float_step_sz]);
_mm256_store_ps(&stack_store[1 * _m256_float_step_sz], hwvec2);

godbolt


Solution

  • ISO C++ doesn't define __m256 and the other vector types, so we need to look at what does define their behaviour on the implementations that support them.

    Intel's intrinsics define vector-pointers like __m256* as being allowed to alias anything else, the same way ISO C++ defines char* as being allowed to alias. (But not vice-versa: it's UB and breaks in practice to point an int* at a __m256i and deref it.)

    So yes, it's safe to dereference a __m256* instead of using a _mm256_load_ps() aligned-load intrinsic.

    But especially for float/double, it's often easier to use the intrinsics because they take care of casting from float*, too. For integers, the AVX512 load/store intrinsics are defined as taking void*, but AVX2 and earlier need a cast like (__m256i*)&arr[i] which is pretty clunky API design and clutters up code using it.

    A few non-AVX512 intrinsics taking void* have also been added, like the alignment- and aliasing-safe movd/movq load/store intrinsics such as _mm_loadu_si32(void*). Previously, I think Intel assumed you'd use _mm_cvtsi32_si128, which required getting an int loaded safely yourself; that meant using memcpy to avoid UB (at least on compilers other than classic ICC and MSVC, which allow unaligned int* as well as not enforcing strict aliasing). This might have been around the time Intel started looking at migrating to LLVM for ICX/ICPX/OneAPI, and realizing how much of a mess it was to deal with narrow loads on compilers that enforce strict aliasing.


    To learn something about what the intrinsics API required, we can look at the non-portable implementation details of GCC. Presumably they inferred from Intel examples or documentation what behaviour was necessary, or some Intel engineers may have sent patches. (You shouldn't rely on the GCC details: they're not documented, and other compilers that implement __m256 and the other intrinsic types might do things differently. But we can see that GCC had to explicitly allow aliasing that wouldn't otherwise be safe.)

    In GCC, this is implemented by defining __m256 with a may_alias attribute: from gcc7.3's avxintrin.h (one of the headers that <immintrin.h> includes):

    /* The Intel API is flexible enough that we must allow aliasing with other
       vector types, and their scalar components.  */
    typedef float __m256 __attribute__ ((__vector_size__ (32),
                                         __may_alias__));
    typedef long long __m256i __attribute__ ((__vector_size__ (32),
                                              __may_alias__));
    typedef double __m256d __attribute__ ((__vector_size__ (32),
                                           __may_alias__));
    
    /* Unaligned version of the same types.  */
    typedef float __m256_u __attribute__ ((__vector_size__ (32),
                                           __may_alias__,
                                           __aligned__ (1)));
    typedef long long __m256i_u __attribute__ ((__vector_size__ (32),
                                                __may_alias__,
                                                __aligned__ (1)));
    typedef double __m256d_u __attribute__ ((__vector_size__ (32),
                                             __may_alias__,
                                             __aligned__ (1)));
    

    (In case you were wondering, this is why dereferencing a __m256* is like _mm256_store_ps, not storeu. storeu casts the pointer arg to the _u type defined with aligned(1).)

    GNU C native vectors without may_alias are allowed to alias their scalar type, e.g. even without the may_alias, you could safely cast between float* and a hypothetical v8sf type. But may_alias makes it safe to load from an array of int[], char[], or whatever.


    Other behaviour Intel's intrinsics require to be defined

    Using Intel's API for _mm_storeu_si128( (__m128i*)&arr[i], vec); requires you to create potentially-unaligned pointers which would fault if you dereferenced them directly. And _mm_storeu_ps to a location that isn't 4-byte aligned requires creating an under-aligned float*.

    Just creating unaligned pointers, or pointers outside an object, is UB in ISO C++, even if you don't dereference them. I guess this allows implementations on exotic hardware which do some kinds of checks on pointers when creating them (possibly instead of when dereferencing), or maybe which can't store the low bits of pointers. (I have no idea if any specific hardware exists where more efficient code is possible because of this UB.)

    But implementations which support Intel's intrinsics must define the behaviour, at least for the __m* types and float*/double*. This is trivial for compilers targeting any normal modern CPU, including x86 with a flat memory model (no segmentation); pointers in asm are just integers kept in the same registers as data. (m68k has address vs. data registers, but it never faults from keeping bit-patterns that aren't valid addresses in A registers, as long as you don't deref them.)


    Going the other way: element access of a vector.

    Note that may_alias, like the char* aliasing rule, only goes one way: it is not guaranteed to be safe to use int32_t* to read a __m256. It might not even be safe to use float* to read a __m256. Just like it's not safe to do char buf[1024]; int *p = (int*)buf;.

    See GCC AVX __m256i cast to int array leads to wrong values for a real-world example of GCC breaking code that points an int* into a __m256i vec; object. (Not a dereferenced __m256i*; going that direction would be safe.) Because it's a may_alias type, the compiler can't infer that the underlying object is a __m256i; that's the whole point, and why it's safe to point a __m256i* at an int arr[] or whatever.

    GCC/clang define __m128i / __m256i as a vector of 2 or 4 long long elements, and __m128 / __m256 as a vector of 4 or 8 float elements, etc. GCC manual for Vector Extensions. These might count as real long long or float objects you can safely point a long long* or float* at, but GCC doesn't explicitly document that even for its native vector types (though it does define [] indexing). Even if they did, that would be an implementation detail of how GCC and Clang define Intel's vector types in terms of GNU or Clang vectors, not documented or guaranteed portable. Except on MSVC, which allows anything to alias anything, like -fno-strict-aliasing. (And I think classic ICC was also like that, unlike LLVM-based ICX.)

    Reading/writing through a char* can alias anything, but when you have a char object, strict-aliasing does make it UB to read it through other types. (I'm not sure if the major implementations on x86 do define that behaviour, but you don't need to rely on it because they optimize away memcpy of 4 bytes into an int32_t. You can and should use memcpy to express an unaligned load from a char[] buffer, because auto-vectorization with a wider type is allowed to assume 2-byte alignment for int16_t*, and make code that fails if it's not: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?)

    A char arr[] may not be a great analogy because arr[i] is defined in terms of *(arr+i), so there actually is a char* deref involved in accessing the array as char objects. Perhaps some char members of a struct would be a better example, then.


    To insert/extract vector elements, use shuffle intrinsics, SSE2 _mm_insert_epi16 / _mm_extract_epi16 or SSE4.1 insert / _mm_extract_epi8/32/64. For float, there are no insert/extract intrinsics that you should use with scalar float.

    Or store to an array and read the array. (print a __m128i variable). This does actually optimize away to vector extract instructions.

    GNU C vector syntax provides the [] operator for vectors, like __m256 v = ...; v[3] = 1.25;. MSVC defines vector types as a union with a .m128_f32[] member for per-element access.

    There are wrapper libraries like Agner Fog's (now Apache-licensed) Vector Class Library which provide portable operator[] overloads for their vector types, and operator + / - / * / << and so on. It's quite nice, especially for integer types, where having different types for different element widths makes v1 + v2 work with the right size. (GNU C native vector syntax does that for float/double vectors, and defines __m128i as a vector of signed int64_t, but MSVC doesn't provide operators on the base __m128 types.)


    You can also use union type-punning between a vector and an array of some type, which is safe in ISO C99 and in GNU C++, but not in ISO C++. I think it's officially safe in MSVC too, because of the way they define __m128 as a normal union.

    There's no guarantee you'll get efficient code from any of these element-access methods, though. Do not use inside inner loops, and have a look at the resulting asm if performance matters.