c++ssesimdmemory-alignmentavx

How to solve the 32-byte-alignment issue for AVX load/store operations?


I am having alignment issue while using ymm registers, with some snippets of code that seems fine to me. Here is a minimal working example:

#include <iostream> 
#include <immintrin.h>

inline void ones(float *a)
{
     __m256 out_aligned = _mm256_set1_ps(1.0f);
     _mm256_store_ps(a,out_aligned);
}

int main()
{
     size_t ss = 8;
     float *a = new float[ss];
     ones(a);

     delete [] a;

     std::cout << "All Good!" << std::endl;
     return 0;
}

Certainly, sizeof(float) is 4 on my architecture (Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz) and I'm compiling with gcc using -O3 -march=native flags. Of course the error goes away with unaligned memory access i.e. specifying _mm256_storeu_ps. I also do not have this problem on xmm registers, i.e.

inline void ones_sse(float *a)
{
     __m128 out_aligned = _mm_set1_ps(1.0f);
     _mm_store_ps(a,out_aligned);
}

Am I doing anything foolish? what is the work-around for this?


Solution

  • Yes, you can use _mm256_loadu_ps / storeu for unaligned loads/stores (AVX: data alignment: store crash, storeu, load, loadu doesn't). If the compiler doesn't do a bad job (cough GCC default tuning), AVX _mm256_loadu/storeu on data that happens to be aligned is just as fast as alignment-required load/store, so aligning data when convenient still gives you the best of both worlds for functions that normally run on aligned data but let hardware handle the rare cases where they don't. (Instead of always running extra instructions to check stuff).

    Alignment is especially important for 512-bit AVX-512 vectors, like 15 to 20% speed on SKX even over large arrays where you'd expect L3 / DRAM bandwidth to be the bottleneck, vs. a few percent with AVX2 CPUs for large arrays. (It can still matter significantly with AVX2 on modern CPUs if your data is hot in L2 or especially L1d cache, especially if you can come close to maxing out 2 loads and/or 1 store per clock. Cache-line splits cost about twice the throughput resources, plus needing a line-split buffer temporarily.)


    The standard allocators normally only align to alignof(max_align_t), which is often 16B, e.g. long double in the x86-64 System V ABI. But in some 32-bit ABIs it's only 8B, so it's not even sufficient for dynamic allocation of aligned __m128 vectors and you'll need to go beyond simply calling new or malloc.

    Static and automatic storage are easy: use alignas(32) float arr[N];

    C++17 provides aligned new for aligned dynamic allocation. If alignof for a type is greater than the standard alignment, then aligned operator new/operator delete are used. So new __m256[N] just works in C++17 (if compiler supports this C++17 feature; check __cpp_aligned_new feature macro). In practice, GCC / clang / MSVC / ICX support it, ICC 2021 doesn't.

    float *arr = new (std::align_val_t(32)) float[size];  // C++17
    

    Without that C++17 feature, even stuff like std::vector<__m256> will break, not just std::vector<int>, unless you get lucky and it happens to be aligned by 32.


    Plain-delete compatible allocation of a float / int array:

    Unfortunately, auto* arr = new alignas(32) float[numSteps] does not work for all compilers, as alignas is applicable to a variable, a member, or a class declaration, but not as type modifier. (GCC accepts using vfloat = alignas(32) float;, so this does give you an aligned new that's compatible with ordinary delete on GCC).

    Workarounds are either wrapping in a structure (struct alignas(32) s { float v; }; new s[numSteps];) or passing alignment as placement parameter (new (std::align_val_t(32)) float[numSteps];), in later case be sure to call matching aligned operator delete.

    See documentation for new/new[] and std::align_val_t


    Other options, incompatible with new/delete

    Other options for dynamic allocation are mostly compatible with malloc/free, not new/delete:

    #include <stdlib.h>
    int posix_memalign(void **memptr, size_t alignment, size_t size);  // POSIX 2001
    void *aligned_alloc(size_t alignment, size_t size);                // C11 (and ISO C++17)
    

    alignas() with arrays / structs

    In C++11 and later: use alignas(32) float avx_array[1234] as the first member of a struct/class member (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. std::aligned_storage documentation has an example of this technique to explain what std::aligned_storage does.

    This doesn't actually work until C++17 for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>), see Making std::vector allocate aligned memory.

    Starting in C++17, the compiler will pick aligned new for types with alignment enforced by alignas on the whole type or its member, also std::allocator will pick aligned new for such type, so nothing to worry about when creating std::vector of such types.


    And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL with appropriate casting. Too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that support Intel _mm256_... intrinsics. But there are even library functions that will help you do this, IIRC, if you insist.

    The requirement to use _mm_free instead of free probably exists in part for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique. Or for an aligned allocator using an alternate free-list.