simd, avx512

Performance Difference Between _mm512_load_si512 and _mm512_stream_load_si512


I'm currently working on a project that involves AVX512 instructions and I have a question regarding the performance differences between _mm512_load_si512, _mm512_loadu_si512, and _mm512_stream_load_si512 when copying data from one pointer to another.

As I understand it, _mm512_load_si512 and _mm512_stream_load_si512 are designed to work with aligned memory addresses and are expected to perform faster in such cases. However, in my experiments, I haven't observed significant performance differences between these instructions, including _mm512_loadu_si512 which is meant for unaligned memory.

I found a question on stackoverflow mentioning _mm512_load_si512 and _mm512_loadu_si512 but not _mm512_stream_load_si512.

My CPU is a Core i7-1185G7 @ 3.00GHz, which supports AVX-512. I wonder if there are any specific considerations or insights into the performance characteristics of these instructions on this particular processor.

If anyone has experience or knowledge of AVX512 instructions or if there are nuances in memory alignment that I might be overlooking, I would greatly appreciate any guidance or suggestions.

Thank you in advance for your time and expertise.

My code (for the other tests, I replace _mm512_stream_load_si512 with _mm512_load_si512 or _mm512_loadu_si512):

#include <immintrin.h>
#include <stddef.h>

// Both dst and src pointers are 64-byte aligned; size is the number of
// 64-byte __m512i blocks to copy
void copy_data(__m512i *dst, __m512i *src, size_t size)
{
    for (; size; dst++, src++, size--)
    {
        __m512i block = _mm512_stream_load_si512(src); // NT-hint load (vmovntdqa)
        _mm512_stream_si512(dst, block);               // NT store (vmovntdq)
    }
}

Something to add: replacing _mm512_stream_si512 with _mm512_store_si512 or _mm512_storeu_si512 actually does make a difference. Overall, _mm512_stream_si512 is about 1.5x faster at copying data on my system.
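
For context, a minimal timing harness along these lines is one way to compare the variants (this is not my original benchmark; the buffer size, repeat count, and timing approach are just illustrative choices, and it assumes the copy_data above is compiled in the same file):

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void copy_data(__m512i *dst, __m512i *src, size_t size);  // as defined above

static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t bytes  = 256u << 20;               // 256 MiB: far larger than L3
    size_t blocks = bytes / sizeof(__m512i);
    __m512i *src = aligned_alloc(64, bytes);  // 64-byte aligned buffers
    __m512i *dst = aligned_alloc(64, bytes);
    if (!src || !dst) return 1;
    memset(src, 1, bytes);                    // touch source pages before timing

    copy_data(dst, src, blocks);              // warm-up run also touches dst pages
    double t0 = seconds_now();
    for (int rep = 0; rep < 10; rep++)
        copy_data(dst, src, blocks);
    double t1 = seconds_now();
    // Counts only the bytes written, not the reads
    printf("copy throughput: %.2f GB/s\n", 10.0 * bytes / (t1 - t0) / 1e9);

    free(src);
    free(dst);
    return 0;
}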


Solution

  • loadu is as fast as load when the data does happen to be aligned at run-time. The CPU is able to handle it optimistically, taking the fast-path on aligned addresses.

    The only difference is on unaligned data (which is always a cache-line split for 512-bit loads): fault vs. having hardware handle the unaligned case a bit more slowly. (Using another cycle later in the same load execution unit to access the other cache line.) Or a larger penalty on page-split (across a 4K boundary), but not a disaster on Skylake and later, i.e. on any Intel CPU new enough for AVX-512.

    loadu is great when your data is usually aligned, but you still want correct behaviour with the occasional unaligned pointer. Using extra instructions to always reach an alignment boundary, or to check and branch on misalignment, would slow down the aligned case slightly, and would be rather inconvenient for copy functions like this, which might have different misalignments for the two pointers. With 512-bit vectors, misalignment has enough of a penalty to maybe be worth worrying about, unlike with 256-bit vectors where it's tiny.

    Faulting is useful for detecting accidentally misaligned data. But note that compilers can fold _mm512_load_si512 into a memory source operand for an ALU instruction, which doesn't require alignment: e.g. using the load result as an operand for _mm512_add_epi32 can compile to vpaddd zmm0, zmm1, [rdi], which won't fault on a misaligned pointer. A debug build typically won't do that folding, which helps if you do want to catch use of unaligned data in places you intended to be aligned. (See the sketch at the end of this section.)

    See also How can I accurately benchmark unaligned access speed on x86_64?

    Aligned data being full-speed for alignment-not-required loads/stores has been true since Nehalem for Intel, and since at least Bulldozer-family for AMD, so it's true for all AVX CPUs, not just AVX-512. On even older CPUs like Core 2, the unaligned instructions like movdqu decoded to multiple uops so were always slow. This is why SSE memory operands for ALU instructions like addps xmm0, [rdi] require alignment but AVX memory operands don't (like vaddps xmm0, xmm0, [rdi]), except with alignment-checking instructions like vmovaps. With legacy SSE code, _mm_add_epi8(vec, _mm_loadu_si128(ptr)) can't fold into a memory operand for paddb, only _mm_load_si128. AVX changes that, even for 128-bit vectors. (Compilers will use the AVX VEX encodings for 128-bit vectors when you compile with AVX enabled.)
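
    As an illustration of that folding, here is a minimal sketch (the function name and the exact asm are my own assumptions; what you actually get depends on the compiler and options, e.g. -O2 -mavx512f):

    #include <immintrin.h>

    // With optimization enabled, compilers typically fold the aligned load into a
    // memory source operand, e.g. vpaddd zmm0, zmm1, [rdi], which does not check
    // alignment. A debug (-O0) build usually keeps a separate vmovdqa64 load,
    // which does fault if ptr is not 64-byte aligned.
    __m512i add_from_mem(const __m512i *ptr, __m512i v)
    {
        __m512i x = _mm512_load_si512(ptr);   // alignment-required load intrinsic
        return _mm512_add_epi32(v, x);        // may become a memory-source vpaddd
    }

    An _mm512_loadu_si512 in the same place can be folded too, since EVEX (and VEX) memory operands for ALU instructions don't require alignment either.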


    stream_load does nothing useful on normal (WB) memory regions

    Unless something changed with the AVX-512 version, stream_load is just a slower version of load unless you're using it on WC memory (e.g. video RAM). See What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region?

    Unlike NT loads, NT stores are useful on normal (WB cacheable) memory regions. See Enhanced REP MOVSB for memcpy regarding no-RFO (Read For Ownership) stores and memory bandwidth, especially for copying. (A sketch of such a copy follows at the end of this answer.)

    A 512-bit aligned store does replace every byte in the cache line (unless you're using a mask and it's not all-ones), but I'm not sure such a store would avoid reading the old contents from DRAM on a cache miss (i.e. avoid the read-for-ownership). That could be separate from how NT stores bypass cache: e.g. rep movsb is supposed to be able to use a no-RFO store protocol while still having its store data end up in cache, because it also writes contiguous data and so knows whole lines will get written.
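
    Putting that together, a sketch of a large-copy loop for WB memory might look like the following (my own variation on the question's function, not a tuned memcpy; it assumes both pointers are 64-byte aligned and the copy is much larger than the caches, so bypassing them with the stores pays off):

    #include <immintrin.h>
    #include <stddef.h>

    // Plain (cacheable) loads + NT stores: the stores avoid the RFO and don't
    // pollute the cache with data you won't re-read soon.
    void copy_data_nt(__m512i *dst, const __m512i *src, size_t size)
    {
        for (; size; dst++, src++, size--)
        {
            __m512i block = _mm512_load_si512(src);  // normal load: an NT load gains nothing on WB memory
            _mm512_stream_si512(dst, block);         // NT store: no RFO, bypasses cache
        }
        _mm_sfence();  // NT stores are weakly ordered: fence before publishing the data to other threads
    }

    For copies that fit in cache, or whose destination will be read again soon, regular stores that keep the data in cache are usually the better choice.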