Using AVX512 instructions, I can use an index vector to gather 16 single precision values from an array. However, such gather operations are not that efficient and issue at a rate of only 2 scalar loads/cycle on my machine.
In one of my applications, I always need to gather four contiguous float elements. In scalar pseudocode:
for (int i = 0; i < 16; ++i) {
    result.x[i] = source[offset[i]*4 + 0];
    result.y[i] = source[offset[i]*4 + 1];
    result.z[i] = source[offset[i]*4 + 2];
    result.w[i] = source[offset[i]*4 + 3];
}
NVIDIA GPUs can do this sort of thing with a single ld.global.v4.f32 instruction. On the CPU, it also seems that one should be able to exploit this contiguity to do better than four 16-wide gathers. Does anybody here know a faster AVX512 instruction sequence that would improve on the naive strategy? It's fine to assume arbitrary alignment.
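For reference, the naive strategy I have in mind looks roughly like this (a sketch, assuming offset is an int[16] and result holds four float[16] arrays):

// Naive version: one 16-wide gather per component, four gathers total.
__m512i idx  = _mm512_loadu_si512(offset);                   // 16 int32 offsets
__m512i base = _mm512_slli_epi32(idx, 2);                    // offset[i]*4
__m512 x = _mm512_i32gather_ps(base, source, 4);
__m512 y = _mm512_i32gather_ps(_mm512_add_epi32(base, _mm512_set1_epi32(1)), source, 4);
__m512 z = _mm512_i32gather_ps(_mm512_add_epi32(base, _mm512_set1_epi32(2)), source, 4);
__m512 w = _mm512_i32gather_ps(_mm512_add_epi32(base, _mm512_set1_epi32(3)), source, 4);
_mm512_storeu_ps(result.x, x);
_mm512_storeu_ps(result.y, y);
_mm512_storeu_ps(result.z, z);
_mm512_storeu_ps(result.w, w);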
Gathers on Intel and AMD CPUs can't take advantage of source elements being contiguous with each other; they access cache separately for each scalar element. (https://uops.info/ - note that even though the port 2/3 uop count is low on recent Intel, the throughput matches the 2/clock or 3/clock cache read bottleneck).
Also, on Intel from Skylake through Tiger Lake, gather instruction throughput is now garbage due to microcode mitigation for GDS (https://downfall.page/#faq). And AMD gathers have never been very fast.
As chtz says, you should manually do 128-bit SIMD loads since your access pattern is simple.
But then I guess you need to shuffle 4 such vectors to get x[i + 0..3] contiguous for a 128-bit store, and so on, since you're scattering the results into a Struct-of-Arrays output.
You can do two vmovups XMM / vinsertf128 YMM, m128 pairs, then shuffle them together with vpermt2ps to get one 512-bit vector with all four x values contiguous, then all four y values contiguous, etc. (In C++ with intrinsics, use _mm512_castps256_ps512 to reinterpret a __m256 as a __m512 whose upper half is don't-care undefined garbage.)
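As a concrete sketch of that setup (with the loop now stepping i by 4, and assuming the group for offset[i] goes in the low lane of first_2_loads, offset[i+1] in its high lane, and likewise for the second pair; the index constant below is one choice that matches that layout):

// Two vmovups XMM / vinsertf128 YMM,m128 pairs: four xyzw groups in two YMM regs.
__m256 first_2_loads  = _mm256_castps128_ps256(_mm_loadu_ps(&source[offset[i+0]*4]));
first_2_loads  = _mm256_insertf128_ps(first_2_loads,  _mm_loadu_ps(&source[offset[i+1]*4]), 1);
__m256 second_2_loads = _mm256_castps128_ps256(_mm_loadu_ps(&source[offset[i+2]*4]));
second_2_loads = _mm256_insertf128_ps(second_2_loads, _mm_loadu_ps(&source[offset[i+3]*4]), 1);

// vpermt2ps index: the four x elements, then the y, z, and w elements.
// Indices 0..7 pick from the first source, 16..23 from the second.
__m512i shuffle_constant = _mm512_setr_epi32(0,4,16,20,  1,5,17,21,
                                             2,6,18,22,  3,7,19,23);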
This sets up for vextractf32x4 to memory for the high two, vextractf128 for lane 1, and vmovups for lane 0. Hopefully a compiler will optimize that way if you do 4x _mm_store_ps(&result.x[i], _mm512_extractf32x4_ps(v, 0)) etc., but you can manually use:
__m512 v = _mm512_permutex2var_ps(_mm512_castps256_ps512(first_2_loads),
                                  shuffle_constant,   // index vector is the middle operand
                                  _mm512_castps256_ps512(second_2_loads));
_mm_store_ps(&result.x[i], _mm512_castps512_ps128(v));                              // vmovups
_mm_store_ps(&result.y[i], _mm256_extractf128_ps(_mm512_castps512_ps256(v), 1));    // vextractf128 mem, ymm, 1
_mm_store_ps(&result.z[i], _mm512_extractf32x4_ps(v, 2));                           // vextractf32x4 mem, zmm, 2
_mm_store_ps(&result.w[i], _mm512_extractf32x4_ps(v, 3));                           // vextractf32x4 mem, zmm, 3
Or avoid 512-bit vectors and do two separate 256-bit vpermt2ps, if the rest of your code doesn't use 512-bit vectors.
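A minimal sketch of that 256-bit version, reusing the same two YMM loads from the earlier snippet (_mm256_permutex2var_ps needs AVX-512VL; the index values are again tied to the assumed load layout):

// First permute: the four x values then the four y values; second: z then w.
// Indices 0..7 pick from the first source, 8..15 from the second.
__m256i idx_xy = _mm256_setr_epi32(0,4,8,12,  1,5,9,13);
__m256i idx_zw = _mm256_setr_epi32(2,6,10,14, 3,7,11,15);
__m256 xy = _mm256_permutex2var_ps(first_2_loads, idx_xy, second_2_loads);
__m256 zw = _mm256_permutex2var_ps(first_2_loads, idx_zw, second_2_loads);

_mm_storeu_ps(&result.x[i], _mm256_castps256_ps128(xy));    // vmovups xmm
_mm_storeu_ps(&result.y[i], _mm256_extractf128_ps(xy, 1));  // vextractf128 mem, ymm, 1
_mm_storeu_ps(&result.z[i], _mm256_castps256_ps128(zw));
_mm_storeu_ps(&result.w[i], _mm256_extractf128_ps(zw, 1));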
(If you were doing more iterations, you could shuffle differently to set up for even wider stores, like 256-bit or maybe even 512-bit, if it's worth shuffling that much.)
AVX-512 has scatter instructions, but they're not efficient even on Intel, and far worse on AMD Zen 4. You could just vinsertf128 / vinsertf32x4 etc. to build one 512-bit vector and do a scatter from that, with offsets that include the distance between the starts of the x and y vectors, assuming they're allocated within 2GiB of each other so a 32-bit offset can reach. But that would be much slower: I think not even achieving one 32-bit store per clock (or 2 on Ice Lake and later; but commit from the store buffer to L1d is only 1 per clock unless two back-to-back stores are to the same cache line, then they can coalesce: https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#memory-related-limits).
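For completeness, a sketch of that scatter version (assuming result is one struct of four contiguous float[16] arrays, so the x-to-y/z/w distances are fixed 16-float strides and fit in the 32-bit index vector):

// Build one ZMM with the four xyzw groups back to back, no shuffling needed.
__m512 v = _mm512_castps128_ps512(_mm_loadu_ps(&source[offset[i+0]*4]));
v = _mm512_insertf32x4(v, _mm_loadu_ps(&source[offset[i+1]*4]), 1);
v = _mm512_insertf32x4(v, _mm_loadu_ps(&source[offset[i+2]*4]), 2);
v = _mm512_insertf32x4(v, _mm_loadu_ps(&source[offset[i+3]*4]), 3);

// Element 4*k + c belongs at component c (x/y/z/w), position i+k.
// With contiguous float[16] arrays, component c starts 16*c floats past result.x.
__m512i dst_idx = _mm512_setr_epi32(0,16,32,48,  1,17,33,49,
                                    2,18,34,50,  3,19,35,51);
_mm512_i32scatter_ps(&result.x[i], dst_idx, v, 4);          // vscatterdps, scale = sizeof(float)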