Using AVX512 instructions, I can use an index vector to gather 16 single precision values from an array. However, such gather operations are not that efficient and issue at a rate of only 2 scalar loads/cycle on my machine.
In one of my applications, I always need to gather four contiguous float elements. In scalar pseudocode:
for (int i = 0; i < 16; ++i) {
    result.x[i] = source[offset[i]*4 + 0];
    result.y[i] = source[offset[i]*4 + 1];
    result.z[i] = source[offset[i]*4 + 2];
    result.w[i] = source[offset[i]*4 + 3];
}
NVIDIA GPUs can do this sort of thing with a single ld.global.v4.f32 instruction. On the CPU, it also seems that one should be able to exploit this contiguity to do better than four 16-wide gathers. Does anybody here know a faster AVX512 instruction sequence that would improve on the naive strategy? It's fine to assume arbitrary alignment.
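For reference, the naive strategy I have in mind looks roughly like this (a sketch, assuming offset is an int[16] and result holds four float[16] arrays):

// Naive version: one 16-wide gather per component, four gathers total.
__m512i idx  = _mm512_loadu_si512(offset);                   // 16 int32 offsets
__m512i base = _mm512_slli_epi32(idx, 2);                    // offset[i]*4
__m512 x = _mm512_i32gather_ps(base, source, 4);
__m512 y = _mm512_i32gather_ps(_mm512_add_epi32(base, _mm512_set1_epi32(1)), source, 4);
__m512 z = _mm512_i32gather_ps(_mm512_add_epi32(base, _mm512_set1_epi32(2)), source, 4);
__m512 w = _mm512_i32gather_ps(_mm512_add_epi32(base, _mm512_set1_epi32(3)), source, 4);
_mm512_storeu_ps(result.x, x);
_mm512_storeu_ps(result.y, y);
_mm512_storeu_ps(result.z, z);
_mm512_storeu_ps(result.w, w);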
Gathers on Intel and AMD CPUs can't take advantage of source elements being contiguous with each other; they access cache separately for each scalar element. (https://uops.info/ - note that even though the port 2/3 uop count is low on recent Intel, the throughput matches the 2/clock or 3/clock cache read bottleneck).
Also, on Intel from Skylake through Tiger Lake, gather instruction throughput is now garbage due to microcode mitigation for GDS (https://downfall.page/#faq). And AMD gathers have never been very fast.
As chtz says, you should manually do 128-bit SIMD loads since your access pattern is simple.
But then I guess you need to shuffle 4 such vectors to get x[i + 0..3] contiguous for a 128-bit store, and so on, since you're scattering the results into a Struct-of-Arrays output.
You can do two vmovups XMM / vinsertf128 YMM, m128 pairs, then shuffle them together with vpermt2ps to get one 512-bit vector with all four x values contiguous, then all four y values contiguous, etc. (In C++ with intrinsics, use _mm512_castps256_ps512 to reinterpret a __m256 as a __m512 whose upper half is don't-care undefined garbage.)
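As a concrete sketch of that setup (with the loop now stepping i by 4, and assuming the group for offset[i] goes in the low lane of first_2_loads, offset[i+1] in its high lane, and likewise for the second pair; the index constant below is one choice that matches that layout):

// Two vmovups XMM / vinsertf128 YMM,m128 pairs: four xyzw groups in two YMM regs.
__m256 first_2_loads  = _mm256_castps128_ps256(_mm_loadu_ps(&source[offset[i+0]*4]));
first_2_loads  = _mm256_insertf128_ps(first_2_loads,  _mm_loadu_ps(&source[offset[i+1]*4]), 1);
__m256 second_2_loads = _mm256_castps128_ps256(_mm_loadu_ps(&source[offset[i+2]*4]));
second_2_loads = _mm256_insertf128_ps(second_2_loads, _mm_loadu_ps(&source[offset[i+3]*4]), 1);

// vpermt2ps index: the four x elements, then the y, z, and w elements.
// Indices 0..7 pick from the first source, 16..23 from the second.
__m512i shuffle_constant = _mm512_setr_epi32(0,4,16,20,  1,5,17,21,
                                             2,6,18,22,  3,7,19,23);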
This sets up for vextractf32x4 to memory for the high two, vextractf128 for lane 1, and vmovups for lane 0. Hopefully a compiler will optimize that way if you do 4x _mm_store_ps(&result.x[i], _mm512_extractf32x4_ps(v, 0)) etc., but you can manually use:
__m512 v = _mm512_permutex2var_ps(_mm512_castps256_ps512(first_2_loads),
                                  shuffle_constant,   // index vector is the middle operand
                                  _mm512_castps256_ps512(second_2_loads));
_mm_store_ps(&result.x[i], _mm512_castps512_ps128(v));                              // vmovups
_mm_store_ps(&result.y[i], _mm256_extractf128_ps(_mm512_castps512_ps256(v), 1));    // vextractf128 mem, ymm, 1
_mm_store_ps(&result.z[i], _mm512_extractf32x4_ps(v, 2));                           // vextractf32x4 mem, zmm, 2
_mm_store_ps(&result.w[i], _mm512_extractf32x4_ps(v, 3));                           // vextractf32x4 mem, zmm, 3
Or avoid 512-bit vectors and do two separate 256-bit vpermt2ps, if the rest of your code doesn't use 512-bit vectors.
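A minimal sketch of that 256-bit version, reusing the same two YMM loads from the earlier snippet (_mm256_permutex2var_ps needs AVX-512VL; the index values are again tied to the assumed load layout):

// First permute: the four x values then the four y values; second: z then w.
// Indices 0..7 pick from the first source, 8..15 from the second.
__m256i idx_xy = _mm256_setr_epi32(0,4,8,12,  1,5,9,13);
__m256i idx_zw = _mm256_setr_epi32(2,6,10,14, 3,7,11,15);
__m256 xy = _mm256_permutex2var_ps(first_2_loads, idx_xy, second_2_loads);
__m256 zw = _mm256_permutex2var_ps(first_2_loads, idx_zw, second_2_loads);

_mm_storeu_ps(&result.x[i], _mm256_castps256_ps128(xy));    // vmovups xmm
_mm_storeu_ps(&result.y[i], _mm256_extractf128_ps(xy, 1));  // vextractf128 mem, ymm, 1
_mm_storeu_ps(&result.z[i], _mm256_castps256_ps128(zw));
_mm_storeu_ps(&result.w[i], _mm256_extractf128_ps(zw, 1));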
(If you were doing more iterations, you could shuffle differently to set up for even wider stores, like 256-bit or maybe even 512-bit, if it's worth shuffling that much.)
AVX-512 has scatter instructions, but they're not efficient even on Intel, and far worse on AMD Zen 4. You could just vinsertf128 / vinsertf32x4 etc. to build one 512-bit vector and do a scatter from that, with offsets that include the distance between the starts of the x and y vectors, assuming they're allocated within 2GiB of each other so a 32-bit offset can reach. But that would be much slower: I think not even achieving one 32-bit store per clock (or 2 on Ice Lake and later; but commit from the store buffer to L1d is only 1 per clock unless two back-to-back stores are to the same cache line, then they can coalesce: https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#memory-related-limits).
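For completeness, a sketch of that scatter version (assuming result is one struct of four contiguous float[16] arrays, so the x-to-y/z/w distances are fixed 16-float strides and fit in the 32-bit index vector):

// Build one ZMM with the four xyzw groups back to back, no shuffling needed.
__m512 v = _mm512_castps128_ps512(_mm_loadu_ps(&source[offset[i+0]*4]));
v = _mm512_insertf32x4(v, _mm_loadu_ps(&source[offset[i+1]*4]), 1);
v = _mm512_insertf32x4(v, _mm_loadu_ps(&source[offset[i+2]*4]), 2);
v = _mm512_insertf32x4(v, _mm_loadu_ps(&source[offset[i+3]*4]), 3);

// Element 4*k + c belongs at component c (x/y/z/w), position i+k.
// With contiguous float[16] arrays, component c starts 16*c floats past result.x.
__m512i dst_idx = _mm512_setr_epi32(0,16,32,48,  1,17,33,49,
                                    2,18,34,50,  3,19,35,51);
_mm512_i32scatter_ps(&result.x[i], dst_idx, v, 4);          // vscatterdps, scale = sizeof(float)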