Tags: gcc, vectorization, cpu-architecture, simd, avx2

AVX2 / gcc: Improve CPU-level parallelism by using different registers


I have this code:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

__attribute__((target("avx2")))
size_t lower_than_16(const uint64_t values[16], uint64_t x)
{
    __m256i vx = _mm256_set1_epi64x(x);
    __m256i vvals1 = _mm256_loadu_si256((__m256i*)&values[0]);
    __m256i vvals2 = _mm256_loadu_si256((__m256i*)&values[4]);
    __m256i vvals3 = _mm256_loadu_si256((__m256i*)&values[8]);
    __m256i vvals4 = _mm256_loadu_si256((__m256i*)&values[12]);
    __m256i vcmp1  = _mm256_cmpgt_epi64(vvals1, vx);
    __m256i vcmp2  = _mm256_cmpgt_epi64(vvals2, vx);
    __m256i vcmp3  = _mm256_cmpgt_epi64(vvals3, vx);
    __m256i vcmp4  = _mm256_cmpgt_epi64(vvals4, vx);
    const int mask = (_mm256_movemask_pd(_mm256_castsi256_pd(vcmp1))) |
                     (_mm256_movemask_pd(_mm256_castsi256_pd(vcmp2)) << 4) |
                     (_mm256_movemask_pd(_mm256_castsi256_pd(vcmp3)) << 8) |
                     (_mm256_movemask_pd(_mm256_castsi256_pd(vcmp4)) << 12);
    // Each mask bit is 1 where values[i] > x, so a 0 bit marks values[i] <= x.
    if (mask != 0xFFFF) {
        // found: index of the lowest 0 bit
        return __builtin_ctz(~mask);
    }

    return 16;
}

Basically, given an array of 16 elements, I want to find the index of the first element for which values[i] <= x holds. If no such element exists, 16 is returned.
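
For reference, here's a scalar equivalent of what I want (same semantics, just written out as a plain loop):

size_t lower_than_16_scalar(const uint64_t values[16], uint64_t x)
{
    // First index i with values[i] <= x, or 16 if there is none.
    for (size_t i = 0; i < 16; i++) {
        if (values[i] <= x)
            return i;
    }
    return 16;
}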

This is implemented with AVX2, and I'm using gcc as the compiler. The generated assembly looks like this:

lower_than_16:
        vmovq   xmm2, rsi
        vmovdqu ymm1, YMMWORD PTR [rdi]
        vpbroadcastq    ymm0, xmm2
        vpcmpgtq        ymm1, ymm1, ymm0
        vmovmskpd       esi, ymm1
        vmovdqu ymm1, YMMWORD PTR [rdi+32]
        vpcmpgtq        ymm1, ymm1, ymm0
        vmovmskpd       eax, ymm1
        vmovdqu ymm1, YMMWORD PTR [rdi+64]
        sal     eax, 4
        vpcmpgtq        ymm1, ymm1, ymm0
        vmovmskpd       ecx, ymm1
        vmovdqu ymm1, YMMWORD PTR [rdi+96]
        sal     ecx, 8
        vpcmpgtq        ymm0, ymm1, ymm0
        or      eax, ecx
        or      eax, esi
        vmovmskpd       edx, ymm0
        sal     edx, 12
        or      eax, edx
        mov     edx, 16
        cmp     eax, 65535
        je      .L1
        not     eax
        xor     edx, edx
        rep bsf edx, eax
.L1:
        mov     rax, rdx
        vzeroupper
        ret

(can be seen here: https://godbolt.org/z/7eea39Gqv)

I see that gcc always reuses the same register for each unrolled iteration. However, wouldn't it be more efficient to use a different ymm register for each unrolled iteration, so the CPU could more easily execute those 4 independent comparisons in parallel? I know the CPU does some register renaming, but is it smart enough that reusing the same architectural register doesn't prevent those instructions from executing in parallel? Or would it be easier / more efficient if different registers were used?

Thanks a lot


Solution

  • Every register-write is renamed, and no CPU has any limit on renaming the same register many times per clock cycle: front-end throughput isn't reduced if all the instructions write XMM0 or EFLAGS or RCX or whatever. (At least no x86 CPUs; Agner Fog's microarch guide specifically mentions this for every one he's tested.)
    There's no down-side to GCC's register-allocation choices here.
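
    To make that concrete, here's a toy gcc inline-asm function (my own illustration with a made-up name, not from the original discussion): four back-to-back, independent writes of the same architectural register.

    unsigned rename_demo(void)
    {
        unsigned r;
        // Four independent writes of the same architectural register.
        // Each write is renamed to its own physical register, so they
        // don't serialize; only the last value is architecturally visible.
        __asm__("mov $1, %[r]\n\t"
                "mov $2, %[r]\n\t"
                "mov $3, %[r]\n\t"
                "mov $4, %[r]"
                : [r] "=r"(r));
        return r;
    }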

    When register-file size is the limiting factor in how far ahead out-of-order exec can see (footnote 1), it doesn't matter much, if at all, whether an instruction writes a "cold" register or overwrites a recent result. (I haven't tried to specifically test it, and I've never seen a recommendation one way or the other in Agner Fog's or Intel's optimization manuals, or anywhere else.)

    An entry in a Physical Register File (PRF) can't be freed until an instruction that overwrote it has retired (from the Reorder Buffer aka ROB), since an external interrupt at any point could discard non-retired instructions and roll back to the retirement state. (Unless there's some trick I'm forgetting or not aware of which could allow freeing entries earlier.)

    If the two parts of FLAGS (CF and the SPAZO group) were written by separate instructions (e.g. after an inc), two separate PRF entries will be referenced. An instruction like add, cmp, or shl which writes all of them will free both entries when it retires. But integer and FP/SIMD have separate register files, so you don't need to worry about leaving FLAGS split by using inc or dec for the loop counter in a SIMD loop. And FLAGS is written so often in integer code that I'd expect this to very rarely be a problem, or at least not one you can do anything about without spending more instructions, which is worse. Except that sometimes it could be a very minor reason to use add reg, 1 instead of inc.
    (In normal x86 designs, integer PRF entries have enough room to hold a 64-bit integer result and a full FLAGS result. So in terms of tracking inputs and outputs, an instruction like add still only has one output. So CF, SPAZO, and a register like R15 can all have their RAT entries pointing at the same PRF entry. The PRF entry can't be freed until all references are dead.)
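
    As a concrete illustration of flag-splitting (a sketch of my own, with a made-up function name), this gcc inline-asm sequence leaves FLAGS backed by two PRF entries until some later instruction that writes all of FLAGS retires:

    unsigned long long flags_split_demo(unsigned long long a, unsigned long long b)
    {
        // add writes all of FLAGS: CF and SPAZO share one PRF entry.
        // inc writes only the SPAZO group, so CF keeps referencing the
        // add's older entry; adc then reads that stale CF entry.
        __asm__("add %[b], %[a]\n\t"
                "inc %[a]\n\t"
                "adc $0, %[b]"
                : [a] "+r"(a), [b] "+r"(b)
                :
                : "cc");
        return a + b;
    }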


    You might think that mov-elimination (handled during rename instead of via an execution unit) would help by getting two registers to point at the same PRF entry. It might in fact do so, but unfortunately there's limited capacity in the mov-elimination slots that reference-count such doubled-up entries, especially in the first few generations of CPUs to have the feature (Ivy Bridge).
    So normally you want to overwrite the result of a mov soon, to free up resources for eliminating future mov instructions: e.g. mov eax, ecx / or eax, 0x55, modifying the new copy instead of the original, unless you're also going to overwrite the copy with something else soon. But Ice Lake disabled integer mov-elim with a microcode update, so there mov has non-zero latency, and (all else equal) it should still be kept off the critical path when tuning for generic or Ice Lake.
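
    In C-with-inline-asm terms, that pattern looks like this (again my own sketch with a made-up name; the two instructions are the ones mentioned above):

    unsigned mov_copy_demo(unsigned c)
    {
        unsigned a;
        // The mov is eliminated at rename: both registers point at one
        // PRF entry, consuming a reference-counting slot. The or then
        // overwrites the copy right away, un-doubling the entry and
        // freeing the slot for future eliminated movs.
        __asm__("mov %[c], %[a]\n\t"
                "or  $0x55, %[a]"
                : [a] "=&r"(a)
                : [c] "r"(c)
                : "cc");
        return a;
    }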


    Footnote 1: PRF capacity vs. things like ROB capacity being the limiting factor.


    p.s. Intel's P6-family (Nehalem and earlier) has a different effect which has some influence on how you choose to use registers: register-read stalls, where there's a limit on how many "cold" (not recently written) regs can be read per clock cycle during alloc/rename/issue. It doesn't have a PRF: it keeps results in the ROB itself, and has to read cold values from the "retirement register file". But that has no bearing on whether you overwrite multiple different regs or the same one multiple times; it's the distance between write and read that matters.