assemblyx86instruction-set

x86 rep prefix with a count of zero: what happens?


What happens for an initial count of zero for an x86 rep prefix?

Intel's manual says explicitly it’s a while count != 0 loop with the test at the top, which is the sane expected behaviour.

But most of the many vague reports I’ve seen elsewhere suggest that there’s no initial test for zero so it would be like a countdown with a test at the end and so disaster if it’s repeat {… count —=1; } until count == 0; or who knows.


Solution

  • Nothing happens with RCX=0; rep prefixes do check for zero first like the pseudocode says. (Unlike the loop instruction which is exactly like the bottom of a do{}while(--ecx), or a dec rcx/jnz but without affecting FLAGS.)

    I think I've heard of this rarely being used as an idiom for a conditional load or store with rep lodsw or rep stosw with a count of 0 or 1, especially in the bad old days before cmov. (cmov is an unconditional load feeding an ALU select operation, so it needs a valid address, unlike rep lods with a count of zero.) This is not efficient especially for rep stos on modern x86 with Fast Strings microcode (P6 and later), especially without anything like Fast Short Rep-Movs (Ice Lake IIRC.) Golden Cove (Alder Lake / Sapphire Rapids) additionally has fast zero-length rep movsb which makes that the same speed as 1-128 bytes, making it not terrible for use-cases that sometimes do a zero length memcpy.

    The same applies for instructions that treat the prefixes as repz / repnz (cmps/scas) instead of unconditional rep (lods/stos/movs). Doing zero iterations means they leave FLAGS umodified.

    If you want to check FLAGS after a repe/ne cmps/scas, you need to make sure the count was non-zero, or that FLAGS was already set such that you'll branch in a useful way for zero-length buffers. (Perhaps from xor-zeroing a register that you're going to want later.)


    rep movs and rep stos have fast-strings microcode on CPUs since P6, but the startup overhead makes them rarely worth it, especially when sizes can be short and/or data might be misaligned. They're more useful in kernel code where you can't freely use XMM registers. Some recent CPUs like Ice Lake have fast-short-rep microcode that I think is supposed to reduce startup overhead for small counts.

    repe/ne scas/cmps do not have fast-strings microcode on most CPUs, only on very recent CPUs like Sapphire Rapids and maybe Alder Lake P-cores. So they're quite slow, like one load per clock cycle (so 2 cycles per count for cmpsb/w/d/q) according to testing by https://agner.org/optimize/ and https://uops.info/.


    For conditional load / store, APX will also introduce a way to do that efficiently and branchlessly, with scalar instead of AVX2 or AVX-512 masking: a fault-suppressing (Conditionally-Faulting) cfcmovcc [mem], reg as well as a load form. See Hard to debug SEGV due to skipped cmov from out-of-bounds memory for some about that and other conditional-load things x86 supports.