What happens for an initial count of zero for an x86 rep prefix?
Intel's manual says explicitly that it's a while count != 0 loop with the test at the top, which is the sane, expected behaviour.
But most of the many vague reports I've seen elsewhere suggest that there's no initial test for zero, so it would be like a countdown with the test at the end, and therefore a disaster with a zero count, as if it were
repeat {… count -= 1; }
until count == 0;
or who knows.
Nothing happens with RCX=0; rep prefixes do check for zero first, like the pseudocode says. (Unlike the loop instruction, which is exactly like the bottom of a do{}while(--rcx), or a dec rcx / jnz pair but without affecting FLAGS.)
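To make the test-at-the-top behaviour concrete, here is a minimal sketch. It assumes GCC or Clang on x86-64 with the direction flag clear (as the usual calling conventions guarantee); the helper name `fill_rep_stosb` is made up for illustration. On other targets it falls back to a plain C loop that models the manual's pseudocode.

```c
#include <stddef.h>

/* Hypothetical helper: store `val` into `count` bytes at `dst` with rep stosb.
   Returns the pointer left in RDI afterwards, i.e. dst + count. */
static unsigned char *fill_rep_stosb(unsigned char *dst, unsigned char val,
                                     size_t count)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* RDI = dst, RCX = count, AL = val.  With RCX == 0 the instruction does
       zero iterations: no store, no fault, and RDI is left unchanged. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(count)
                     : "a"(val)
                     : "memory");
#else
    /* Portable model of the pseudocode: the count is tested before each step. */
    while (count != 0) {
        *dst++ = val;
        count--;
    }
#endif
    return dst;
}
```

Calling it with a count of zero returns `dst` unchanged and leaves the buffer alone, even if `dst` is a null pointer, because no iteration ever touches memory.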
I think I've heard of this occasionally being used as an idiom for a conditional load or store, with rep lodsw or rep stosw and a count of 0 or 1, especially in the bad old days before cmov. (cmov with a memory source is an unconditional load feeding an ALU select operation, so it needs a valid address, unlike rep lods with a count of zero.) This is not efficient, especially for rep stos, on modern x86 with Fast Strings microcode (P6 and later), especially without anything like Fast Short Rep-Movs (Ice Lake IIRC). Golden Cove (Alder Lake / Sapphire Rapids) additionally has fast zero-length rep movsb, which makes that the same speed as 1-128 bytes, making it not terrible for use cases that sometimes do a zero-length memcpy.
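The conditional-load idiom can be sketched as follows. This is only an illustration of the semantics, not a recommendation: as noted above it is slow on modern CPUs. It assumes GCC or Clang on x86-64 with DF clear; `load_if` and its parameter names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return *src if cond is non-zero, else `otherwise`.
   When cond is zero, src is never dereferenced, so it may be invalid. */
static uint16_t load_if(const uint16_t *src, int cond, uint16_t otherwise)
{
    uint16_t value = otherwise;     /* AX keeps this if no iteration runs */
    size_t count = cond ? 1 : 0;    /* RCX = 0 or 1 */
#if defined(__x86_64__) && defined(__GNUC__)
    /* rep lodsw: while (RCX != 0) { AX = [RSI]; RSI += 2; RCX--; } */
    __asm__ volatile("rep lodsw"
                     : "+a"(value), "+S"(src), "+c"(count)
                     :
                     : "memory");
#else
    /* Portable model of the same loop. */
    while (count != 0) {
        value = *src++;
        count--;
    }
#endif
    return value;
}
```

For example, `load_if(NULL, 0, 7)` returns 7 without faulting, which a cmov with a memory source operand could not do.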
The same applies for instructions that treat the prefixes as repz / repnz (cmps/scas) instead of unconditional rep (lods/stos/movs). Doing zero iterations means they leave FLAGS unmodified.
If you want to check FLAGS after a repe/ne cmps/scas, you need to make sure the count was non-zero, or that FLAGS was already set such that you'll branch in a useful way for zero-length buffers. (Perhaps from xor-zeroing a register that you're going to want later.)
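One way to get well-defined FLAGS for a zero-length buffer is the xor-zeroing trick mentioned above: xor sets ZF=1, and zero iterations of repe cmpsb leave it that way. A minimal sketch, assuming GCC or Clang on x86-64 with DF clear (the function name `bytes_equal` is invented here):

```c
#include <stddef.h>

/* Hypothetical helper: 1 if the first n bytes of a and b match, else 0.
   Zero-length buffers compare equal. */
static int bytes_equal(const void *a, const void *b, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    unsigned int eq;
    __asm__ volatile(
        "xorl %%eax, %%eax\n\t" /* EAX = 0 and ZF = 1 before the loop       */
        "repe cmpsb\n\t"        /* n == 0: zero iterations, ZF stays 1      */
        "sete %%al"             /* AL = ZF: 1 if no mismatch was found      */
        : "=&a"(eq), "+S"(a), "+D"(b), "+c"(n)
        :
        : "cc", "memory");
    return (int)eq;
#else
    /* Portable model: zf starts "set" and is only rewritten by a compare. */
    const unsigned char *p = a, *q = b;
    int zf = 1;
    while (n != 0) {
        zf = (*p++ == *q++);
        n--;
        if (!zf)
            break;
    }
    return zf;
#endif
}
```

Without the xor (or some other instruction that sets ZF), the result for `n == 0` would depend on whatever earlier code happened to leave in FLAGS.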
rep movs and rep stos have fast-strings microcode on CPUs since P6, but the startup overhead makes them rarely worth it, especially when sizes can be short and/or data might be misaligned. They're more useful in kernel code where you can't freely use XMM registers. Some recent CPUs like Ice Lake have fast-short-rep microcode that I think is supposed to reduce startup overhead for small counts.
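For reference, a copy routine built on rep movsb touches only RSI, RDI and RCX, which is what makes it attractive where vector registers are off limits. A minimal sketch, assuming GCC or Clang on x86-64 with DF clear and non-overlapping buffers; `copy_rep_movsb` is a hypothetical name, and whether it beats a tuned memcpy depends on the CPU and the sizes involved.

```c
#include <stddef.h>

/* Hypothetical helper: copy n bytes from src to dst using only integer
   registers.  Returns dst, like memcpy. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__x86_64__) && defined(__GNUC__)
    /* RDI = dst, RSI = src, RCX = n; n == 0 copies nothing. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable model of the same byte-by-byte loop. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n != 0) {
        *d++ = *s++;
        n--;
    }
#endif
    return ret;
}
```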
repe/ne scas/cmps do not have fast-strings microcode on most CPUs, only on very recent CPUs like Sapphire Rapids and maybe Alder Lake P-cores. So they're quite slow, like one load per clock cycle (so 2 cycles per count for cmpsb/w/d/q) according to testing by https://agner.org/optimize/ and https://uops.info/.
GCC -O1 used to use repne scasb to inline strlen. This is a disaster for long strings.
rep movs will use no-RFO stores for large sizes, similar to NT stores but not bypassing the cache. Good general Q&A about memory bandwidth considerations.
For conditional load / store, APX will also introduce a way to do that efficiently and branchlessly, with scalar instead of AVX2 or AVX-512 masking: a fault-suppressing (Conditionally-Faulting) cfcmovcc [mem], reg as well as a load form. See Hard to debug SEGV due to skipped cmov from out-of-bounds memory for more about that and other conditional-load things x86 supports.
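The repne scasb strlen idiom mentioned earlier looks roughly like this. It is a sketch of the classic pattern rather than the exact code any compiler emitted, and it assumes GCC or Clang on x86-64 with DF clear; `strlen_repne_scasb` is a made-up name. The count starts at its maximum value, so the zero-count corner case never arises here.

```c
#include <stddef.h>

/* Hypothetical helper: strlen via repne scasb, scanning for a 0 byte. */
static size_t strlen_repne_scasb(const char *s)
{
    size_t count = (size_t)-1;      /* RCX = -1: effectively "no limit" */
#if defined(__x86_64__) && defined(__GNUC__)
    /* Each iteration compares AL (0) with [RDI], then RDI++ and RCX--;
       the loop stops after the iteration that finds the terminator. */
    __asm__ volatile("repne scasb"
                     : "+D"(s), "+c"(count)
                     : "a"((unsigned char)0)
                     : "cc", "memory");
#else
    /* Portable model of the same loop. */
    const char *p = s;
    while (count != 0) {
        count--;
        if (*p++ == 0)
            break;
    }
#endif
    /* RCX was decremented len + 1 times, so len = ~count - 1. */
    return ~count - 1;
}
```

It processes one byte per iteration, which is why it falls so far behind a vectorized strlen on long strings.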
In 64-bit mode with an address-size prefix to make it use ECX/EDI/ESI instead of RCX/RDI/RSI, writing a 32-bit register will zero-extend into the upper 32 bits, so it matters whether the CPU writes the registers at all when the count is zero. (ECX can be zero while RCX is non-zero, and the pointer registers might have high garbage so that ESI != RSI, for example; using an ABI with 32-bit pointers in long mode may be why you're using an address-size prefix in the first place.)
AMD Zen 3 matches Intel's pseudocode: it only writes registers if ECX is non-zero and an iteration actually happens, so with a zero count the high garbage is preserved.
Intel (including Skylake and a couple of others Paolo tested) always writes ECX, for rep lodsb at least, but writes EDI only if it's actually used as a pointer. I haven't tested other instructions to see if their microcode is different.
This doesn't match the pseudocode in Intel's manual, where all register writes are inside if conditions, but it's not rare for the pseudocode to not match corner-case behaviour (e.g. push rsp, where only the text Description section is accurate). In this case the Description section for rep doesn't mention the corner case of count=0 with 32-bit registers.
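To see which behaviour a given CPU has, a small probe along these lines should work. It is a sketch under assumptions: GCC or Clang on x86-64, with the address-size prefix emitted as a raw 0x67 byte; the function name is hypothetical. The helper forces the low 32 bits of the count to zero so that no memory access can happen, and on non-x86 builds it simply returns the input unchanged, which corresponds to the manual's pseudocode.

```c
#include <stdint.h>

/* Hypothetical probe: run "addr32 rep lodsb" with ECX == 0 but garbage in
   the upper half of RCX, and report what RCX holds afterwards. */
static uint64_t rcx_after_addr32_rep_lodsb(uint64_t rcx)
{
    rcx &= 0xFFFFFFFF00000000ULL;   /* ECX = 0: guarantees zero iterations */
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t rsi = 0, rax = 0;
    __asm__ volatile(
        ".byte 0x67\n\t"            /* address-size prefix: use ECX/ESI */
        "rep lodsb"
        : "+c"(rcx), "+S"(rsi), "+a"(rax)
        :
        : "memory");
#endif
    return rcx;
}
```

Going by the reports above, passing `0xDEADBEEF00000000` should come back unchanged on AMD Zen 3 and as 0 on the Intel CPUs that were tested, where the write to ECX zero-extends into RCX.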