If I have an array, or a pointer to an array, of type uint32_t, the array is almost always short enough to index with a uint32_t. However, I've found that when I use uint64_t instead of uint32_t as the index type, the compiler avoids some mov instructions.
Here's a very simple example to illustrate this:
Source code:
#include <cstdint>

// Overload taking a 32-bit index.
void set_value(uint32_t* __restrict array, const uint32_t idx, const uint32_t value)
{
    array[idx] = value;
}

// Overload taking a 64-bit index.
void set_value(uint32_t* __restrict array, const uint64_t idx, const uint32_t value)
{
    array[idx] = value;
}
Generated assembly using Clang 14.0.0 with -O3 -march=skylake -std=c++17 -mno-vzeroupper:
set_value(unsigned int*, unsigned int, unsigned int):
        mov     eax, esi
        mov     dword ptr [rdi + 4*rax], edx
        ret
set_value(unsigned int*, unsigned long, unsigned int):
        mov     dword ptr [rdi + 4*rsi], edx
        ret
Can anyone please explain the different compiler outputs? Why does the uint64_t version skip the mov instruction?
In the uint32_t version, the index is passed to the routine as 32 bits in the 32-bit esi register, but the compiler needs a 64-bit index to go with the 64-bit base address of the array. The compiler cannot assume that the upper bits of the full 64-bit rsi register (of which esi is the low half) are zero, so it has to use an instruction that clears the upper 32 bits.
Although mov eax, esi is nominally a 32-bit instruction, the Intel 64 and IA-32 Architectures Software Developer’s Manual, December 2017, clause 3.4.1.1, “General-Purpose Registers in 64-Bit Mode”, tells us:
32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.
(This zero extension is primarily so that the contents of the result register depend solely on the instruction being executed, not on prior instructions that left something in the upper 32 bits. Otherwise the processor, not knowing that the program does not actually need those bits, would have to wait for the prior instructions that produced them, which could slow program execution.)
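As a rough illustration of the rule quoted above, the effect of mov eax, esi can be modeled in plain C++. This is only a sketch: mov32 and set_value_model are hypothetical names invented for this example, and a uint64_t variable stands in for the 64-bit register rsi, whose upper half may hold leftover bits.

```cpp
#include <cstdint>

// Hypothetical model of a 32-bit register-to-register mov in 64-bit mode:
// only the low 32 bits of the source are read, and the result is
// zero-extended to 64 bits in the destination.
constexpr std::uint64_t mov32(std::uint64_t src)
{
    return static_cast<std::uint64_t>(static_cast<std::uint32_t>(src));
}

// Hypothetical model of the uint32_t overload of set_value: `rsi` is the
// full 64-bit register, of which only the low 32 bits carry the index.
inline void set_value_model(std::uint32_t* array, std::uint64_t rsi, std::uint32_t value)
{
    const std::uint64_t rax = mov32(rsi); // models: mov eax, esi
    array[rax] = value;                   // models: mov dword ptr [rdi + 4*rax], edx
}

// Leftover bits in the upper half are discarded, so the index is 5.
static_assert(mov32(0xDEADBEEF00000005ULL) == 5, "upper 32 bits are cleared");
// A value that already fits in 32 bits passes through unchanged.
static_assert(mov32(7) == 7, "low 32 bits are preserved");
```

Indexing with the unmodified rsi value in the first static_assert would address far outside the array, which is why the uint32_t version needs the extra instruction. In the uint64_t version all 64 bits of rsi are the index, so no such step is required.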