int_fast8_t size vs int_fast16_t size on x86-64 platform

I already learned that on x86-64, using any 64-bit register needs a REX prefix, and any address size smaller than 64 bits requires an address-size prefix.

On x86-64:

E3 rel8 is jrcxz

67 E3 rel8 is jecxz

67 is the address-size override prefix (a prefix byte, not an opcode).

int_fast8_t is 8 bits wide, while int_fast16_t and int_fast32_t are 64 bits wide (on Linux only).

Why is only int_fast8_t 8 bits when the other fast typedefs are 64 bits?

Does it have something to do with alignment?

Solution

Why is only int_fast8_t 8 bits when the other fast typedefs are 64 bits?

Because glibc made a simplistic and arguably bad choice when these C99 types were new, and made the bad decision not to specialize them for x86-64.

All of int_fast16/32/64_t are defined as long across all platforms. That was done in May 1999 before AMD64 was announced with a paper spec (Oct 1999) which devs presumably took some time to grok. (Thanks @Homer512 for finding the commit and the history.)

long is the full (integer) register width on 32-bit and 64-bit GNU systems, which is also the pointer width.

For most 64-bit RISCs, full width is fairly natural, although IDK about multiply and divide speeds. It's glaringly bad for x86-64 where 64-bit operand-size takes extra code-size, but MIPS daddu and addu for example are the same code size and presumably equivalent performance. (Before x86-64, it was common for RISC ABIs to keep narrow types sign-extended to 64-bit all the time, because MIPS at least actually required that for non-shift instructions. See MOVZX missing 32 bit register to 64 bit register for some more history.)

Glibc's choice makes these types mostly ok for local variables, at least if you don't multiply or divide or __builtin_popcount or any other operation that might take more work with more bits (especially without hardware popcnt support). But not good anywhere that storage space in memory matters.

If you were hoping for a "pick a bigger-than-specified size only if that avoids any performance potholes" type, that's not remotely what glibc is giving you.

I seem to recall MUSL making a better choice on x86-64, like maybe every fast size being the minimum size except maybe fast16 being 32-bit, avoiding operand-size prefixes and partial-register stuff.

fast raises the question "fast for what?", and the answer isn't the same size for every use-case. For example, in something that can auto-vectorize with SIMD, the narrowest integers possible are usually the best, to get twice as much work done per 16-byte vector instruction. In that case, 16-bit integers can be justified. Or just for cache footprint in arrays. But don't expect that fastxx_t types are going to consider a tradeoff of "not too much slower" vs. saving size in arrays.

Usually narrow load/store instructions are fine on most ISAs, so you should have int or int_fastxx_t locals and narrow array elements if cache footprint is a relevant consideration. But glibc's choice is often bad even for local vars.

Maybe glibc people were only counting instructions, not code-size (REX prefixes) or the cost of multiply and divide (which definitely was slower for 64-bit than 32 or narrower, especially on those early AMD64 CPUs; integer division was still much slower for 64-bit on Intel until Ice Lake, see https://uops.info/ and https://agner.org/optimize/).

And not looking at the effect on struct sizes both directly and due to the alignof(T) == 8. (Although the sizes of the fast types aren't set in the x86-64 System V ABI, so it's probably best not to use them at ABI boundaries, like structs involved in a library API.)

I don't really know why they made such a bad mistake, but it makes int_fastxx_t types useless for anything except local variables (not most structs or arrays) because x86-64 GNU/Linux is an important platform for most portable code, and you don't want your code to suck there.

Kind of like how MinGW's braindead decision to have std::random_device return low-quality random numbers (instead of failing until they got around to implementing something usable) was like dumping radioactive waste on it as far as portable code being able to use the language feature for the intended purpose.

One of the few advantages of using 64-bit integers is maybe avoiding dealing with garbage in the high part of regs at ABI boundaries (function args and return values). But normally that doesn't matter, unless you need to extend it to pointer width as part of an addressing mode. (In x86-64, all registers in an addressing mode have to be the same width, like [rdi + rdx*4]. AArch64 has modes like [x0, w1, sxtw] that sign-extend a 32-bit register as an index for a 64-bit register. But AArch64's machine-code format was designed from scratch, and came later with the hindsight of seeing other 64-bit ISAs in action.)

e.g. arr[ foo(i) ] can avoid an instruction to zero-extend a return value if the return type fills a register. Otherwise it needs to be sign or zero-extended to pointer width before it can be used in an addressing mode, with a mov or movsxd (32 to 64-bit) or movzx or movsx (8 or 16-bit to 64-bit).

Or with the way x86-64 System V passes and returns structs by value in up to 2 registers, 64-bit integers don't need any unpacking since they're already in a register by themselves. e.g. struct { int32_t a,b; } has both ints packed into RAX in a return value, needing work in the callee to pack and caller to unpack if actually using the result, not just storing the object-representation to a struct in memory. (e.g. mov ecx, eax to zero-extend the low half / shr rax, 32. Or just add ebx, eax to use the low half and then discard it with the shift; you don't need to zero-extend it to 64-bit to use it just as a 32-bit integer.)

Within a function, compilers will know a value is already zero-extended to 64-bit after writing a 32-bit register. And loading from memory, even sign-extension to 64-bit is free (movsxd rax, [rdi] instead of mov eax, [rdi]). (Or almost free on older CPUs where memory-source sign-extension still needed an ALU uop, not done as part of a load uop.)

Because signed integer overflow is UB, compilers are able to widen int (int32_t) to 64-bit in loops like for (int i = 0 ; i < n ; i++ ) arr[i] += 1;, or convert it to a 64-bit pointer increment. (I wonder if GCC maybe couldn't do this back in the early 2000s when these software design decisions were being made? In that case, yes, wasted movsxd instructions to keep re-extending a loop counter to 64-bit would be an interesting consideration.)

But to be fair, you can still have sign-extension instructions from using signed 32-bit integer types in computations which might produce negative results if you then use those to index arrays. So 64-bit int_fast32_t avoids those movsxd instructions, at the cost of being worse in other cases. Maybe I'm discounting this because I know to avoid it, e.g. using unsigned when appropriate because I know it zero-extends for free on x86-64 and AArch64.

For actual computation, 32-bit operand-size is generally at least as fast as anything else including for imul/div and popcnt, and avoids partial-register penalties or extra movzx instructions you get with 8-bit or 16-bit.

The advantages of using 32bit registers/instructions in x86-64
Why is default operand size 32 bits in 64 mode? - 32-bit needs no REX or operand-size prefix.

But 8-bit is not bad, and if your numbers are that small, it's even worse to balloon them to 32 or 64-bit; there's probably more of an expectation from programmers that int_fast8_t will be small unless it's a lot more expensive to make it larger. It isn't on x86-64. (See "Are there any modern CPUs where a cached byte store is actually slower than a word store?": yes, most non-x86 apparently, but x86 does make bytes and 16-bit words fast for load/store as well as computation.)

Avoiding 16-bit is probably good, worth the cost of an extra 2 bytes in some cases. add ax, 12345 (and other imm16 instructions) have LCP decode stalls on Intel CPUs. Plus partial-register false dependencies (or on older CPUs, merging stalls).

jrcxz vs. jecxz is a weird example because it uses the 67h address-size prefix, rather than the 66h operand-size prefix. And because compilers never(?) use it. It's not as slow as the loop instruction, but it's surprisingly not single-uop even on Intel CPUs that can macro-fuse a test/jz into a single uop.