visual-c++ x86-64 memory-alignment calling-convention abi

Why does MSVC x64 C use 8-byte int32 parameter alignment instead of 4-byte?


I'm writing a C compiler as a hobby and would like it to be able to link against C static libraries produced by MSVC.

I read the Microsoft x64 ABI, and it doesn't seem to have a strongly mandated alignment for integer primitive types. It "recommends" aligning them with their natural size, for example an int32 would be 4-byte aligned.

But when I compile a minimal program that passes many ints as parameters, it's clearly using 8-byte alignment for them, despite only referencing them as DWORDs.

int add_many_args(int a, int b, int c, int d, int e, int f, int g, int h) {
    return a + b + c + d + e + f + g + h;
}

a$ = 8
b$ = 16
c$ = 24
d$ = 32
e$ = 40
f$ = 48
g$ = 56
h$ = 64
add_many_args PROC
        mov     DWORD PTR [rsp+32], r9d
        mov     DWORD PTR [rsp+24], r8d
        mov     DWORD PTR [rsp+16], edx
        mov     DWORD PTR [rsp+8], ecx
        mov     eax, DWORD PTR b$[rsp]
        mov     ecx, DWORD PTR a$[rsp]
        add     ecx, eax
        mov     eax, ecx
        add     eax, DWORD PTR c$[rsp]
        add     eax, DWORD PTR d$[rsp]
        add     eax, DWORD PTR e$[rsp]
        add     eax, DWORD PTR f$[rsp]
        add     eax, DWORD PTR g$[rsp]
        add     eax, DWORD PTR h$[rsp]
        ret     0
add_many_args ENDP

First question: why does it do that? Why isn't it aligning them to their natural size, 4 bytes?

Second question is: as I try to write a compiler that aims to be able to link against C static libraries, how am I supposed to know what alignment the library used, so that my code can correctly pass stack parameters to library functions? I hear people say that the "C ABI is stable", so where are the rules for this written down?


Solution

  • alignof(int) is still 4; to see that, look at local vars or struct layout. The Windows x64 calling convention makes every arg take exactly 8 bytes (one register or one stack slot), so variadic functions are easy: the callee just dumps the 4 arg-passing regs to shadow space and indexes the args as an array.

    It's normal for other calling conventions, too, to make each arg take the stack space of a register, instead of having complicated rules for something like foo(int a, int64_t b, double c) to make sure the wider args are aligned.
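
To make that trade-off concrete, here is a minimal sketch comparing the two layouts. The names slot_offset and packed_offset are hypothetical, and the packed scheme is a made-up alternative for illustration only, not something any real ABI is claimed to use.

```c
#include <stddef.h>

/* One 8-byte slot per argument: the offset depends only on the index. */
size_t slot_offset(size_t index) {
    return index * 8;
}

/* Hypothetical packed scheme: each argument is aligned to its own size
   (sizes must be powers of two), so the offset of argument `index`
   depends on the sizes of every earlier argument. */
size_t packed_offset(const size_t *sizes, size_t index) {
    size_t offset = 0;
    for (size_t i = 0; i <= index; i++) {
        /* Round the running offset up to this argument's alignment. */
        offset = (offset + sizes[i] - 1) & ~(sizes[i] - 1);
        if (i < index)
            offset += sizes[i];
    }
    return offset;
}
```

For foo(int a, int64_t b, double c), i.e. sizes {4, 8, 8}, the packed scheme still has to pad after a to align b, so it lands on the same 0, 8, 16 as the slot scheme. It only saves space when narrow args happen to be adjacent.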

    The Windows x64 docs (https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170#parameter-passing) don't clearly state that stack arg slots are always 8 bytes even for narrow types, but they are.
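
As a sketch of what that rule means for a compiler writer, the offsets in the question's listing (a$ = 8 through h$ = 64) fall out of a one-line formula. The helper names below are hypothetical, and read_int_arg only models the home area as a plain array of 8-byte slots; it is not how MSVC implements va_arg.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset from RSP at function entry to the slot of argument `index`
   (0-based). [rsp+0] holds the return address, so slot 0 is at +8. */
size_t win64_entry_offset(size_t index) {
    return 8 + 8 * index;
}

/* Read a 32-bit argument from an array of 8-byte slots, the way a
   variadic callee could after spilling RCX, RDX, R8, R9 to shadow space.
   Only the low 32 bits are used, matching the DWORD loads in the
   question's listing; the upper half of the slot is ignored. */
int32_t read_int_arg(const uint64_t *slots, size_t index) {
    return (int32_t)(uint32_t)slots[index];
}
```

So the fifth int arg (index 4, the first one that is actually passed on the stack) is at [rsp+40] on entry, right above the 32 bytes of shadow space.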

    The normal reason for making stack args take a full stack slot is to allow narrow args to be written with push, but you don't normally do that in Windows x64 because shadow space goes below them. So normally you'd sub rsp, imm8 at the top of a function and use mov to store args, not constantly push and deallocate / reallocate shadow space. I can't immediately think of a reason why packing narrow args wouldn't work, just enforcing that each one is aligned by at least alignof(T), but it's not a big deal, especially since aligning RSP by 16 before a call would often mean rounding up the space needed for stack args anyway.
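
A rough sketch of the size calculation a caller might do for its outgoing argument area. The function name is hypothetical, and it assumes the rest of the frame is laid out so that a multiple of 16 here keeps RSP 16-byte aligned at the call.

```c
#include <stddef.h>

/* Bytes a caller reserves for the outgoing argument area of a call with
   `nargs` arguments: one 8-byte slot each, never less than the 32 bytes
   of shadow space, rounded up to a multiple of 16. */
size_t win64_outgoing_area(size_t nargs) {
    size_t slots = nargs < 4 ? 4 : nargs;   /* shadow space is always there */
    size_t bytes = slots * 8;
    return (bytes + 15) & ~(size_t)15;      /* round up to 16 */
}
```

With 5 args the slots need 40 bytes but the area rounds up to 48, so the 4 bytes that packing one narrow stack arg might save would disappear into padding anyway.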


    Examples

    On Godbolt, with MSVC, GCC -O2 -mabi=ms, and GCC -O2 targeting Linux (-mabi=sysv is the default for Godbolt's Linux compilers):

    int foo(){
        volatile int a = 1;
        volatile int b = 2;
        volatile int c = 3;
        return a+b+c;
    }
    

    Huh, strangely MSVC chooses to put each one in a separate 8-byte slot of the shadow space its caller reserved.

    ; x64 MSVC 19.40 -O2
    c$ = 8               ; offsets from the return address where RSP points on function entry
    b$ = 16
    a$ = 24
    int foo(void) PROC                                        ; foo, COMDAT
            mov     DWORD PTR a$[rsp], 1
            mov     DWORD PTR b$[rsp], 2
            mov     DWORD PTR c$[rsp], 3
            mov     ecx, DWORD PTR c$[rsp]
            mov     eax, DWORD PTR b$[rsp]      ; apparently it doesn't want to add eax, mem with volatile?
            add     ecx, eax
            mov     eax, DWORD PTR a$[rsp]
            add     eax, ecx
            ret     0
    

    But GCC does what I expected:

    # Linux GCC 14.2 -O2 -mabi=ms
    foo():
            sub     rsp, 24             # unfortunately fails to use its shadow space
            mov     DWORD PTR [rsp+4], 1
            mov     DWORD PTR [rsp+8], 2
            mov     DWORD PTR [rsp+12], 3
            mov     eax, DWORD PTR [rsp+4]
            mov     ecx, DWORD PTR [rsp+8]
            mov     edx, DWORD PTR [rsp+12]  # volatile defeats add eax, mem
            add     rsp, 24
            add     eax, ecx
            add     eax, edx
            ret
    

    In a debug build with more variables, MSVC will pack them only 4 bytes apart. In an optimized build with a bunch more unused volatile variables, all = 2 from copy/paste, it will store them all in the same place, [rsp+32]!! (I put a #if 0 in the Godbolt link.)

    struct int3{
        int a,b,c;
    };
    
    int bar(int3 st){
        return st.a + st.b + st.c;
    }
    
    ; x64 MSVC -O2
    int bar(int3) PROC                       ; bar, COMDAT
            mov     eax, DWORD PTR [rcx+8]
            add     eax, DWORD PTR [rcx+4]
            add     eax, DWORD PTR [rcx]
            ret     0
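
The [rcx], [rcx+4], [rcx+8] offsets in that listing are just the struct's normal layout, which can be checked at compile time. This is a small sketch written as plain C (so struct int3 instead of the C++ spelling above); int3_sum is a hypothetical name, and the assertions assume the usual 4-byte int of the x86-64 ABIs.

```c
#include <assert.h>
#include <stddef.h>

struct int3 {
    int a, b, c;
};

/* int members are packed 4 bytes apart inside a struct, even though
   each int function argument gets an 8-byte slot. */
static_assert(_Alignof(int) == 4, "int is 4-byte aligned");
static_assert(offsetof(struct int3, a) == 0, "a is at [rcx]");
static_assert(offsetof(struct int3, b) == 4, "b is at [rcx+4]");
static_assert(offsetof(struct int3, c) == 8, "c is at [rcx+8]");
static_assert(sizeof(struct int3) == 12, "no padding between members");

/* Same computation as bar(), taking the pointer explicitly. */
int int3_sum(const struct int3 *st) {
    return st->a + st->b + st->c;
}
```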
    

    Windows x64 passes objects larger than 8 bytes (or whose size isn't 1, 2, 4, or 8 bytes) by pointer to space allocated by the caller. So it's like bar(int3 &st), except the caller has to make a copy, so that changes the callee makes to the arg object aren't visible in the caller's own object if its value is used after the call.
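
Roughly, a compiler targeting this convention has to lower a by-value struct call as sketched below. All function names here are hypothetical, and the size test follows the rule in Microsoft's x64 calling convention docs for structs and unions (sizes of 1, 2, 4, or 8 bytes are passed like integers, anything else by pointer).

```c
#include <stdbool.h>
#include <stddef.h>

struct int3 {
    int a, b, c;
};

/* True if a struct or union of this size is passed as a pointer to
   caller-allocated memory rather than in a register or stack slot. */
bool win64_passed_by_pointer(size_t size) {
    return !(size == 1 || size == 2 || size == 4 || size == 8);
}

/* What the callee effectively receives: a pointer to its own copy. */
static int bar_lowered(struct int3 *st) {
    st->a += 100;                 /* only modifies the temporary */
    return st->a + st->b + st->c;
}

/* What the caller has to do for bar(x): copy first, then pass the
   address of the copy so x itself is never modified. */
int call_bar(const struct int3 *x) {
    struct int3 tmp = *x;         /* caller-made copy */
    return bar_lowered(&tmp);
}
```

After call_bar(&x) the caller's x is unchanged even though the callee wrote through its pointer, which is the by-value behavior the C source asked for.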

    Just for fun, compare the x86-64 System V calling convention which passes structs up to 16 bytes in a pair of registers. In this case, the first two integer arg-passing regs for that convention, RDI and RSI:

    # x86-64 Linux GCC -O2
    bar(int3):
            mov     rax, rdi
            shr     rax, 32         # st.b
            add     eax, edi        # st.a
            add     eax, esi        # st.c