c assembly x86-64 arm64 calling-convention

Why do modern calling conventions pass variadic arguments in registers?


If we look at a few modern calling conventions, such as x86-64 SysV or AArch64 (the document aapcs64.pdf, titled "Procedure Call Standard for the Arm® 64-bit Architecture"), we see explicit notes that variadic arguments are passed the same way as other arguments. For example, a call open(path, flags, mode) on x86-64 puts path in RDI, flags in RSI and mode (the only variadic one) in RDX.

There is no question about passing a fixed argument set in registers; it saves resources. But if we look at a function that interprets variadic arguments, and so calls va_start, we see that va_start is compiled into code that stores every register that could hold a variadic argument (typically many more than were actually passed) on the stack. For example, a full emulation of printf via vfprintf starts with the following (I compacted similar rows to avoid overly long listings):

my_printf:
        endbr64
# nearly unconditional saving
        subq    $216, %rsp
        movq    %rsi, 40(%rsp)
<...>
        movq    %r9, 72(%rsp)
        testb   %al, %al
        je      .L2
        movaps  %xmm0, 80(%rsp)
<...>
        movaps  %xmm7, 192(%rsp)
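.L2:
        movq    %fs:40, %rax            # stack protector canary
        movq    %rax, 24(%rsp)
        xorl    %eax, %eax
# building the va_list at (%rsp)
        movl    $8, (%rsp)              # gp_offset: RDI already holds the format
        movl    $48, 4(%rsp)            # fp_offset: no XMM register consumed yet
        leaq    224(%rsp), %rax
        movq    %rax, 8(%rsp)           # overflow_arg_area: stack-passed arguments
        leaq    32(%rsp), %rax
        movq    %rax, 16(%rsp)          # reg_save_area: the block filled above
# repacking into registers for enclosed vfprintf
        movq    %rsp, %rcx              # pointer to the va_list
        movq    %rdi, %rdx              # format string
        movl    $1, %esi
# finally, call the function
        movq    stdout(%rip), %rdi
        call    __vfprintf_chk@PLT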
... skipped epilogue
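
For reference, a wrapper of roughly this shape produces a listing like the one above. This is a reconstruction under assumptions, not the original source: the name my_printf comes from the listing, but the body is a guess, and the call to __vfprintf_chk suggests the build had fortification enabled, which replaces vfprintf with its checked variant.

```c
#include <stdarg.h>
#include <stdio.h>

/* Assumed source of the wrapper: forwards its variadic tail to vfprintf. */
int my_printf(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);              /* this is what triggers the register spill */
    n = vfprintf(stdout, fmt, ap);
    va_end(ap);
    return n;                       /* number of characters written, as printf does */
}
```

The function contains no argument handling of its own, so everything in the prologue is the cost of va_start.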

Here the register save area takes 176 bytes, of which 168 (RSI..R9 and XMM0..XMM7) are actually written. Similarly, the AArch64 version stores 184 bytes (x1..x7 and q0..q7).
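
To make the role of that save area concrete, here is a minimal sketch of how the x86-64 SysV va_list is consumed. The four fields and the 48-byte limit follow the ABI; the type name sysv_va_list_model and the helper model_va_arg_i64 are made up for illustration, and the real va_arg is expanded by the compiler rather than called as a function.

```c
#include <stdint.h>
#include <string.h>

/* Same four fields as the va_list described in the x86-64 SysV ABI. */
typedef struct {
    unsigned int gp_offset;     /* offset of the next unread GP register slot */
    unsigned int fp_offset;     /* offset of the next unread XMM register slot */
    void *overflow_arg_area;    /* next argument that was passed on the stack */
    void *reg_save_area;        /* block where the prologue stored the registers */
} sysv_va_list_model;

/* Hypothetical helper: roughly what va_arg(ap, long) has to do. */
static int64_t model_va_arg_i64(sysv_va_list_model *ap)
{
    unsigned char *src;
    int64_t value;

    if (ap->gp_offset < 48) {   /* 6 GP registers * 8 bytes each */
        src = (unsigned char *)ap->reg_save_area + ap->gp_offset;
        ap->gp_offset += 8;
    } else {                    /* registers exhausted: read from the stack */
        src = ap->overflow_arg_area;
        ap->overflow_arg_area = src + 8;
    }
    memcpy(&value, src, sizeof value);
    return value;
}
```

Every integer fetch goes through this two-way branch, which is the run-time cost in addition to the stores in the prologue.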

If the variadic tail of every function call were always put on the stack, the code would be much simpler and cheaper at run time, because none of this packing and copying would be needed. va_start would reduce to a single move of the variadic list's starting address (on the stack) into a variable. This is how it actually worked on i386 (where all arguments were passed on the stack). Assembly output of the same trivial wrapper for Linux/i386:

my_printf:
        pushl   %ebx
        subl    $8, %esp
        call    __x86.get_pc_thunk.bx
        addl    $_GLOBAL_OFFSET_TABLE_, %ebx
        leal    20(%esp), %eax # <--- This is va_start
        pushl   %eax # VA pointer pushed for vfprintf
        pushl   20(%esp)
        pushl   $1
        movl    stdout@GOT(%ebx), %eax
        pushl   (%eax)
        call    __vfprintf_chk@PLT
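
A stack-only va_list like the i386 one can be modelled in a few lines. This is only an illustrative sketch: the names stack_va_list_model, model_va_start, model_va_arg_i32 and sum_slots are invented here, and a plain array stands in for the argument area of a real stack frame.

```c
#include <stdint.h>
#include <string.h>

/* Model of a stack-only va_list: nothing but a pointer to the next slot. */
typedef struct {
    const unsigned char *next;
} stack_va_list_model;

/* va_start is a single pointer assignment. */
static void model_va_start(stack_va_list_model *ap, const void *first_slot)
{
    ap->next = first_slot;
}

/* va_arg reads one slot and advances; no branch, no register save area. */
static int32_t model_va_arg_i32(stack_va_list_model *ap)
{
    int32_t value;

    memcpy(&value, ap->next, sizeof value);
    ap->next += 4;              /* int arguments occupy 4-byte slots on i386 */
    return value;
}

/* Example consumer: add up `count` int arguments laid out back to back. */
static int32_t sum_slots(const void *slots, int count)
{
    stack_va_list_model ap;
    int32_t total = 0;

    model_va_start(&ap, slots);
    while (count-- > 0)
        total += model_va_arg_i32(&ap);
    return total;
}
```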

Hence the question: why is the implementation of variadic arguments, at least on x86-64 and AArch64, so complicated and wasteful?

(I could imagine cases where both styles, one with fixed arguments and one with a variadic list, have to be equally allowed in declarations of the same function. But I don't know of such a case. The open mentioned above is unlikely to be one.)


Solution

  • Note that not all calling conventions do so. For example, the AArch64 calling convention used on macOS passes variadic arguments on the stack.

    That said, a key motivation for passing variadic arguments in registers is that neither caller nor callee needs to know whether a function is variadic. For example, if you were to call a prototype-less function declared like this:

    int printf();
    

    you wouldn't be able to know whether it's a variadic function. But because variadic and non-variadic functions share the same calling convention, the caller can simply set AL as if it were a variadic function and call it; the callee ignores AL if it is not variadic.

    This is not possible with the macOS calling convention, where programs that don't consistently declare variadic functions with prototypes will fail when executed.