This is from musl's source code:
1 __syscall_cp_asm:
2 __cp_begin:
3 mov (%rdi),%eax
4 test %eax,%eax
5 jnz __cp_cancel
6 mov %rdi,%r11
7 mov %rsi,%rax
8 mov %rdx,%rdi
9 mov %rcx,%rsi
10 mov %r8,%rdx
11 mov %r9,%r10
12 mov 8(%rsp),%r8
13 mov 16(%rsp),%r9
14 mov %r11,8(%rsp)
15 syscall
16 __cp_end:
17 ret
18 __cp_cancel:
19 jmp __cancel
I am curious what the purpose of lines 6 and 14 is (renumbered from the linked source).
From what I understand, the beginning of the code tests the target of the pointer passed as the 1st argument (lines 3–5); line 6 then moves the pointer to r11, and line 14 stores it in the stack slot that was used to pass the 7th argument.
This doesn't seem useful. Do these moves accomplish anything?
This is to support pthread cancellation points; a signal handler can later look at the stack.
The commit log for the commit that introduced this code explains that storing a pointer at a known place on the stack before a syscall makes it possible for the "cancellation signal handler" to determine "whether the interrupted code was in a cancellable state." (The initial version of that code also saves the address of the syscall instruction, but later commits changed that.)
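The __cp_begin and __cp_end labels in the listing suggest one way a handler could make that decision: compare the interrupted instruction pointer with the two label addresses. Below is a minimal sketch of that check, assuming the handler works this way. The helper name pc_in_cancel_window and the addresses in the usage note are made up for illustration and are not taken from musl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not musl's code: returns true when the interrupted
 * instruction pointer lies in the half-open window [begin, end).
 * begin stands in for the address of __cp_begin and end for __cp_end. */
bool pc_in_cancel_window(uintptr_t pc, uintptr_t begin, uintptr_t end)
{
    return pc >= begin && pc < end;
}
```

With begin = 0x1000 and end = 0x1020, a pc of 0x101e (still at or before the syscall instruction) falls inside the window, while a pc equal to end (the syscall has already returned and only ret is left) does not. The half-open range is the point of the design: once the syscall has completed, its result should not be thrown away.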
The first arg (which that asm function stores on the stack) comes from its C caller, __syscall_cp_c, which calls __syscall_cp_asm(&self->cancel, nr, u, v, w, x, y, z), where self came from __pthread_self().
You're correct: overwriting the caller's stack arg with a different incoming arg is not "visible" to a C caller following the x86-64 System V ABI. (A callee owns its stack args; the caller has to assume they've been overwritten, so compiler-generated code will never read that memory location as an output.) So there had to be some other explanation.
Using 2 total mov instructions to copy the incoming RDI into 8(%rsp) after reading that memory location is, I think, necessary. We can't delay the mov %rdx,%rdi until after the load because we need to free up RDX to hold R8, to free up R8 to hold the load. You could avoid touching an "extra" register by using R10 before it's used to load the other arg, but it would still take at least 2 instructions.
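To make the shuffle easier to follow, here is a toy C model that replays lines 6–14 of the listing, one assignment per mov. It is only a sketch: struct machine, entry_state and shuffle_for_syscall are hypothetical names, and the model assumes the usual x86-64 System V layout on entry (six register args, return address at (%rsp), 7th and 8th args at 8(%rsp) and 16(%rsp)) and the Linux syscall convention (number in RAX, args in RDI, RSI, RDX, R10, R8, R9).

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the registers and stack slots the listing touches.
 * stack[0] models (%rsp) (the return address), stack[1] models 8(%rsp),
 * stack[2] models 16(%rsp). */
struct machine {
    uint64_t rax, rdi, rsi, rdx, rcx, r8, r9, r10, r11;
    uint64_t stack[3];
};

/* State on entry to a function called as f(ptr, nr, u, v, w, x, y, z). */
struct machine entry_state(uint64_t ptr, uint64_t nr, uint64_t u, uint64_t v,
                           uint64_t w, uint64_t x, uint64_t y, uint64_t z)
{
    struct machine m = {0};
    m.rdi = ptr; m.rsi = nr; m.rdx = u;
    m.rcx = v;   m.r8 = w;   m.r9 = x;
    m.stack[1] = y;          /* 7th arg */
    m.stack[2] = z;          /* 8th arg */
    return m;
}

/* Replays lines 6-14 of the listing in order. */
void shuffle_for_syscall(struct machine *m)
{
    m->r11 = m->rdi;         /* line 6:  save the pointer before RDI is reused */
    m->rax = m->rsi;         /* line 7:  syscall number                        */
    m->rdi = m->rdx;         /* line 8:  syscall arg 1, overwrites the pointer */
    m->rsi = m->rcx;         /* line 9:  syscall arg 2                         */
    m->rdx = m->r8;          /* line 10: syscall arg 3                         */
    m->r10 = m->r9;          /* line 11: syscall arg 4                         */
    m->r8  = m->stack[1];    /* line 12: syscall arg 5, read from 8(%rsp)      */
    m->r9  = m->stack[2];    /* line 13: syscall arg 6, read from 16(%rsp)     */
    m->stack[1] = m->r11;    /* line 14: park the pointer in the freed slot    */
}
```

Starting from entry_state(0xC0FFEE, 60, 1, 2, 3, 4, 5, 6), the model ends with RAX holding 60, the six syscall args in RDI, RSI, RDX, R10, R8 and R9, and 0xC0FFEE in the slot for 8(%rsp). Line 8 has to run before line 12 can be reached, and line 14 can only run after line 12 has read the slot, which is why the pointer needs a temporary home in between.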
Or the arg order could be optimized to pass that pointer in a later arg, perhaps passing the call number last and the pthread pointer in the last register arg (minimal shuffling, but avoiding the need for a double dereference for that test/branch) or in the first stack arg (where you want it anyway). Or it could match the arg order of the __syscall wrapper, which takes nr first with no pthread pointer.