Intel CET (Control-flow Enforcement Technology) consists of two pieces: SS (shadow stack) and IBT (indirect branch tracking). If you need to branch indirectly to somewhere that you can't put an endbr64 for some reason, you can suppress IBT for a single jmp or call instruction with notrack. Is there an equivalent way to suppress SS for a single ret instruction?
For context, I'm thinking about how this will interact with retpolines, whose key control flow goes more or less like push real_target; call retpoline; pop junk; ret. If there's no way to suppress SS for that ret, then is there some other way for retpolines to work when CET is enabled? If not, what options will we have? Will we need to maintain two sets of binary packages for everything, one for old CPUs that need retpolines and one for new CPUs that support CET? And what if Intel turns out to be wrong, and we do end up still needing retpolines on their new CPUs? Will we have to abandon CET to use them?
After playing with the assembly for a bit, I discovered that you can use retpolines with CET, but it's less than ideal. Here's how. For reference, consider this C code:
    extern void (*fp)(void);
    int f(void) {
        fp();
        return 0;
    }
Compiling it with gcc -mindirect-branch=thunk -mfunction-return=thunk -O3 yields this:
    f:
        subq $8, %rsp
        movq fp(%rip), %rax
        call __x86_indirect_thunk_rax
        xorl %eax, %eax
        addq $8, %rsp
        jmp __x86_return_thunk
    __x86_return_thunk:
        call .LIND1
    .LIND0:
        pause
        lfence
        jmp .LIND0
    .LIND1:
        lea 8(%rsp), %rsp
        ret
    __x86_indirect_thunk_rax:
        call .LIND3
    .LIND2:
        pause
        lfence
        jmp .LIND2
    .LIND3:
        mov %rax, (%rsp)
        ret
It turns out you can make this work just by modifying the thunks to look like this:
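    __x86_return_thunk:
        call .LIND1             # pushes the address of .LIND0 on both stacks
    .LIND0:                     # speculation trap
        pause
        lfence
        jmp .LIND0
    .LIND1:
        push %rdi               # save %rdi
        movl $1, %edi
        incsspq %rdi            # drop one entry (.LIND0) from the shadow stack
        pop %rdi                # restore %rdi
        lea 8(%rsp), %rsp       # drop .LIND0 from the regular stack
        ret                     # both stacks now agree on the caller's address
    __x86_indirect_thunk_rax:
        call .LIND3             # pushes the address of .LIND2 on both stacks
    .LIND2:                     # speculation trap
        pause
        lfence
        jmp .LIND2
    .LIND3:
        push %rdi               # save %rdi
        rdsspq %rdi             # %rdi = current shadow stack pointer
        wrssq %rax, (%rdi)      # overwrite the top shadow stack entry with the target
        pop %rdi                # restore %rdi
        mov %rax, (%rsp)        # overwrite the regular return address with the target
        ret                     # "returns" to the target in %rax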
By using the incsspq, rdsspq, and wrssq instructions, you can modify the shadow stack to match your changes to the real stack. I tested those modified thunks with Intel SDE, and they indeed made the control flow errors go away.
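To make the reasoning concrete, here is a toy C model of the two stacks. It is only a sketch: real CET keeps the shadow stack in protected memory and the CPU does the comparison in hardware. The names Machine, do_call, do_ret, demo_indirect, and demo_return, as well as the placeholder addresses, are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEPTH 16

/* Toy model: both stacks hold return addresses only. */
typedef struct {
    uintptr_t stack[DEPTH];   /* regular stack */
    uintptr_t shadow[DEPTH];  /* shadow stack */
    size_t sp, ssp;           /* number of entries in use on each */
} Machine;

/* Placeholder addresses (hypothetical values). */
enum { TRAP = 0x1000, REAL_TARGET = 0x2000, CALLER = 0x3000, F_RETURN = 0x4000 };

/* call: push the return address on both stacks. */
static void do_call(Machine *m, uintptr_t ret_addr) {
    m->stack[m->sp++] = ret_addr;
    m->shadow[m->ssp++] = ret_addr;
}

/* ret: pop both stacks; returning false models a control-protection fault. */
static bool do_ret(Machine *m, uintptr_t *dest) {
    uintptr_t a = m->stack[--m->sp];
    uintptr_t b = m->shadow[--m->ssp];
    *dest = a;
    return a == b;
}

/* Models __x86_indirect_thunk_rax; fix_shadow selects the wrssq variant. */
static bool indirect_thunk(Machine *m, uintptr_t target, bool fix_shadow,
                           uintptr_t *dest) {
    do_call(m, TRAP);                    /* call .LIND3 */
    if (fix_shadow)
        m->shadow[m->ssp - 1] = target;  /* rdsspq + wrssq */
    m->stack[m->sp - 1] = target;        /* mov %rax, (%rsp) */
    return do_ret(m, dest);              /* ret */
}

/* Models __x86_return_thunk; fix_shadow selects the incsspq variant. */
static bool return_thunk(Machine *m, bool fix_shadow, uintptr_t *dest) {
    do_call(m, TRAP);                    /* call .LIND1 */
    if (fix_shadow)
        m->ssp--;                        /* incsspq with a count of 1 */
    m->sp--;                             /* lea 8(%rsp), %rsp */
    return do_ret(m, dest);              /* ret */
}

/* True if the indirect branch reaches REAL_TARGET without a mismatch. */
bool demo_indirect(bool fix_shadow) {
    Machine m = {0};
    uintptr_t dest;
    do_call(&m, F_RETURN);               /* f calls the thunk */
    return indirect_thunk(&m, REAL_TARGET, fix_shadow, &dest)
           && dest == REAL_TARGET;
}

/* True if the function return reaches CALLER without a mismatch. */
bool demo_return(bool fix_shadow) {
    Machine m = {0};
    uintptr_t dest;
    do_call(&m, CALLER);                 /* someone called f */
    return return_thunk(&m, fix_shadow, &dest) && dest == CALLER;
}
```

In this model, demo_indirect(false) and demo_return(false) report a mismatch because only the regular stack was edited, while the variants with fix_shadow set to true pass because the shadow stack receives the same edit.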
That was the good news. Here's the bad news:
Unlike endbr64, the CET instructions I used in the thunks aren't NOPs on CPUs that don't support CET (they result in SIGILL). This means you'd need two different sets of thunks, and you'd need to use CPU dispatch to pick the right ones depending on whether CET is available.
You could have __x86_indirect_thunk_rax check for the presence of the endbr64 instruction, but that's really inelegant and would probably be really slow.
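As a rough sketch of the detection half of such CPU dispatch, the following C code reads the CET feature bits from CPUID leaf 7, subleaf 0 (CET_SS is ECX bit 7, CET_IBT is EDX bit 20). The names CetSupport, decode_cet_bits, and query_cet_support are hypothetical. It assumes GCC or Clang, whose <cpuid.h> provides __get_cpuid_count. Note that CPU support is not the same as the OS having enabled shadow stacks for the current process, so a real dispatcher would presumably need an additional OS-level check that is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* Hypothetical result type for this sketch. */
typedef struct {
    bool shadow_stack;  /* CET_SS:  CPUID.(EAX=7,ECX=0):ECX bit 7  */
    bool ibt;           /* CET_IBT: CPUID.(EAX=7,ECX=0):EDX bit 20 */
} CetSupport;

/* Decode the CET feature bits from raw CPUID leaf 7 register values. */
CetSupport decode_cet_bits(uint32_t ecx, uint32_t edx) {
    CetSupport s;
    s.shadow_stack = ((ecx >> 7) & 1u) != 0;
    s.ibt = ((edx >> 20) & 1u) != 0;
    return s;
}

/* Query the running CPU; reports no support on non-x86 builds. */
CetSupport query_cet_support(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    /* __get_cpuid_count returns 0 if leaf 7 is not available. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        ecx = edx = 0;
#endif
    (void)eax;
    (void)ebx;
    return decode_cet_bits(ecx, edx);
}
```

A dispatcher could call query_cet_support() once at startup and select the shadow-stack-aware thunks only when shadow_stack is true, falling back to the plain retpoline thunks otherwise.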