Reason for (win) x64-calling convention restrictions in epilogues

I'm trying to understand the actual reasoning behind the restrictions that the x64 calling convention places on epilogues, at least on Windows. Quoting from MSDN:

The epilog code must follow a strict set of rules for the unwind code to reliably unwind through exceptions and interrupts. These rules reduce the amount of unwind data required, because no extra data is needed to describe each epilog. Instead, the unwind code can determine that an epilog is being executed by scanning forward through a code stream to identify an epilog.

I get what they are saying, but I don't understand the practical implications, at least for exceptions. Given the limited set of instructions that a valid epilogue can contain, I don't see what type of exceptions could even occur in the first place. If we look at the possible instructions in the epilogue, with their most complex example:

    lea      RSP, -128[R13]
    ; epilogue proper starts here
    add      RSP, fixed-allocation-size
    pop      R13
    pop      R14
    pop      R15
    ret

We can't really do anything that would result in an exception being thrown. pop could fault on a stack underflow, but if that happens we don't have a valid return address left, so no valid unwind could occur. The same holds if RSP points to an invalid memory address: we would once again not have a valid target for ret. So in such a scenario we would, at most, invoke the top-level SEH handler for the particular function that the epilogue belongs to.

Is the reason really just interrupts (which I have no knowledge or understanding of)? Or am I missing how an x64-compliant epilogue could trigger an exception? If the goal were just to avoid exceptions in epilogues, they wouldn't need the whole machinery of scanning the code stream to determine whether an exception occurred within an epilogue.

As for why I'm asking this: I'm doing some trickery for my native code's stack switching, which includes generating technically invalid epilogues (and prologues, to an extent). I'm trying to make sure that I don't run into unexpected side effects, as long as I make sure not to allow recoverable exceptions to occur there (for a normal user-mode application).

Solution

The key point to realize is that unwinding and stack walking are basically the same thing.
Stack walking is a non-destructive version of unwinding.
And both use the same underlying mechanism: function tables and unwind codes.

So the primary use case where you would encounter unwinding in epilogues is neither exceptions nor interrupts, but rather stack walks.

So if reliable stack traces (and therefore a good debugging / profiling experience) are important, then you should probably ensure that your epilogues follow the rules (or alternatively generate lots of extra unwind table entries to handle unwinding from your custom epilogues).

Imagine the following scenario:
You're debugging an app that hasn't responded in a while but is using a lot of CPU.
So you just hit Pause in your debugger to see what it is currently doing.
What does your debugger do in that case?

Well, most likely it suspends the thread, captures the content of the registers and then unwinds that context to capture each stack frame.

Crude example, probably doesn't work very well, just for illustration purposes:

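#include <windows.h>

// Assumes `thread` belongs to the current process and is not the calling thread;
// the leaf case below reads the suspended thread's stack directly.
void HaltAndDisplayStackTraceOfThread(HANDLE thread) {
    SuspendThread(thread);
    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_FULL; // request control + integer registers (Rip, Rsp, ...)
    GetThreadContext(thread, &ctx);

    UNWIND_HISTORY_TABLE table = {};
    while(ctx.Rip != 0) {
        // TODO: output ctx.Rip,
        //       search PDB files to get a fancy, user-friendly name for ctx.Rip,
        //       etc...

        DWORD64 base;
        RUNTIME_FUNCTION* fn = RtlLookupFunctionEntry(ctx.Rip, &base, &table);
        if(fn == nullptr) {
            // leaf function, does not have any unwind codes:
            // the return address sits at the top of the stack
            ctx.Rip = *(DWORD64*)ctx.Rsp;
            ctx.Rsp += 8;
        } else {
            // unwind the current frame; ctx is updated to the caller's context
            void* ignore1;
            DWORD64 ignore2;
            RtlVirtualUnwind(
                UNW_FLAG_NHANDLER,
                base,
                ctx.Rip,
                fn,
                &ctx,
                &ignore1,
                &ignore2,
                nullptr
            );
        }
    }
}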

The problem here is that we don't necessarily know where exactly the thread will stop executing - it might be within a function prologue, the main body of the function or somewhere within an epilogue.

Let's assume in your example the thread stopped here:

; .... actual function code goes here ...
add      RSP, fixed-allocation-size
pop      R13                             ; <= RIP points here
pop      R14
pop      R15
ret

Now RtlVirtualUnwind is in a rather awkward situation - RSP has already been modified by the add instruction so we can't use the normal unwind codes of the function (because those would tell us to add fixed-allocation-size again, which would mess up our stack).
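To make the arithmetic concrete, here is a toy model of that situation (a sketch, not real unwinder code). It assumes the prologue was push r15; push r14; push r13; sub rsp, 32, so fixed-allocation-size is taken to be 32. ToyOp, ToyUnwindCode, exampleCodes and applyUnwindCodes are made-up names for illustration; the real UNWIND_CODE layout is different.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-ins for the two unwind operations used by the example function.
enum class ToyOp { PushNonvol, AllocSmall };

struct ToyUnwindCode {
    ToyOp op;
    std::uint64_t size;  // bytes the matching prologue instruction took from RSP
};

// Assumed prologue: push r15; push r14; push r13; sub rsp, 32
std::vector<ToyUnwindCode> exampleCodes() {
    return {
        {ToyOp::PushNonvol, 8},
        {ToyOp::PushNonvol, 8},
        {ToyOp::PushNonvol, 8},
        {ToyOp::AllocSmall, 32},
    };
}

// Undo the whole prologue. The result is only the address of the return
// address if none of the prologue's effects have been undone yet.
std::uint64_t applyUnwindCodes(std::uint64_t rsp,
                               const std::vector<ToyUnwindCode>& codes) {
    for (const ToyUnwindCode& code : codes) {
        rsp += code.size;
    }
    return rsp;
}
```

Suppose the return address is stored at 0x1000. In the function body RSP is 0xFC8, and applyUnwindCodes(0xFC8, exampleCodes()) gives 0x1000 as expected. If the thread is sampled right after the add in the epilogue, RSP is already 0xFE8, and the same codes give 0x1020, which is 32 bytes past the real return address.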

So there are only 2 ways out of this:

1. Let RtlVirtualUnwind recognize when RIP lies within an epilogue, and manually simulate the remaining epilogue instructions to get RSP back to a known good value so that we can continue unwinding.
2. Add separate unwind codes for each instruction in each epilogue.
   E.g. in the case above we would need 5 separate unwind tables:
   - the normal function unwind table that covers the prologue and the main body of the function, in this case probably:
     [UWOP_PUSH_NONVOL, UWOP_PUSH_NONVOL, UWOP_PUSH_NONVOL, UWOP_ALLOC_SMALL]
   - a separate one for the case where RIP points after the add instruction in the epilogue: [UWOP_PUSH_NONVOL, UWOP_PUSH_NONVOL, UWOP_PUSH_NONVOL]
   - ... after pop R13: [UWOP_PUSH_NONVOL, UWOP_PUSH_NONVOL]
   - ... after pop R14: [UWOP_PUSH_NONVOL]
   - and finally, after pop R15: []
     (only the ret instruction is left, so we don't need to do any stack adjustments)
   That is not very space efficient - now we need an insane number of unwind tables to handle each instruction in the epilogues.
   (Note that we can't just add an "offset in epilogue" to UNWIND_CODE like it's done to handle unwinding from within a prologue - because there can be multiple epilogues per function)
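The first option can be sketched roughly as follows. This is a heavily simplified illustration, not the actual RtlVirtualUnwind logic: it only recognizes add rsp, imm8, pop r64 and a plain ret, and ignores the other forms a legal epilogue may use (lea, add rsp, imm32, jmp). simulateEpilogue is a hypothetical name, and the caller is assumed to pass an offset that lies on an instruction boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns how many bytes the rest of the epilogue (including the final ret)
// removes from the stack, or std::nullopt if the bytes at `ip` do not look
// like one of the supported epilogue tails.
std::optional<std::uint64_t> simulateEpilogue(const std::vector<std::uint8_t>& code,
                                              std::size_t ip) {
    assert(ip <= code.size());
    std::uint64_t adjust = 0;
    // True if at least n bytes remain at the current position.
    auto have = [&](std::size_t n) { return code.size() - ip >= n; };

    // Optional leading "add rsp, imm8" (48 83 C4 ib) with a non-negative imm8.
    if (have(4) && code[ip] == 0x48 && code[ip + 1] == 0x83 &&
        code[ip + 2] == 0xC4 && code[ip + 3] <= 0x7F) {
        adjust += code[ip + 3];
        ip += 4;
    }

    // Any number of "pop r64": 58+r, or a REX prefix (40..4F) followed by 58+r.
    for (;;) {
        if (have(1) && (code[ip] & 0xF8) == 0x58) {
            ip += 1;
        } else if (have(2) && (code[ip] & 0xF0) == 0x40 &&
                   (code[ip + 1] & 0xF8) == 0x58) {
            ip += 2;
        } else {
            break;
        }
        adjust += 8;
    }

    // The epilogue must end in "ret" (C3), which pops the return address.
    if (have(1) && code[ip] == 0xC3) {
        return adjust + 8;
    }
    return std::nullopt;
}
```

For the epilogue above with an assumed fixed-allocation-size of 0x20, the bytes are 48 83 C4 20, 41 5D, 41 5E, 41 5F, C3. Starting at the add (offset 0) the function reports 64 bytes, starting at pop R13 (offset 4) it reports 32, and starting at the ret (offset 10) it reports 8. For bytes that are not an epilogue, such as 48 89 C8 (mov rax, rcx), it returns std::nullopt and the unwinder would fall back to the normal unwind codes.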

The same thing also applies to stack-based profiling (which also stops threads at random points and samples the stack to get a stack trace), and probably a lot of other use cases that need stack traces.

I don't know much about the Windows kernel side, but I'd guess they most likely use stack-based sampling there as well for performance profiling.
(it's probably even easier in the kernel because you can just set a timer interrupt that periodically samples the stack)

That also explains why there's an unwind code for interrupt handlers (UWOP_PUSH_MACHFRAME). If you attempt to throw an exception from an interrupt handler, Windows just bluescreens with INTERRUPT_EXCEPTION_NOT_HANDLED, so you can never unwind out of an interrupt handler in practice.
But that unwind code still exists, because it is necessary to get a proper stack trace from within an interrupt handler.

For reference, kernrate appears to use an interrupt to sample the current stack.

References

In case you're interested in how the unwind algorithm works in practice, you can take a look at the Windows Research Kernel, which contains the full implementation of RtlVirtualUnwind:
WRK-v1.2 base/ntos/rtl/amd64/exdsptch.c
(InEpilogue is the boolean that tracks whether we're in an epilogue or not)

There's also a textual description of that algorithm available in the Microsoft documentation:
x64 exception handling - Unwind procedure