Why can't 64-bit Windows unwind the stack during an exception, if the stack crosses the kernel boundary - when 32-bit Windows can?
The context of this entire question comes from:
The case of the disappearing OnLoad exception – user-mode callback exceptions in x64
In 32-bit Windows, if i throw an exception in my user mode code, that was called back from kernel mode code, that was called from my user mode code, e.g:
User mode                    Kernel Mode
------------------           -------------------
CreateWindow(...);  ------>  NtCreateWindow(...)
                                    |
WindowProc  <-----------------------+
the Structured Exception Handling (SEH) in Windows can unwind the stack, unwinding back through kernel mode, back into my user code, where i can handle the exception and i see a valid stack trace.
64-bit editions of Windows cannot do this:
For complicated reasons, we cannot propagate the exception back on 64-bit operating systems (amd64 and IA64). This has been the case ever since the first 64-bit release of Server 2003. On x86, this isn’t the case – the exception gets propagated through the kernel boundary and would end up walking the frames back
And since there's no way to walk back a reliable stack trace in this case, they had to make a decision: let you see the nonsensical exception, or hide it altogether:
The kernel architects at the time decided to take the conservative AppCompat-friendly approach – hide the exception, and hope for the best.
The article goes on to say that this is how all 64-bit Windows operating systems behaved.
But starting with Windows 7 (and Windows Server 2008), the architects changed their minds - sort of. For only 64-bit applications (not 32-bit applications), they would (by default) stop suppressing these user-kernel-user exceptions. So, by default, on those systems, all 64-bit applications will see these exceptions, where they never used to see them.
In Windows 7, when a native x64 application crashes in this fashion, the Program Compatibility Assistant is notified. If the application doesn’t have a Windows 7 Manifest, we show a dialog telling you that PCA has applied an Application Compatibility shim. What does this mean? This means, that the next time you run your application, Windows will emulate the Server 2003 behavior and make the exception disappear. Keep in mind, that PCA doesn’t exist on Server 2008 R2, so this advice doesn’t apply.
The question is why is 64-bit Windows unable to unwind a stack back through a kernel transition, while 32-bit editions of Windows can?
The only hint is:
For complicated reasons, we cannot propagate the exception back on 64-bit operating systems (amd64 and IA64).
The hint is it's complicated.
i may not understand the explanation, as i'm not an operating system developer - but i'd like a shot at knowing why.
Microsoft has released a hotfix that enables 32-bit applications to also no longer have the exceptions suppressed:
KB976038: Exceptions that are thrown from an application that runs in a 64-bit version of Windows are ignored
- An exception that is thrown in a callback routine runs in the user mode.
In this scenario, this exception does not cause the application to crash. Instead, the application enters into an inconsistent state. Then, the application throws a different exception and crashes.
A user mode callback function is typically an application-defined function that is called by a kernel mode component. Examples of user mode callback functions are Windows procedures and hook procedures. These functions are called by Windows to process Windows messages or to process Windows hook events.
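To make the "inconsistent state" part concrete, here is a minimal toy model in plain Python. It does not call any Windows API: `App`, `on_create`, and `dispatch` are hypothetical names invented for illustration, and `dispatch` only mimics the idea of a callback whose exception is either filtered or allowed to propagate.

```python
class App:
    """Hypothetical application object; not a real Windows API."""

    def __init__(self):
        self.handle = None
        self.ready = False

    def on_create(self):
        """Stand-in for a window procedure handling a creation message."""
        self.handle = 42                       # first half of initialization
        raise RuntimeError("bug in callback")  # the second half never runs


def dispatch(callback, filter_exceptions):
    """Toy stand-in for the kernel calling back into user mode."""
    try:
        callback()
    except Exception:
        if not filter_exceptions:
            raise  # filter off: the caller sees the original exception
        # filter on: the exception is silently dropped


app = App()
dispatch(app.on_create, filter_exceptions=True)  # returns normally
print(app.handle, app.ready)
```

With the filter on, the call returns as if nothing happened, yet the object is only half initialized (`handle` is set, `ready` is still `False`). Any later failure would surface far from the real bug, which is the pattern the KB article describes.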
The hotfix then lets you stop Windows from eating the exceptions globally:
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options
    DisableUserModeCallbackFilter: DWORD = 1
or per-application:
HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\Notepad.exe
    DisableUserModeCallbackFilter: DWORD = 1
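For anyone who would rather script the registry change than edit it by hand, the following is a rough sketch using Python's standard `winreg` module. The helper names `ifeo_subkey` and `disable_callback_filter` are made up for this example; the key path and value name are the ones quoted above. Writing under HKLM is assumed to require an elevated process, and the write function only works on Windows.

```python
IFEO = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options"
VALUE_NAME = "DisableUserModeCallbackFilter"


def ifeo_subkey(exe_name=None):
    """Return the HKLM subkey for the global setting, or for one executable."""
    return IFEO if exe_name is None else IFEO + "\\" + exe_name


def disable_callback_filter(exe_name=None):
    """Write DisableUserModeCallbackFilter = 1 as a REG_DWORD (Windows only)."""
    import winreg  # standard library, but only importable on Windows

    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, ifeo_subkey(exe_name),
                            0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, 1)


if __name__ == "__main__":
    # Only print the target key; call disable_callback_filter() deliberately.
    print(ifeo_subkey("Notepad.exe"))
```

Calling `disable_callback_filter()` with no argument targets the global key, and `disable_callback_filter("Notepad.exe")` targets the per-application key.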
The behavior was also documented on XP and Server 2003 in KB973460.
i found another hint when investigating using xperf to capture stack traces on 64-bit Windows:
Disable Paging Executive
In order for tracing to work on 64-bit Windows you need to set the DisablePagingExecutive registry key. This tells the operating system not to page kernel mode drivers and system code to disk, which is a prerequisite for getting 64-bit call stacks using xperf, because 64-bit stack walking depends on metadata in the executable images, and in some situations the xperf stack walk code is not allowed to touch paged out pages. Running the following command from an elevated command prompt will set this registry key for you.
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
After setting this registry key you will need to reboot your system before you can record call stacks. Having this flag set means that the Windows kernel locks more pages into RAM, so this will probably consume about 10 MB of additional physical memory.
This gives the impression that in 64-bit Windows (and only in 64-bit Windows), you are not allowed to walk kernel stacks because the pages the stack walker needs might be paged out to disk.
I'm the developer who wrote this Hotfix a loooooooong time ago as well as the blog post. The main reason is that the full register file isn't always captured when you transition into kernel space, for performance reasons.
If you make a normal syscall, the x64 Application Binary Interface (ABI) only requires you to preserve the non-volatile registers (similar to making a normal function call). However, correctly unwinding the exception requires you to have all the registers, so it's not possible. Basically, this was a choice between perf in a critical scenario (i.e. a scenario that potentially happens thousands of times per second) vs. 100% correctly handling a pathological scenario (a crash).
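As a rough illustration of that trade-off, here is a toy model in Python. It is not how the kernel actually stores its trap frames; `capture_on_syscall` and `missing_for_unwind` are hypothetical helpers. The only real-world fact it relies on is the split of the integer registers into volatile and non-volatile sets under the Windows x64 calling convention.

```python
# Integer registers under the Windows x64 calling convention.
VOLATILE = {"rax", "rcx", "rdx", "r8", "r9", "r10", "r11"}
NONVOLATILE = {"rbx", "rbp", "rdi", "rsi", "rsp", "r12", "r13", "r14", "r15"}
ALL_REGISTERS = VOLATILE | NONVOLATILE


def capture_on_syscall(cpu_state):
    """Toy fast path: keep only the registers that must survive a call."""
    return {reg: val for reg, val in cpu_state.items() if reg in NONVOLATILE}


def missing_for_unwind(captured):
    """Registers an exact unwind would want but the capture does not hold."""
    return sorted(ALL_REGISTERS - set(captured))


# Give every register a distinct dummy value, then take the cheap capture.
cpu_state = {reg: index for index, reg in enumerate(sorted(ALL_REGISTERS))}
captured = capture_on_syscall(cpu_state)
print(missing_for_unwind(captured))
```

In this model the cheap capture loses every volatile register, so a context rebuilt from it is incomplete. Saving the full set on every transition would close the gap, at a cost paid on each of those thousands of calls per second.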