I have a pretty stupid question:
What happens if a kernel panic occurs while a kernel panic is already being handled?
Does the computer just crash? Where? In the "hardware"? In the BIOS? Does it just randomly execute whatever it happens to land on, potentially modifying things it isn't supposed to touch? And what if it happens yet again?
What does it look like? Is the display still able to show something, or does it lose its contents because some "buffer" no longer gets updated?
I'd love to paraphrase Fermat here and say that I've found a wonderful compendium of knowledge for you, but the margin of an SO answer is too narrow to contain it all ...
This is a complex question. First, the very premise, "kernel panic", is imprecise. What the term means depends on the operating system, and even within the same OS kernel there are different "kinds" of (fatal) aborts. To illustrate: most OS kernels allow kernel/driver code to explicitly call a function `panic(...)`. But there are also implied fault handlers that will "panic"; for example, a pagefault that occurs due to a `NULL` pointer dereference while executing kernel code is likely irrecoverable, so the pagefault handler aborts ("panics") when it identifies that condition. In addition, certain hardware knows "unhandleable" failure conditions (for which an operating system cannot register a code block to be called on occurrence), such as x86's triple fault; such faults may simply reset the CPU, through mechanisms that bypass the OS entirely.
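To make the "implied" case a bit more tangible, here is a deliberately simplified, hypothetical sketch (made-up names and types, not any particular kernel's code) of how a pagefault handler might end up panicking:

```c
/* Hypothetical sketch of a kernel page-fault handler -- names and types
 * are illustrative, not taken from any particular OS. */
struct trapframe { unsigned long ip; /* ... saved registers ... */ };

void pagefault_handler(struct trapframe *tf, unsigned long fault_addr)
{
    if (fault_came_from_user_mode(tf)) {
        /* Fault from user space: map the page or deliver a signal. */
        handle_user_fault(tf, fault_addr);
        return;
    }

    /* Fault from kernel mode: maybe it was an *expected* fault, e.g. a
     * user-copy helper with a registered fixup ("exception table") entry. */
    if (fixup_exception(tf))
        return;

    /* No fixup registered: kernel code dereferenced a bad pointer.
     * Nothing sensible is left to do -- this is the "implied" panic. */
    panic("unhandled kernel page fault at %lx, ip=%lx", fault_addr, tf->ip);
}
```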
Second, what would, or should, happen "during" a kernel panic? Without prejudice, let's assume a "simple" case: have the kernel log a hopefully informative diagnostic message, and then reset/reboot the system. That already means quite a few steps have to be performed by the `panic()` code:

- disable interrupts and stop the other CPUs, to "freeze" the system state as far as possible,
- collect the diagnostics (the panic message itself, a stack trace, register contents, ...),
- push that out through some channel (console, serial line, log buffer, crash dump),
- and finally trigger the reset/reboot.
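Purely as an illustration, here is a hypothetical sketch of such a sequence (again made-up names, not any real kernel's `panic()`):

```c
#include <stdarg.h>

/* Hypothetical, heavily simplified panic() sequence -- illustrative only. */
void panic(const char *fmt, ...)
{
    va_list args;

    local_irq_disable();          /* 1. stop reacting to interrupts on this CPU */
    stop_other_cpus();            /* 2. "freeze" the rest of the system          */

    va_start(args, fmt);
    emergency_vprintf(fmt, args); /* 3. print the diagnostic message ...         */
    va_end(args);
    print_backtrace();            /*    ... and a stack trace                    */

    maybe_write_crash_dump();     /* 4. optionally save a crash dump             */
    flush_console();              /* 5. give the output a chance to reach a human */

    machine_reset();              /* 6. reboot (or halt) the machine             */

    for (;;)                      /* should never be reached */
        ;
}
```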
All of these steps execute code, which can have bugs. Also, the "state freeze" may fail or complete only partially, and then code that was never designed to run concurrently (the panic path and, say, unrelated driver interrupt handlers) can end up running at the same time, interfering with each other. And, depending on the implementation, some of these steps may run "asynchronously" to `panic()`, i.e. not in the same thread context, or serially, for example after the "panic report" has finished. So additional bugs, or additional diagnostic messages, may intersperse the first report, but also follow it (as "secondaries", if you like to call them that).
And as mentioned, this is the "simple" case; an OS may want to take other measures intended to allow recovery, such as "fencing" a panic to the driver that triggered it and preventing that driver from loading again, in order to bring the operating system back "up" on a retry. How successful such strategies are, I cannot "speculatively quantify". Linux, for example, has a few kernel tunables related to "when to panic", and developers / kernel programmers may use them differently depending on what issues they encounter.
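For concreteness, some of those Linux knobs are plain sysctls (documented in the kernel's admin-guide sysctl documentation). The values below are example settings only, not recommendations:

```
# /etc/sysctl.d/99-panic.conf -- example values, pick your own policy

# reboot 30 seconds after a panic (0 = stay hung forever)
kernel.panic = 30
# escalate an "oops" to a full panic
kernel.panic_on_oops = 1
# do not escalate WARN() splats to panics
kernel.panic_on_warn = 0
```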
Can such errors "recurse" or "loop" ? You might get an unhandled pagefault while the pagefault error code tries to print a stacktrace for a previous NULL
ptr dereference, and then another series of pagefaults because the stack got corrupted, and then another pagefault for a non-mapped address because the recursion eventually overflows the kernel stack. At which point the handlers will hopefully switch to a separate stack (on x86, "double fault"), and if you're lucky, print you a message saying "kernel stack overflow on thread ..." or some such. Or not ... and also, even if the panic sequence succeeds and the system resets, on next reboot, the same issue may reoccur, the panic may hit again, print-diags, reset, rinse-repeat ... again not something unseen at all.
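Kernels typically try to defend against exactly this kind of recursion with a "panic already in progress" guard. A hypothetical sketch of the idea (made-up names; Linux does something conceptually similar with its `panic_cpu` bookkeeping):

```c
#include <stdatomic.h>

/* Hypothetical re-entrancy guard around panic() -- illustrative only. */
static atomic_int panic_in_progress;

void panic(const char *msg)
{
    /* If a panic is already being handled (recursively on this CPU, or on
     * another CPU), don't start a second full report: print one line and
     * go straight for the reset instead of recursing forever. */
    if (atomic_exchange(&panic_in_progress, 1)) {
        emergency_print("nested panic: ");
        emergency_print(msg);
        machine_reset();
        for (;;)
            ;                     /* not reached */
    }

    /* ... normal panic handling: freeze, print diagnostics, dump, reboot ... */
}
```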
Can such errors "hang" ? Yes, they can; code that called panic()
might hold resources such as locks that would be required by (later) parts of the panic handling, and if no specific measures are taken for panic code to "blast through" locks, a deadlock may occur. There are measures (to "break locks" when in-panic, to allow for recursive locking, or to retain a running watchdog timer to detect this and break out of it) that kernel code can take to mitigate this, but again it's operating system (and firmware) specific what is done in such cases. Linux, for example, knows a dozen different ways to "reboot" an x86 system, see reboot= kernel parameter
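A common shape for such a mitigation, sketched with made-up names below (Linux has machinery in a similar spirit, e.g. `bust_spinlocks()` / `oops_in_progress` making console output best-effort), is to simply stop insisting on locks once a panic is underway:

```c
/* Hypothetical "don't deadlock while panicking" pattern -- illustrative only. */
void console_write(const char *s)
{
    if (panic_in_progress()) {
        /* Best effort: try to take the lock, but print even if we can't.
         * Garbled output beats a silent deadlock at this point. */
        (void)spin_trylock(&console_lock);
        raw_console_output(s);
        return;
    }

    spin_lock(&console_lock);
    raw_console_output(s);
    spin_unlock(&console_lock);
}
```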
It's not unheard of that a "kernel panic" triggers a long, long series of kernel diagnostic messages for "secondary" panics that occurred while recording the first diagnostics, or while attempting to reboot the system. You may get a huge amount of log output; if, for example, shutting down the secondary (non-panicking) CPU cores failed, panic messages from threads running on several different CPU cores may even interleave.
In an ideal world, "at some point" a system reset would break through, the machine would reboot, and hopefully admins and/or developers could make sense of the diagnostics. In the real world, though, it's not unheard of that what alerts sysadmins to a machine "stuck panicking" is a monitoring event saying "I've had no metrics from this box for 20 minutes", or "> 1GB of kernel console logs written from this machine in the last 10 minutes", which then makes someone, a person or an external monitoring agent, take action.
There is a bit of "art" in troubleshooting such issues: looking for how it all started, where and how "the mess" began, instead of "just" blaming the hardware and de-racking the affected system. Whether that is necessary or useful in your environment I, again, cannot speculate on.
If you want to play with this yourself, the Linux kernel has several mechanisms that allow "injecting" panics, for example the "provoke crashes" facility (LKDTM) or the KASAN test facilities. Build a kernel with the corresponding options, boot it in QEMU, "and see" (whether you can combine them in some way to trigger multiple and/or recursive/nested panics). Have fun!
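As a quick starting point (assuming a kernel built with CONFIG_MAGIC_SYSRQ, inside a QEMU guest you don't mind killing), the SysRq "crash" trigger forces a deliberate `NULL` pointer dereference in kernel mode:

```
# inside the QEMU guest, as root -- this WILL crash the kernel
echo 1 > /proc/sys/kernel/sysrq     # make sure SysRq is enabled
echo c > /proc/sysrq-trigger        # force a crash/panic on purpose
```

Combine that with the panic-related sysctls above and the LKDTM/KASAN options, and you can watch the "secondary" panics for yourself.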
Or, in short, "it can be complicated" :-)