I am trying to improve my understanding of a SIGBUS error in Xwayland. This has been seen by several Fedora Linux users since around the 20th of February 2018, with Xwayland 1.19.6-5.fc27.x86_64
and Linux kernel 4.15.3-300.fc27.x86-64
.
Sadly I do not have the kernel "segfault" log message (or equivalent for SIGBUS). Xwayland has some pointless code which traps the fatal signal. But I can see siginfo
by debugging the coredump, and this seems to be nearly as good.
I understand that a "major page fault" occurs when a page of virtual memory is not available in RAM, and must be read from disk. I think I'm specifically interested in pages backed by a ext4 filesystem (e.g. no direct access to block devices) for this question.
Therefore a "minor page fault" is when no disk access is necessary. I assume the difference is fairly well-defined because Linux exposes counters for major and minor page faults.
If the kernel sends a program SIGBUS, I wonder if I should generally expect that this would have been a major page fault.
According to the coredump and disassembly, the program is reading memory when it receives SIGBUS, not writing it. The fault address in siginfo->si_addr
is within a mapped system executable, which is not writeable by the user, and the address seems within the bounds of the current file length. In fact when debugging the coredump, I have read very convincing values from the memory address. It seems the coredump generation process had no difficulty reading this address :-(.
I'm also confident in ruling out the "invalid address alignment" case (BUS_ADRALN), because siginfo->si_code
is 2, i.e. BUS_ADRERR, "non-existent physical address". Also because I'm on x86, which permits unaligned accesses in most cases, and the trap isn't in any SSE extended instruction.
I considered what the kernel is normally responsible for, when it handles a page fault which it determines is "minor". I suppose minor faults could fail to allocate memory, and hence raise SIGBUS. However, I believe I would have noticed such an allocation failure:
I have plenty of free swap to evict user pages to, and I did not notice the usual obvious slowdown that occurs when my system starts swapping. The crash happened a few seconds after waking a laptop from suspend to ram, which would not have been long enough to fill 8GB of swap even at ~100MB/s. Nor did I see the dread Out Of Memory (OOM) killer appear in kernel logs, as I would expect if the kernel failed allocating a page frame or page table.
Is there some other possibility that a minor page fault could have failed and caused the SIGBUS? I.e. is there some cause which I would not have noticed, when looking for errors in the kernel log? And which could have a quick onset?
Again, multiple coredumps are showing this as a page fault triggered by reading from a mapped file on the filesystem.
I would really like to have missed a case for minor page faults. Because the horrifying flipside of this is that I don't see how this SIGBUS could have been cause by the hard page fault side of things. Several of us users have very similar-looking errors, starting a few months ago. There is no IO error in my kernel logs. During normal operation, I have no IO errors when reading the indicated file. I have no errors when running rpm --verify --all
, or when running an extended SMART test on the HDD. Unfortunately I seem to have very few suspects. The closest suspect I have is a kernel upgrade, which I would obviously prefer to rule out; the dates don't exactly prove it but it's not entirely ruled out. Next closest in the dates is this years microcode update; this seems like it would be even harder to nail down.
But more generally, it could be any first access to a page. This is because it seems that the page tables for memory mappings are generally populated on-demand. (Which would be done by a page fault, and if the file page was already in cache, it would only be a minor page fault).
MAP_NONBLOCK (since Linux 2.5.46)
This flag is meaningful only in conjunction with MAP_POPULATE. Don't perform read-ahead: create page tables entries only for pages that are already present in RAM. Since Linux 2.6.23, this flag causes MAP_POPULATE to do nothing. One day, the combina‐ tion of MAP_POPULATE and MAP_NONBLOCK may be reimplemented.
A commenter asked for more concrete details, to clarify the faulting address and instruction. There are many excerpts in the initial link https://bugzilla.redhat.com/show_bug.cgi?id=1557682
The fault varies as described in the bug link. Here are fresh excerpts from a recent instance.
$ gdb 2018-03-21.core
...
Core was generated by `/usr/bin/Xwayland :0 -rootless -terminate -core -listen 4 -listen 5 -displayfd'.
Program terminated with signal SIGBUS, Bus error.
#0 _dl_fixup (l=0x7fc0be2e0130, reloc_arg=203) at ../elf/dl-runtime.c:73
73 const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];
[Current thread is 1 (Thread 0x7fc0be29fa80 (LWP 1918))]
(gdb) p $_siginfo.si_signum
$1 = 7
(gdb) p $_siginfo.si_code
$2 = 2
(gdb) p $_siginfo._sifields._sigfault.si_addr
$3 = (void *) 0x41bd80
(gdb) disassemble
Dump of assembler code for function _dl_fixup:
0x00007fc0be0c8bd0 <+0>: push %rbx
0x00007fc0be0c8bd1 <+1>: mov %rdi,%r10
0x00007fc0be0c8bd4 <+4>: mov %esi,%esi
0x00007fc0be0c8bd6 <+6>: lea (%rsi,%rsi,2),%rdx
0x00007fc0be0c8bda <+10>: sub $0x10,%rsp
0x00007fc0be0c8bde <+14>: mov 0x68(%rdi),%rax
0x00007fc0be0c8be2 <+18>: mov 0x8(%rax),%rdi
0x00007fc0be0c8be6 <+22>: mov 0xf8(%r10),%rax
0x00007fc0be0c8bed <+29>: mov 0x8(%rax),%rax
0x00007fc0be0c8bf1 <+33>: lea (%rax,%rdx,8),%r8
0x00007fc0be0c8bf5 <+37>: mov 0x70(%r10),%rax
=> 0x00007fc0be0c8bf9 <+41>: mov 0x8(%r8),%rcx
(gdb) p/x $r8
$4 = 0x41bd78
(gdb) p/x $r8 + 8
$5 = 0x41bd80
Note this instruction is fetching the value reloc->r_info
as per the highlighted source line.
(gdb) p reloc
$6 = (const Elf64_Rela * const) 0x41bd78
(gdb) p &reloc->r_info
$7 = (Elf64_Xword *) 0x41bd80
(gdb) p *reloc
$8 = {r_offset = 8443504, r_info = 936302870535, r_addend = 0}
The faulting address falls within the text mapping below (from maps
file captured by abrtd
):
00400000-0060b000 r-xp 00000000 fd:00 1708508 /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508 /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508 /usr/bin/Xwayland
$ size -x /usr/bin/Xwayland
text data bss dec hex filename
0x209ffb 0xbe9d 0x1f3e0 2314872 235278 /usr/bin/Xwayland
Thanks everyone for your support. It was indeed a transient IO error. It seems the SIGBUS read-fault path doesn't necessarily log anything in the kernel log, unlike the cases I'm used to seeing for IO errors.
https://marc.info/?l=linux-ide&m=152232081917215&w=2
v4.15 intermittent errors on suspend/resume
To anyone waiting for the other show to drop on the SATA LPM work...
I've found something that's at least in the same area. It triggered a fsck on my system 2 days ago. Evidence suggests it's occurred on many other machines. I felt that was reason enough to give you a heads up :).
I checked and I don't seem to have LPM enabled during runtime, even when running on battery. My errors are all on suspend/resume, so maybe that behaviour was changed at the same time?
It doesn't always show in kernel logs. What I first noticed was a mysterious SIGBUS that kills Xwayland (and hence the entire Gnome session) on resume from suspend. It surprised me to learn that this SIGBUS can happen, without leaving anything like the read errors I'm used to seeing in the kernel log!
My coredumps show the SIGBUS fault address is an instruction read inside the program code of Xwayland. The backtraces vary along the same call chain - the common factor is that they're always at the first instruction of the function. I assume it varies according to which page is not currently in-core, and hence triggers the failing read request.
There are hundreds of backtraces along this same call chain from other users, reported automatically to Fedora, that look the same. At least so far we don't have any more plausible for them. I admit it's funny that Xwayland is so prominent, and I haven't been swamped with SIGBUS in other processes, but I stand by this analysis.
These crashes started within 24 hours of Fedora upgrading to kernel v4.15.
Fedora bug for the Xwayland SIGBUS: https://bugzilla.redhat.com/show_bug.cgi?id=1553979
My duplicate bug I've been spamming with puzzled comments: https://bugzilla.redhat.com/show_bug.cgi?id=1557682
The earliest and biggest of the many crash report buckets:
- [2018-02-17] https://retrace.fedoraproject.org/faf/reports/2049964/
- [315 reports] https://retrace.fedoraproject.org/faf/reports/2055378/
EXT4 filesystem error:
Mar 27 11:28:30 alan-laptop kernel: PM: suspend exit ... Mar 27 11:28:30 alan-laptop kernel: EXT4-fs error (device dm-2): ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0 Mar 27 11:28:30 alan-laptop kernel: Buffer I/O error on dev dm-2, logical block 0, lost sync page write (this marked the FS as needing fsck on next boot)
More frequently, it logs these swap errors:
Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ... Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184) Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
My laptop LPM status, even after removing AC power:
$ head /sys/class/scsi_host/host*/link_power_management_policy ==> /sys/class/scsi_host/host0/link_power_management_policy <== max_performance ==> /sys/class/scsi_host/host1/link_power_management_policy <== max_performance
My laptop is a Dell Lattitude E5450. CPU is i5-5300U (a Broadwell).