Can I rule out that SIGBUS is raised by a "minor page fault"? (Kernel log has no allocation failure)

Asked 25/3, 2018 at 15:11 Answered 29/3, 2018 at 11:4

Motivation

I am trying to improve my understanding of a SIGBUS error in Xwayland. This has been seen by several Fedora Linux users since around the 20th of February 2018, with Xwayland 1.19.6-5.fc27.x86_64 and Linux kernel 4.15.3-300.fc27.x86-64.

Sadly I do not have the kernel "segfault" log message (or equivalent for SIGBUS). Xwayland has some pointless code which traps the fatal signal. But I can see siginfo by debugging the coredump, and this seems to be nearly as good.

Definition

I understand that a "major page fault" occurs when a page of virtual memory is not available in RAM, and must be read from disk. I think I'm specifically interested in pages backed by a ext4 filesystem (e.g. no direct access to block devices) for this question.

Therefore a "minor page fault" is when no disk access is necessary. I assume the difference is fairly well-defined because Linux exposes counters for major and minor page faults.

My question

If the kernel sends a program SIGBUS, I wonder if I should generally expect that this would have been a major page fault.

According to the coredump and disassembly, the program is reading memory when it receives SIGBUS, not writing it. The fault address in siginfo->si_addr is within a mapped system executable, which is not writeable by the user, and the address seems within the bounds of the current file length. In fact when debugging the coredump, I have read very convincing values from the memory address. It seems the coredump generation process had no difficulty reading this address :-(.

I'm also confident in ruling out the "invalid address alignment" case (BUS_ADRALN), because siginfo->si_code is 2, i.e. BUS_ADRERR, "non-existent physical address". Also because I'm on x86, which permits unaligned accesses in most cases, and the trap isn't in any SSE extended instruction.

I considered what the kernel is normally responsible for, when it handles a page fault which it determines is "minor". I suppose minor faults could fail to allocate memory, and hence raise SIGBUS. However, I believe I would have noticed such an allocation failure:

I have plenty of free swap to evict user pages to, and I did not notice the usual obvious slowdown that occurs when my system starts swapping. The crash happened a few seconds after waking a laptop from suspend to ram, which would not have been long enough to fill 8GB of swap even at ~100MB/s. Nor did I see the dread Out Of Memory (OOM) killer appear in kernel logs, as I would expect if the kernel failed allocating a page frame or page table.

Is there some other possibility that a minor page fault could have failed and caused the SIGBUS? I.e. is there some cause which I would not have noticed, when looking for errors in the kernel log? And which could have a quick onset?

Again, multiple coredumps are showing this as a page fault triggered by reading from a mapped file on the filesystem.

Ulterior motivation

I would really like to have missed a case for minor page faults. Because the horrifying flipside of this is that I don't see how this SIGBUS could have been cause by the hard page fault side of things. Several of us users have very similar-looking errors, starting a few months ago. There is no IO error in my kernel logs. During normal operation, I have no IO errors when reading the indicated file. I have no errors when running rpm --verify --all, or when running an extended SMART test on the HDD. Unfortunately I seem to have very few suspects. The closest suspect I have is a kernel upgrade, which I would obviously prefer to rule out; the dates don't exactly prove it but it's not entirely ruled out. Next closest in the dates is this years microcode update; this seems like it would be even harder to nail down.

Known causes of minor page faults

Logically, it sounds like minor page faults occur when implementing copy-on-write for MAP_PRIVATE mappings.
It should also include read faults on /dev/zero or MAP_ANONYMOUS, assuming a kernel did not implement them as reading a shared zero page and did not implement them to allocate pages for the entire mapping immediately.
But more generally, it could be any first access to a page. This is because it seems that the page tables for memory mappings are generally populated on-demand. (Which would be done by a page fault, and if the file page was already in cache, it would only be a minor page fault).

MAP_NONBLOCK (since Linux 2.5.46)

This flag is meaningful only in conjunction with MAP_POPULATE. Don't perform read-ahead: create page tables entries only for pages that are already present in RAM. Since Linux 2.6.23, this flag causes MAP_POPULATE to do nothing. One day, the combina‐ tion of MAP_POPULATE and MAP_NONBLOCK may be reimplemented.

EDIT: Further excerpts detailing the above

A commenter asked for more concrete details, to clarify the faulting address and instruction. There are many excerpts in the initial link https://bugzilla.redhat.com/show_bug.cgi?id=1557682

The fault varies as described in the bug link. Here are fresh excerpts from a recent instance.

$ gdb 2018-03-21.core
...
Core was generated by `/usr/bin/Xwayland :0 -rootless -terminate -core -listen 4 -listen 5 -displayfd'.
Program terminated with signal SIGBUS, Bus error.
#0  _dl_fixup (l=0x7fc0be2e0130, reloc_arg=203) at ../elf/dl-runtime.c:73
73    const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)];
[Current thread is 1 (Thread 0x7fc0be29fa80 (LWP 1918))]
(gdb) p $_siginfo.si_signum
$1 = 7
(gdb) p $_siginfo.si_code
$2 = 2
(gdb) p $_siginfo._sifields._sigfault.si_addr
$3 = (void *) 0x41bd80
(gdb) disassemble
Dump of assembler code for function _dl_fixup:
   0x00007fc0be0c8bd0 <+0>: push   %rbx
   0x00007fc0be0c8bd1 <+1>: mov    %rdi,%r10
   0x00007fc0be0c8bd4 <+4>: mov    %esi,%esi
   0x00007fc0be0c8bd6 <+6>: lea    (%rsi,%rsi,2),%rdx
   0x00007fc0be0c8bda <+10>:    sub    $0x10,%rsp
   0x00007fc0be0c8bde <+14>:    mov    0x68(%rdi),%rax
   0x00007fc0be0c8be2 <+18>:    mov    0x8(%rax),%rdi
   0x00007fc0be0c8be6 <+22>:    mov    0xf8(%r10),%rax
   0x00007fc0be0c8bed <+29>:    mov    0x8(%rax),%rax
   0x00007fc0be0c8bf1 <+33>:    lea    (%rax,%rdx,8),%r8
   0x00007fc0be0c8bf5 <+37>:    mov    0x70(%r10),%rax
=> 0x00007fc0be0c8bf9 <+41>:    mov    0x8(%r8),%rcx
(gdb) p/x $r8
$4 = 0x41bd78
(gdb) p/x $r8 + 8
$5 = 0x41bd80

Note this instruction is fetching the value reloc->r_info as per the highlighted source line.

(gdb) p reloc
$6 = (const Elf64_Rela * const) 0x41bd78
(gdb) p &reloc->r_info
$7 = (Elf64_Xword *) 0x41bd80
(gdb) p *reloc
$8 = {r_offset = 8443504, r_info = 936302870535, r_addend = 0}

The faulting address falls within the text mapping below (from maps file captured by abrtd):

00400000-0060b000 r-xp 00000000 fd:00 1708508                            /usr/bin/Xwayland
0080a000-0080d000 r--p 0020a000 fd:00 1708508                            /usr/bin/Xwayland
0080d000-00817000 rw-p 0020d000 fd:00 1708508                            /usr/bin/Xwayland

$ size -x /usr/bin/Xwayland
   text    data     bss     dec     hex filename
0x209ffb     0xbe9d 0x1f3e0 2314872  235278 /usr/bin/Xwayland

Glendoraglendower answered 25/3, 2018 at 15:11 Comment(16)

You seem to be operating under a misconception. Neither minor nor major page faults cause SIGBUS. Minor and major page faults are both handled entirely within the kernel and do not send any signal at all. SIGBUS and SIGSEGV only happen when the kernel can't fill in a missing page itself. – Galah 25/3, 2018 at 15:18

Given your problem description, I think the most likely explanation is a kernel bug, and the second most likely explanation is a transient I/O error (temporary failure to reread page from disk) but that should have left a trace in the logs. Your first priority should be to find a reliable reproducer - this is an intermittent failure, yes? – Galah 25/3, 2018 at 15:20

@Galah this is an intermittent failure when resuming from suspend. I haven't noticed a pattern that gets me to 100% reproduction unfortunately. – Glendoraglendower 25/3, 2018 at 15:23

@Galah I deliberately started with the phrase "SIGBUS is raised by", i.e. within, the page fault. Might have slipped up somewhere in this writeup though. – Glendoraglendower 25/3, 2018 at 15:25

The thing is there's only one handler for the CPU's "page fault" exception. I am not familiar enough with Linux's guts to know whether it makes sense to talk about a SIGBUS being raised from within a control-flow path that would, if not for some problem, normally lead to the exception being counted as a major or minor fault, but I am inclined to think that this is not the right debugging strategy to take. – Galah 25/3, 2018 at 15:28

Related: Debugging SIGBUS on x86 Linux: you can get it from non-canonical pointers on x86-64 (upper 16 bits isn't the sign extension of the low 48), or from out-of-bounds mmap. Possible duplicate, I think. What instruction did cause sigbus, and what values were in the registers it used for an addressing mode? – Roesler 25/3, 2018 at 21:37

x86, which doesn't care about alignment in most cases I beg to differ. How about an unaligned acces crossing a page boundary? – Endbrain 25/3, 2018 at 23:18

@PeterCordes thanks, the one answer there has really nice details. My faulting addresses are canonical AND not via %rsp or %rbp. I have edited in an example with gory detail. – Glendoraglendower 25/3, 2018 at 23:21

@Endbrain editted it, is it better? – Glendoraglendower 25/3, 2018 at 23:26

@wildplasser: into an unmapped page, you mean? That would give you SIGSEGV. And that's not "most cases", and can only happen when manually vectorizing unsafely. If the object you're loading crosses a page boundary, then both pages much be mapped. It can be a perf problem (especially on CPUs before Skylake), but not a correctness problem. – Roesler 25/3, 2018 at 23:30

I mean: reading an int @ page+4094 could cause a pagefault--> pagedin -> buserror. – Endbrain 25/3, 2018 at 23:41

@sourcejedi: That's super weird. mov 0x8(%r8),%rcx looks normal, and r8 is 8-byte aligned (not that it matters), and pointing to a valid part of the mapping. Did you check your kernel log for I/O errors? Have you verified the XWayland file itself to see if it reads ok? Maybe try copying it to /tmp and running from there to avoid problems with a bad disk. Does XWayland fault every time, or just occasionally? – Roesler 25/3, 2018 at 23:41

@wildplasser: not sure what you mean. A page-split load will SIGSEGV if either page is unmapped. A load from an invalid part of a file-backed mapping will SIGBUS whether it's a page-split or not. So I don't see why a page-split load would be special. They can be slow but x86 fully supports them in HW. – Roesler 25/3, 2018 at 23:43

I rest my case. – Endbrain 25/3, 2018 at 23:56

Is it possible that some(one,thing) truncated Xwayland while it was running? That can result in a SIGBUS. – Sommelier 26/3, 2018 at 1:53

@PeterCordes I looked for relevant errors in my kernel log including IO errors, and used rpm --verify to verify all system files and Xwayland specifically against package checksums, as specifically mentioned; I don't see anything. It appears to fault about 10ms after the system wakes from suspend, but not every time. @Sommelier I think it is extremely unlikely that a privileged process is truncating /usr/bin/Xwayland, and then resetting the size with the correct data before I run gdb on the coredump. – Glendoraglendower 26/3, 2018 at 8:52

Thanks everyone for your support. It was indeed a transient IO error. It seems the SIGBUS read-fault path doesn't necessarily log anything in the kernel log, unlike the cases I'm used to seeing for IO errors.

https://marc.info/?l=linux-ide&m=152232081917215&w=2

v4.15 intermittent errors on suspend/resume

To anyone waiting for the other show to drop on the SATA LPM work...

I've found something that's at least in the same area. It triggered a fsck on my system 2 days ago. Evidence suggests it's occurred on many other machines. I felt that was reason enough to give you a heads up :).

I checked and I don't seem to have LPM enabled during runtime, even when running on battery. My errors are all on suspend/resume, so maybe that behaviour was changed at the same time?

It doesn't always show in kernel logs. What I first noticed was a mysterious SIGBUS that kills Xwayland (and hence the entire Gnome session) on resume from suspend. It surprised me to learn that this SIGBUS can happen, without leaving anything like the read errors I'm used to seeing in the kernel log!

My coredumps show the SIGBUS fault address is an instruction read inside the program code of Xwayland. The backtraces vary along the same call chain - the common factor is that they're always at the first instruction of the function. I assume it varies according to which page is not currently in-core, and hence triggers the failing read request.

There are hundreds of backtraces along this same call chain from other users, reported automatically to Fedora, that look the same. At least so far we don't have any more plausible for them. I admit it's funny that Xwayland is so prominent, and I haven't been swamped with SIGBUS in other processes, but I stand by this analysis.

These crashes started within 24 hours of Fedora upgrading to kernel v4.15.

Fedora bug for the Xwayland SIGBUS: https://bugzilla.redhat.com/show_bug.cgi?id=1553979

My duplicate bug I've been spamming with puzzled comments: https://bugzilla.redhat.com/show_bug.cgi?id=1557682

The earliest and biggest of the many crash report buckets:

[2018-02-17] https://retrace.fedoraproject.org/faf/reports/2049964/

[315 reports] https://retrace.fedoraproject.org/faf/reports/2055378/

EXT4 filesystem error:
Mar 27 11:28:30 alan-laptop kernel: PM: suspend exit
...
Mar 27 11:28:30 alan-laptop kernel: EXT4-fs error (device dm-2):  ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Mar 27 11:28:30 alan-laptop kernel: Buffer I/O error on dev dm-2, logical block 0, lost sync page write
(this marked the FS as needing fsck on next boot)
More frequently, it logs these swap errors:
Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
My laptop LPM status, even after removing AC power:
$ head /sys/class/scsi_host/host*/link_power_management_policy
==> /sys/class/scsi_host/host0/link_power_management_policy <==
max_performance

==> /sys/class/scsi_host/host1/link_power_management_policy <==
max_performance
My laptop is a Dell Lattitude E5450. CPU is i5-5300U (a Broadwell).

Glendoraglendower answered 29/3, 2018 at 11:4 Comment(0)

I certainly have some bug in the kernel, unless it is a bug in the kernel selftests.

EDIT: hmm, actually it seems others have noticed the GS selftests failure recently as well, but it was already present in older kernels, and also appears on AMD cpus. There does not appear to be a conclusion regarding how to fix it at the moment. https://lkml.org/lkml/2018/1/26/436

So it's not this bug on it's own, though I can't rule out that this GS bug causes more prominent breakage when PTI is enabled or something.

$ uname -r
4.15.10-300.fc27.x86_64

$ git describe --all
heads/4.15.10
$ cat ./Documentation/x86/pti.txt
...
2. Run several copies of all of the tools/testing/selftests/x86/ tests
   (excluding MPX and protection_keys) in a loop on multiple CPUs for
   several minutes.  These tests frequently uncover corner cases in the
   kernel entry code.  In general, old kernels might cause these tests
   themselves to crash, but they should never crash the kernel.

$ cd tools/testing/selftests/x86
$ make
...

In 4x terminals to match my 4x hardware threads:

sh -c ' while true; do for i in *; do if test -x $i; then ./$i || exit; fi ; done; done '

Failures quickly appear:

[RUN]   ARCH_SET_GS(0x200000000), then schedule to 0x200000000
    Before schedule, set selector to 0x3
    other thread: ARCH_SET_GS(0x200000000) -- sel is 0x0
[FAIL]  GS/BASE changed from 0x3/0x0 to 0x0/0x0

Also

[RUN]   Executing 6-argument 32-bit syscall via VDSO
[WARN]  Flags before=0000000000200ed7 id 0 00 o d i s z 0 a 0 p 1 c
[WARN]  Flags  after=0000000000200682 id 0 00 d i s 0 0 1 
[WARN]  Flags change=0000000000000855 0 00 o z 0 a 0 p 0 c
[OK]    Arguments are preserved across syscall
[NOTE]  R11 has changed:0000000000200682 - assuming clobbered by SYSRET insn
[OK]    R8..R15 did not leak kernel data
[RUN]   Executing 6-argument 32-bit syscall via INT 80
[OK]    Arguments are preserved across syscall
[OK]    R8..R15 did not leak kernel data
[RUN]   Running tests under ptrace
[RUN]   Executing 6-argument 32-bit syscall via VDSO
[WARN]  Flags before=0000000000200ed7 id 0 00 o d i s z 0 a 0 p 1 c
[WARN]  Flags  after=0000000000200686 id 0 00 d i s 0 0 p 1 
[WARN]  Flags change=0000000000000851 0 00 o z 0 a 0 0 c
[OK]    Arguments are preserved across syscall
[NOTE]  R11 has changed:0000000000200686 - assuming clobbered by SYSRET insn
[OK]    R8..R15 did not leak kernel data
[RUN]   Executing 6-argument 32-bit syscall via INT 80
[OK]    Arguments are preserved across syscall
[OK]    R8..R15 did not leak kernel data
Warning: failed to find getcpu in vDSO
[RUN]   Testing getcpu...
[OK]    CPU 0: syscall: cpu 0, node 0
[OK]    CPU 1: syscall: cpu 1, node 0
[OK]    CPU 2: syscall: cpu 2, node 0
[OK]    CPU 3: syscall: cpu 3, node 0
[RUN]   Testing getcpu...
[OK]    CPU 0: syscall: cpu 0, node 0 vdso: cpu 0, node 0 vsyscall: cpu 0, node 0
[OK]    CPU 1: syscall: cpu 1, node 0 vdso: cpu 1, node 0 vsyscall: cpu 1, node 0
[OK]    CPU 2: syscall: cpu 2, node 0 vdso: cpu 2, node 0 vsyscall: cpu 2, node 0
[OK]    CPU 3: syscall: cpu 3, node 0 vdso: cpu 3, node 0 vsyscall: cpu 3, node 0
[NOTE]  failed to find getcpu in vDSO
[RUN]   test gettimeofday()
    vDSO time offsets: 0.000006 0.000000
[OK]    vDSO gettimeofday()'s timeval was okay
[RUN]   test time()
[FAIL]  vDSO returned the wrong time (1522063297 1522063296 1522063297)

Glendoraglendower answered 26/3, 2018 at 11:32 Comment(7)

Is this an answer to your own question? Or additional information that should be added to the question itself? – Aureliaaurelian 26/3, 2018 at 12:43

@DaanTimmer: Looks like an answer to me. A kernel bug could make the kernel think a pagefault for a file-backed mapping should SIGBUS instead of fetching the file from disk. The OP says this happens after suspend, so pagecache was flushed and access to the .text or .rodata of the process will cause a hard page fault that needs to page in data from the file. Looks like the kernel thinks that part of the file isn't there, and fires a SIGBUS instead. – Roesler 26/3, 2018 at 12:51

@DaanTimmer not sure :(. I thought this helped show that there was a bug in the recent kernel PTI changes, which would have been a major suspect. However the LKML thread claims this bug is older than PTI. So it is not very helpful in showing a root cause. – Glendoraglendower 26/3, 2018 at 13:50

@PeterCordes page cache is not flushed in suspend-to-ram. – Glendoraglendower 26/3, 2018 at 13:50

I'm curious how flushing the page cache prior to suspending the system affects this issue. That might be useful in tracking it down. – Batton 26/3, 2018 at 14:3

@AndrewHenle good point. I tried sync; echo 3 | sudo tee /proc/sys/vm/drop_caches. It seems to have little effect on sudo page-types -f /usr/bin/Xwayland or mincore /usr/bin/Xwayland, even when run immediately after. It only drops 16 pages, and page-types shows that they were the ones that were not mmap()'ed. So it does not seem that it can affect this crash. – Glendoraglendower 26/3, 2018 at 15:4

@PeterCordes I was just wondering, hence the question for clarification :-) As the first message stated "I certainly have some bug in the kernel, unless it is a bug in the kernel selftests." with some additional information, which looks more like additional info than an answer. But later edits have made it more clearly an answer ;-) – Aureliaaurelian 27/3, 2018 at 15:6