Why are segfaults called faults (and not aborts) if they are not recoverable?

My understanding of the terminology is as follows:

1) An interrupt
is "a notification" that is initiated by the hardware to call the OS to run its handlers

2) A trap
is "a notification" that is initiated by the software to call the OS to run its handlers

3) A fault
is an exception that is raised by the processor if an error has occurred but it is recoverable

4) An abort
is an exception that is raised by the processor if an error has occurred but it is non-recoverable

Why do we call it a segmentation fault and not a segmentation abort then?

A segmentation fault
is when your program attempts to access memory that it has either not been assigned by the operating system or is otherwise not allowed to access.

My experience (primarily while testing C code) is that anytime a program throws a segmentation fault it is back to the drawing board - is there a scenario where the programmer can actually "catch" the exception and do something useful with it?

Linis answered 21/3, 2018 at 0:38 Comment(1)
I see nothing in those definitions that states that a fault should be recoverable. It mentions that traps are but leaves the question of faults unresolved. In terms of what they're called, it's just a naming thing, you may as well ask why some people distinguish between ketchup, catsup and sauce :-) - Jagannath

At a CPU level, modern OSes don't use x86 segment limits for memory protection. (And in fact they couldn't even if they wanted to in long mode (x86-64); segment base is fixed at 0 and limit at -1).

OSes use virtual memory page tables, so the real CPU exception on an out-of-bounds memory access is a page fault.

x86 manuals call this a #PF(fault-code) exception, e.g. see the list of exceptions that the add instruction can raise. Fun fact: the x86 exception for access outside of a segment limit is #GP(0).

It's up to the OS's page-fault handler to decide how to handle it. Many #PF exceptions happen as part of normal operation:

  • copy-on-write mapping got written: copy the page and mark it writeable in the page table, then return to user-space to retry the instruction that faulted. (This is a type of "soft" aka "minor" page fault.)
  • other soft page fault, e.g. the kernel was lazy and didn't actually have the page table updated to reflect the mappings the process made. (e.g. mmap(2) without MAP_POPULATE).
  • hard page fault: find some physical memory and read the page's contents from disk (from the file for a file mapping, or from the swap file/partition for anonymous pages).

After sorting out any of the above, update the page table that the CPU reads on its own, and invalidate that TLB entry if necessary. (e.g. valid but read-only changed to valid + read-write).

Only if the kernel finds that the process really doesn't logically have anything mapped to that address (or that it's a write to a read-only mapping) will the kernel deliver a SIGSEGV to the process. This is purely a software thing, after sorting out the cause of the hardware exception.


The English text for SIGSEGV (from strsignal(3)) is "Segmentation fault" on all Unix/Linux systems, so that's what's printed (by the shell) when a child process dies from that signal.
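To see where that message comes from, here is a small sketch of what a shell does, assuming a POSIX system. The function names child_death_signal and demo are hypothetical. The parent learns from waitpid(2) that the child was killed by a signal and looks up the text with strsignal(3), the call that maps a signal number to its description.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks a child that dies from SIGSEGV and returns the
 * terminating signal number as seen by the parent (0 if the child exited
 * normally, -1 on error). */
int child_death_signal(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);   /* avoid leaving a core file */
        signal(SIGSEGV, SIG_DFL);           /* make sure nothing catches it */
        raise(SIGSEGV);                     /* default action: terminate */
        _exit(0);                           /* not reached */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Prints the kind of message a shell prints for such a child. */
void demo(void) {
    int sig = child_death_signal();
    assert(sig == SIGSEGV);
    printf("child killed by signal %d: %s\n", sig, strsignal(sig));
}
```

Note that the child here never touches bad memory at all: the message only reflects which signal terminated the process, not what the hardware did.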

This term is well understood, so it persists even though it mostly only exists for historical reasons and hardware doesn't use segmentation.

Note that you also get a SIGSEGV for stuff like trying to execute privileged instructions in user-space (like wbinvd or wrmsr (write model-specific register)). At a CPU level, the x86 exception is #GP(0) for privileged instructions when you're not in ring 0 (kernel mode).

You also get a SIGSEGV for misaligned SSE instructions (like movaps), although some Unixes on other platforms send SIGBUS for misaligned-access faults (e.g. Solaris on SPARC).


Why do we call it a segmentation fault and not a segmentation abort then?

It is recoverable. It doesn't crash the whole machine / kernel, it just means that a user-space process tried to do something that the kernel doesn't allow.

Even for the process that segfaulted it can be recoverable. This is why it's a catchable signal, unlike SIGKILL. Usually you can't just resume execution, but you can usefully record where the fault was (e.g. print a precise exception error message and even a stack backtrace).

The signal handler for SIGSEGV could longjmp or whatever. Or if the SIGSEGV was expected, then modify the code or the pointer used for the load, before returning from the signal handler. (e.g. for a Meltdown exploit, although there are much more efficient techniques that do the chained loads in the shadow of a mispredict or something else that suppresses the exception, instead of actually letting the CPU raise an exception and catching the SIGSEGV the kernel delivers)

Most programming languages (other than assembly) aren't low-level enough to give well defined behaviour when optimizing around an access that might segfault in a way that would let you write a handler that recovers. This is why usually you don't do anything more than print an error message (and maybe a stack backtrace) in a SIGSEGV handler if you install one at all.


Some JIT compilers for sandboxed languages (like JavaScript) use hardware memory access checks to eliminate NULL pointer checks. In the normal case there's no fault, so it doesn't matter how slow the faulting case is.

A JVM can turn a SIGSEGV received by one of its threads into a NullPointerException for the Java code it's running, without any problems for the JVM.

A further trick is to put the end of an array at the end of a page (followed by a large-enough unmapped region), so bounds-checking on every access is done for free by the hardware. If you can statically prove the index is never negative, and that it can't be wider than 32 bits, you're all set.


Trap vs. abort

I don't think there's standard terminology to make that distinction. It depends on what kind of recovery you're talking about. Obviously the OS can keep running after anything user-space can make the hardware do, otherwise unprivileged user-space could crash the machine.

Related: in an answer to "When an interrupt occurs, what happens to instructions in the pipeline?", Andy Glew (CPU architect who worked on Intel's P6 microarchitecture) says "trap" is basically any interrupt that's caused by the code that's running (rather than an external signal), and happens synchronously. (e.g. when a faulting instruction reaches the retirement stage of the pipeline without an earlier branch-mispredict or other exception being detected first).

"Abort" isn't standard CPU-architecture terminology. Like I said, you want the OS to be able to continue no matter what, and only hardware failure or kernel bugs normally prevent that.

AFAIK, "abort" is not very standard operating-systems terminology either. Unix has signals, and some of them are uncatchable (like SIGKILL and SIGSTOP), but most can be caught.

SIGABRT can be caught by a signal handler. The process exits if the handler returns, so if you don't want that you can longjmp out of it. But AFAIK no error condition raises SIGABRT; it's only sent manually by software, e.g. by calling the abort() library function. (It often results in a stack backtrace.)


x86 exception terminology

If you look at x86 manuals or this exception table on the osdev wiki, there are specific meanings in this context (thanks to @MargaretBloom for the descriptions):

  • trap: raised after an instruction successfully completed, the return address points after the trapping inst. #DB debug and #OF overflow (into) exceptions are traps. (Some sources of #DB are faults instead.) But int 0x80 or other software interrupt instructions are also traps, as is syscall (but it puts the return address in rcx instead of pushing it; syscall is not an exception, and thus not really a trap in this sense)

  • fault: raised after an attempted execution is made and then rolled back; the return address points to the faulting instruction. (Most exception types are faults)

  • abort is when the return address points to an unrelated location (i.e. for #DF double-fault and #MC machine-check). Triple fault can't be handled; it's what happens when the CPU hits an exception trying to run the double-fault handler, and really does stop the whole CPU.

Note that even Intel CPU architects like Andy Glew sometimes use the term "trap" more generally, I think meaning any synchronous exception, when discussing computer-architecture theory. Don't expect people to stick to the above terminology unless you're actually talking about handling specific exceptions on x86. It is useful and sensible terminology, though, and you could use it in other contexts. But if you want to make the distinction, you should clarify what you mean by each term so everyone's on the same page.

Wordage answered 21/3, 2018 at 5:15 Comment(5)
Just as a quick sanity check while I have your attention. Do you think my own "definitions" in the terminology section are correct? Or do you disagree with some of them? - Linis
@AlanSTACK: not really. Your trap definition is too narrow. For example, Andy Glew said (in the answer I linked) an exception can be "E.g. a page fault, or an undefined instruction trap." I think you could say that a hardware page fault is a trap. The term includes but isn't limited to intentional traps to make system calls, like x86's int or syscall instructions. (int is kind of an unfortunate name. trap would have been a better name than software-interrupt. MMIX's equivalent instruction is called trap.) - Wordage
@AlanSTACK: and if you're talking about the CPU, it doesn't make assumptions about how user-space and kernel space interact / trust each other, and doesn't know about processes. So it doesn't make sense to call anything an abort except a hardware failure. (But x86 calls those "machine-check exceptions" for monitoring ECC memory and stuff). At the very least, the kernel can always print what caused the fault, or let a debugger hook the exception so you can manually modify code or data so it won't fault on retry. - Wordage
If the answer is related to Intel machines only, then the terms trap, abort and fault have precise definitions: A trap is raised after an instruction successfully completed, the return address points after the trapping inst (only #DB and #OF are traps). A fault is raised after an attempted execution is made and then rolled back, the return address points to the faulting inst (almost everything is a fault). An abort is when the return address points to an unrelated location (i.e. for #DF and #MC). So #UD cannot be a trap. - Panarabism
@MargaretBloom: Thanks, finally some precise definitions that make sense. Added a section at the bottom for x86 exception terminology. (Hopefully I got this mostly right; I don't really do x86 OS dev stuff, I just read about it on SO :P) - Wordage

There are two types of exceptions: faults and traps. When a fault occurs, the instruction can be restarted. When a trap occurs, the instruction cannot be restarted.

For example, when a page fault occurs, the operating system's exception handler loads the missing page and then restarts the instruction that caused the fault.

If the processor has defined a "segmentation fault" then the instruction causing the exception is restartable—but it is possible that the operating system's handler might not restart the instruction.

Nahum answered 21/3, 2018 at 4:8 Comment(3)
Is my understanding wrong or are you confusing trap with abort? - Linis
@AlanSTACK: I think this answer is using trap to talk about instructions like int or syscall which raise an exception, but that's intended and the kernel should return to the instruction after. But faults (like page faults) should be handled by re-running the instruction after fixing the problem. This isn't great terminology, because what do you call an illegal instruction exception? You can restart it, but it will cause another exception unless the exception handler modifies the code bytes or jumps somewhere else. - Wordage
If there are standard computer-architecture or operating-systems definitions for these terms, this answer would be better if it clarified the context of the definitions, and the exact meaning. As is, it's mostly just saying that hardware page faults aren't the same thing as SIGSEGV, but not very explicitly. - Wordage
