Linux memory segmentation
Looking into the internals of Linux and memory management, I just stumbled upon the segmented paging model that Linux uses.

Correct me if I am wrong, but Linux (protected mode) does use paging for mapping a linear virtual address space to the physical address space. This linear address space constituted of pages, is split into four segments for the process flat memory model, namely:

  • The kernel code segment (__KERNEL_CS);
  • The kernel data segment (__KERNEL_DS);
  • The user code segment (__USER_CS);
  • The user data segment (__USER_DS);

A fifth memory segment known as the Null segment is present but unused.

These segments have a CPL (Current Privilege Level) of either 0 (supervisor) or 3 (userland).

To keep it simple, I will concentrate on the 32-bit memory mapping, with a 4GiB addressable space, 3GiB being for the userland process space (shown in green), 1GiB being for the supervisor kernel space (shown in red):

Virtual Memory Space

So the red part consists of two segments __KERNEL_CS and __KERNEL_DS, and the green part of two segments __USER_CS and __USER_DS.

These segments overlap each other. Paging will be used for userland and kernel isolation.

However, as extracted from Wikipedia here:

[...] many 32-bit operating systems simulate a flat memory model by setting all segments' bases to 0 in order to make segmentation neutral to programs.

Looking into the linux kernel code for the GDT here:

[GDT_ENTRY_KERNEL32_CS]       = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_CS]         = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS]         = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_DS]   = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_CS]   = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),

As Peter pointed out, each segment begins at 0, but what are those flags, namely 0xc09b, 0xa09b and so on? I tend to believe they are the segment selectors; if not, how would I be able to access the userland segment from the kernel segment, if both their address spaces start at 0?

Segmentation is not used. Only paging is used. Segments have their base addresses set to 0 and their limits to 0xFFFFF (with 4KiB granularity), so each one covers the full linear address space. That means that logical addresses are no different from linear addresses.

Also, since all segments overlap each other, is it the paging unit which provides memory protection (i.e. the memory separation)?

Paging provides protection, not segmentation. The kernel will check the linear address space, and, according to a boundary (often known as TASK_MAX), will check the privilege level for the requested page.

Meow answered 20/5, 2019 at 2:4 Comment(16)
will check the privilege level for the requested page.. No, that's not a very good way to express it. For a userspace-supplied address, the kernel doesn't need to check whether it's user or kernel, it just needs to check it against the task's logical memory map (which the task manages with mmap and brk). Because we have a flat memory model, it's just simple integer comparisons, and kernel addresses will never be part of a task's valid virtual address space.Brandy
The kernel doesn't depend on HW to signal a page fault on access to invalid pages to detect -EFAULT, so it doesn't matter whether an invalid address for user-space happens to be mapped for the kernel (e.g. calling write() on a kernel address that happens to be mapped inside the kernel). All that matters is that valid user-space addresses are still valid in kernel mode, inside a system call.Brandy
Please don't keep try to edit an answer into the question. Feel free to post an answer as an answer if you have one, so people can up/down vote on it separately, and so your answer doesn't have a special place above other answers. Using strike-through on parts of the original question is kind of ok, to note misconceptions as long as the original question is still there, not invalidating existing answers. Redefining your question by adding new misconceptions creates a moving target for answers.Brandy
So the kernel will just verify that the requested address will not exceed the TASK_MAX defined value, not relying on HW but a simple comparison, and emits -EFAULT according to this rule.Meow
Yes @PeterCordes, I should have added my own answer to the question, I will keep that in mind when asking again.Meow
lolwut, no, of course it has to check whether the address is in the range of any of the mappings for that task, not just whether it's below TASK_MAX. e.g. look at less /proc/self/maps to see the mappings for a simple process. Passing an address not part of one of those also needs to return -EFAULT, for example the address 0.Brandy
You can and should edit this question now to remove the attempt to answer, and post them as an answer. You might leave the question still with strike-through and a note that explains those sections are now known to be misconceptions, so future readers aren't confused by them.Brandy
So the kernel will load the page table in order to verify if the mapping is correct. The page table is in memory so yes, it just has to check against it. Am I right ? As for the edit, I’ll do that when I get back on my computer to make it clean.Meow
No, the kernel keeps track of logical mappings separate from the hardware page tables. That's why not all page faults are invalid (during normal user-space execution, not inside system calls); e.g. soft and hard page faults (copy-on-write or lazy mapping, or page not present) are #PF exceptions in hardware because the PTE isn't present + valid (+ writeable), but the kernel doesn't deliver SIGSEGV; it does the copy-on-write or whatever and returns to user-space which will re-run the faulting instruction successfully. This is a "valid" page fault.Brandy
So when a userland process tries to access an invalid address, it is in fact a “soft” page fault, and the kernel will compare the address against its logical page mapping, and raises -EFAULT to the process.Meow
No, almost everything about that sentence is backwards and/or wrong. You get a -EFAULT return value from passing a bad address to a system call. If you actually dereference a bad pointer in userspace, e.g. mov eax, [0], it's not a hard or soft page-fault, it's an invalid page-fault and the kernel delivers a SIGSEGV signal to your process. The page-fault handler has to sort out whether it's a valid or invalid page fault by checking the address against the logical memory map, the same way the kernel does to decide to return -EFAULT or not.Brandy
The kernel will check the address against its logical memory map and in case of a bad address dereference, will only send SIGSEGV to the process (which you can ignore, but then the address will not be addressed whatsoever).Meow
Close, but ignoring SIGSEGV isn't useful, that creates an infinite retry loop. Why can't I ignore SIGSEGV signal?. Page fault exceptions are taken with RIP pointing at the faulting instruction, so valid page faults can re-run the instruction. Invalid page faults, and by consequence SIGSEGV signals, get the same behaviour. This is what you want anyway, because you can't reliably decode backwards to find and print the faulting instruction in a debug log; x86 machine code is variable length without synchronization markers.Brandy
So the bad address dereferencing expression will be re-run by the kernel indefinitely ? Wouldn’t the kernel send an uncatchable signal like SIGKILL to terminate the process for good ?Meow
No, if user-space foolishly sets SIGSEGV to SIG_IGN, the kernel doesn't special-case that. It's just a terrible idea, and not much different from catching it and returning from the handler without fixing the problem if that's what user-space chooses to do. But note that another thread could be cross-modifying the machine code of the thread stuck in a SIGSEGV-ignored loop. Or more simply, another thread could make an mmap system call that results in the memory access no longer faulting. Or if a debugger is single-stepping the faulting process with ptrace, there's no loop.Brandy
cross-site near duplicate of Does Linux not use segmentation but only paging?Brandy

Yes, Linux uses paging so all addresses are always virtual. (To access memory at a known physical address, Linux keeps all physical memory 1:1 mapped to a range of kernel virtual address space, so it can simply index into that "array" using the physical address as the offset. Modulo complications for 32-bit kernels on systems with more physical RAM than kernel address space.)

This linear address space constituted of pages, is split into four segments

No, Linux uses a flat memory model. The base and limit for all 4 of those segment descriptors are 0 and -1 (unlimited). i.e. they all fully overlap, covering the entire 32-bit virtual linear address space.

So the red part consists of two segments __KERNEL_CS and __KERNEL_DS

No, this is where you went wrong. x86 segment registers are not used for segmentation; they're x86 legacy baggage that's only used for CPU mode and privilege-level selection on x86-64. Instead of adding new mechanisms for that and dropping segments entirely for long mode, AMD just neutered segmentation in long mode (base fixed at 0 like everyone used in 32-bit mode anyway) and kept using segments only for machine-config purposes that are not particularly interesting unless you're actually writing code that switches to 32-bit mode or whatever.

(Except you can set a non-zero base for FS and/or GS, and Linux does so for thread-local storage. But this has nothing to do with how copy_from_user() is implemented, or anything. It only has to check that pointer value, not with reference to any segment or the CPL / RPL of a segment descriptor.)

In 32-bit legacy mode, it is possible to write a kernel that uses a segmented memory model, but none of the mainstream OSes actually did that. Some people wish that had become a thing, though, e.g. see this answer lamenting x86-64 making a Multics-style OS impossible. But this is not how Linux works.

Linux is a https://wiki.osdev.org/Higher_Half_Kernel, where kernel pointers have one range of values (the red part) and user-space addresses are in the green part. The kernel can simply dereference user-space addresses if the right user-space page tables are mapped; it doesn't need to translate them or do anything with segments. This is what it means to have a flat memory model. (The kernel can use "user" page-table entries, but not vice versa). For x86-64 specifically, see https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt for the actual memory map.


The only reason those 4 GDT entries all need to be separate is for privilege-level reasons, and that the data vs. code segments descriptors have different formats. (A GDT entry contains more than just the base/limit; those are the parts that need to be different. See https://wiki.osdev.org/Global_Descriptor_Table)

And especially https://wiki.osdev.org/Segmentation#Notes_Regarding_C which describes how and why the GDT is typically used by a "normal" OS to create a flat memory model, with a pair of code and data descriptors for each privilege level.

For a 32-bit Linux kernel, only gs gets a non-zero base for thread-local storage (so addressing modes like [gs: 0x10] will access a linear address that depends on the thread that executes it). Or in a 64-bit kernel (and 64-bit user-space), Linux uses fs. (Because x86-64 made GS special with the swapgs instruction, intended for use with syscall for the kernel to find the kernel stack.)

But anyway, the non-zero bases for FS or GS don't come from a GDT entry; they're set with the wrfsbase/wrgsbase instructions. (Or on CPUs that don't support those, with a write to an MSR).


but what are those flags, namely 0xc09b, 0xa09b and so on? I tend to believe they are the segment selectors

No, segment selectors are indices into the GDT. The kernel is defining the GDT as a C array, using designated-initializer syntax like [GDT_ENTRY_KERNEL32_CS] = initializer_for_that_selector.

(Actually the low 2 bits of a selector, i.e. segment register value, are the requested privilege level, and bit 2 selects the GDT vs. the LDT. So GDT_ENTRY_DEFAULT_USER_CS should be __USER_CS >> 3.)

mov ds, eax triggers the hardware to index the GDT, not linear search it for matching data in memory!

GDT data format:

You're looking at x86-64 Linux source code, so the kernel will be in long mode, not protected mode. We can tell because there are separate entries for USER_CS and USER32_CS. The 32-bit code segment descriptor will have its L bit cleared. The current CS segment descriptor is what puts an x86-64 CPU into 32-bit compat mode vs. 64-bit mode. To enter 32-bit user-space, an iret or sysret will set CS:RIP to a user-mode 32-bit segment selector.

I think you can also have the CPU in 16-bit compat mode (like compat mode not real mode, but the default operand-size and address size are 16). Linux doesn't do this, though.

Anyway, as explained in https://wiki.osdev.org/Global_Descriptor_Table and Segmentation,

Each segment descriptor contains the following information:

  • The base address of the segment
  • The default operation size in the segment (16-bit/32-bit)
  • The privilege level of the descriptor (Ring 0 -> Ring 3)
  • The granularity (Segment limit is in byte/4kb units)
  • The segment limit (The maximum legal offset within the segment)
  • The segment presence (Is it present or not)
  • The descriptor type (0 = system; 1 = code/data)
  • The segment type (Code/Data/Read/Write/Accessed/Conforming/Non-Conforming/Expand-Up/Expand-Down)

These are the extra bits. I'm not particularly interested in which bits are which because I (think I) understand the high level picture of what different GDT entries are for and what they do, without getting into the details of how that's actually encoded.

But if you check the x86 manuals or the osdev wiki, and the definitions for those init macros, you should find that they result in a GDT entry with the L bit set for 64-bit code segments, cleared for 32-bit code segments. And obviously the type (code vs. data) and privilege level differ.

Brandy answered 20/5, 2019 at 2:46 Comment(2)
Comments are not for extended discussion; this conversation has been moved to chat.Romberg
Update: 32-bit kernels may have had to use a GDT entry to set the FS and GS base. I think even modern CPUs don't support the MSRs for that in 32-bit mode, and wrfsbase/wrgsbase are only supported in full 64-bit mode (not legacy or compat modes)Brandy

Disclaimer

I am posting this answer to clear this topic of any misconceptions (as pointed out by @PeterCordes).

Paging

The memory management in Linux (x86 protected mode) uses paging to map physical addresses into a virtualized flat linear address space, from 0x00000000 to 0xFFFFFFFF (on 32-bit), known as the flat memory model. Linux, along with the CPU's MMU (Memory Management Unit), maintains the mapping of every virtual and logical address to its corresponding physical address. Physical memory is usually split into 4KiB pages, to allow easier management of memory.

The kernel virtual addresses can be kernel logical addresses, directly mapped onto contiguous physical pages; other kernel virtual addresses are fully virtual addresses, mapped onto non-contiguous physical pages and used for large buffer allocations (exceeding the contiguous area on small-memory systems) and/or PAE memory (32-bit only). MMIO regions (Memory-Mapped I/O) are also mapped using kernel virtual addresses.

Every dereferenced address must be a virtual address, whether logical or fully virtual; physical RAM and MMIO regions are mapped into the virtual address space prior to use.

The kernel obtains a chunk of virtual memory using kmalloc(), pointed to by a virtual address, but more importantly, one that is also a kernel logical address, meaning it is directly mapped to contiguous physical pages (thus suitable for DMA). On the other hand, the vmalloc() routine returns a chunk of fully virtual memory, pointed to by a virtual address that is only contiguous in the virtual address space and is mapped to non-contiguous physical pages.

Kernel logical addresses use a fixed mapping between physical and virtual address space. This means virtually-contiguous regions are by nature also physically contiguous. This is not the case with fully virtual addresses, which may point to non-contiguous physical pages.

The user virtual addresses - unlike kernel logical addresses - do not use a fixed mapping between virtual and physical addresses; userland processes make full use of the MMU:

  • Only the used portions of physical memory are mapped;
  • Memory is non-contiguous;
  • Memory may be swapped out;
  • Memory can be moved.

In more detail, physical memory pages of 4KiB are mapped to virtual addresses in the OS page table, with each mapping described by a PTE (Page Table Entry). The CPU's MMU keeps a cache of the most recently used PTEs from the OS page table. This caching area is known as the TLB (Translation Lookaside Buffer). The cr3 register is used to locate the OS page table.

Whenever a virtual address needs to be translated into a physical one, the TLB is searched first. If a match is found (TLB hit), the physical address is returned and accessed. If there is no match (TLB miss), the CPU's page-walk hardware looks up the page table; on x86 this walk is done entirely in hardware and is invisible to software (except through performance counters). If a valid PTE is found, it is cached in the TLB and the access continues with no fault at all. Only when the walk finds no valid PTE does the CPU raise a #PF exception and enter the OS page fault handler. A minor page fault is a #PF that the kernel can resolve without disk I/O, e.g. a copy-on-write or a lazily allocated page that just needs its PTE wired up; the faulting instruction is then restarted and succeeds.

Sometimes, the OS extends the effective amount of memory by evicting pages to the hard disk (swapping). If a virtual address resolves to a page currently stored on disk, the page needs to be loaded back into physical RAM prior to being accessed, with the OS page fault handler finding a free page frame for it. This is known as a major page fault.

The translation process may also fail because no mapping exists at all for the virtual address, meaning that the virtual address is invalid. This is known as an invalid page fault, and a segfault (SIGSEGV) will be delivered to the process by the OS page fault handler.

Memory segmentation

Real mode

Real mode still uses a 20-bit segmented memory address space, with 1MiB of addressable memory (0x00000 - 0xFFFFF) and unrestricted direct software access to all addressable memory, bus addresses, PMIO ports (Port-Mapped I/O) and peripheral hardware. Real mode provides no memory protection, no privilege levels and no virtualized addresses. A segment register holds a 16-bit segment value, and a memory operand is an offset relative to the segment base (segment * 16).

Since C usually assumes a flat memory model, real-mode compilers introduced the non-standard far pointer type to represent a segment:offset logical address. For instance, the logical address 0x5555:0x0005 yields, after computing 0x5555 * 16 + 0x0005, the 20-bit physical address 0x55555, usable through a far pointer as shown below:

char far *ptr;                  /* declare a far pointer */
ptr = (char far *)0x55550005L;  /* high word = segment 0x5555, low word = offset 0x0005 */

As of today, most modern x86 CPUs still start in real mode for backwards compatibility and switch to protected mode thereafter.

Protected mode

In protected mode, with the flat memory model, segmentation is unused. The four segments, namely __KERNEL_CS, __KERNEL_DS, __USER_CS and __USER_DS, all have their base addresses set to 0. These segments are just legacy baggage from the former x86 model where segmented memory management was used. In protected mode, since all segments' base addresses are set to 0, logical addresses are equivalent to linear addresses.

Protected mode with the flat memory model means no segmentation. The only exception where a segment has its base address set to a value other than 0 is when thread-local storage is involved. The FS (and GS on 64-bit) segment registers are used for this purpose.

However, segment registers such as SS (stack segment register), DS (data segment register) or CS (code segment register) are still present and used to store 16-bit segment selectors, which contain indexes to segment descriptors in the LDT and GDT (Local & Global Descriptor Tables).

Each instruction that touches memory implicitly uses a segment register. Depending on the context, a particular segment register is used: for instance, the JMP instruction uses CS while PUSH uses SS. Selectors can be loaded into segment registers with instructions like MOV, the sole exception being the CS register, which is only modified by instructions affecting the flow of execution, like far CALL or JMP.

The CS register is particularly useful because it keeps track of the CPL (Current Privilege Level) in its segment selector, thus recording the privilege level of the currently executing code. This 2-bit CPL value is always equal to the CPU's current privilege level.

Memory protection

Paging

The CPU privilege level, also known as the mode bit or protection ring, from 0 to 3, restricts some instructions that could subvert the protection mechanism or cause chaos if allowed in user mode, so they are reserved to the kernel. An attempt to run them outside of ring 0 causes a general-protection fault exception, as does an invalid segment access (privilege, type, limit or read/write violation). Likewise, any access to memory and MMIO devices is restricted by privilege level, and every attempt to access a protected page without the required privilege level causes a page fault exception.

The mode bit is automatically switched from user mode to supervisor mode whenever an interrupt, either software (e.g. a syscall) or hardware (an IRQ), occurs.

On a 32-bit system, only 4GiB of memory can be effectively addressed, and that space is split 3GiB/1GiB. Linux (with paging enabled) uses a protection scheme known as the higher half kernel, where the flat address space is divided into two ranges of virtual addresses:

  • Addresses in the range 0xC0000000 - 0xFFFFFFFF are kernel virtual addresses (red area). The 896MiB range 0xC0000000 - 0xF7FFFFFF directly maps kernel logical addresses 1:1 onto physical addresses in the contiguous low-memory pages (using the __pa() and __va() macros). The remaining 128MiB range 0xF8000000 - 0xFFFFFFFF is then used to map virtual addresses for large buffer allocations, MMIO regions (Memory-Mapped I/O) and/or PAE memory onto the non-contiguous high-memory pages (using ioremap() and iounmap()).

  • Addresses in the range 0x00000000 - 0xBFFFFFFF are user virtual addresses (green area), where userland code, data and libraries reside. The mapping can be onto non-contiguous low-memory and high-memory pages.

High-memory is only needed on 32-bit systems. All memory allocated with kmalloc() has a logical virtual address (with a direct physical mapping); memory allocated with vmalloc() has a fully virtual address (but no direct physical mapping). 64-bit systems have a huge addressing capability and hence do not need high-memory, since every page of physical RAM can be effectively addressed.

Linux memory management

The boundary address between the supervisor higher half and the userland lower half is known as TASK_SIZE_MAX in the Linux kernel. The kernel checks that every userland-supplied virtual address resides below that boundary, as seen in the code below:

static int fault_in_kernel_space(unsigned long address)
{
    /*
     * On 64-bit systems, the vsyscall page is at an address above
     * TASK_SIZE_MAX, but is not considered part of the kernel
     * address space.
     */
    if (IS_ENABLED(CONFIG_X86_64) && is_vsyscall_vaddr(address))
        return false;

    return address >= TASK_SIZE_MAX;
}

If a userland process tries to access a memory address higher than TASK_SIZE_MAX, the do_kern_addr_fault() routine will call the __bad_area_nosemaphore() routine, eventually signaling the faulting task with a SIGSEGV (using get_current() to get the task_struct):

/*
 * To avoid leaking information about the kernel page table
 * layout, pretend that user-mode accesses to kernel addresses
 * are always protection faults.
 */
if (address >= TASK_SIZE_MAX)
    error_code |= X86_PF_PROT;

force_sig_fault(SIGSEGV, si_code, (void __user *)address, tsk); /* Kill the process */

Pages also have a privilege bit, known as the User/Supervisor flag; features such as SMAP (Supervisor Mode Access Prevention) and SMEP (Supervisor Mode Execution Prevention) build on it to prevent the kernel from inadvertently accessing or executing user pages.

Segmentation

Older architectures using segmentation usually perform segment access verification using the GDT privilege bits of each requested segment. The privilege level of the requested segment, known as the DPL (Descriptor Privilege Level), is compared against the CPL of the currently running code (and the selector's RPL), requiring max(CPL, RPL) <= DPL for a data segment. If the check passes, memory access to the requested segment is allowed.

Meow answered 22/5, 2019 at 1:14 Comment(16)
I/O ports: no, I/O address space (in / out instructions) is separate and not virtualized. There'd be no point because the actual final address is what determines which device you access! For MMIO addresses (in physical memory address space), yes you have to work around paging to access the physical address you want. (With pages mapped to the right physical addresses).Brandy
x86 does page walks in hardware. What you're describing as a "minor page fault" is actually just a TLB miss. Replaying a load uop happens, but that's very much an implementation detail! It's not visible to software (other than performance counters). An actual minor page fault is when the page walk doesn't find valid PTE, and has to raise a #PF exception, but the kernel doesn't have to wait for disk I/O to resolve it. (e.g. just a copy-on-write of a newly allocated page, for example.) Read Can a page fault handler generate more page faults? againBrandy
I was referring to MMIO (Memory-Mapped I/O), I don't really know about dedicated I/O ports, apart from the fact that they use special instructions. Do you have more information?Meow
I don't understand what you mean with TLB miss, I mentioned that in my answer too.Meow
"I/O port" means in / out instructions, not MMIO. The instructions are felixcloutier.com/x86/IN.html and out. See also opensecuritytraining.info/IntroBIOS_files/… which looks like a decent set of slides that I skimmed for a minute. It looks like it goes into more detail about the separate IO vs. memory address spaces, and hopefully that in/out generate PCI transactions in the PCI I/O address space. (PCI has 3 address spaces: memory, IO, and config). There isn't a separate bus.Brandy
re: TLB miss: you say "If one exists" ... instruction restarted ... "This is known as a minor page fault." No, it's not known as anything. The OS is not involved with this operation at all, it's pure hardware and invisible to software (except via performance counters). A page fault means there was a #PF exception that runs x86 instructions in the OS's page-fault handler function. Page walks are done in hardware, and can be done speculatively / out-of-order.Brandy
So what are true minor/major #PF exceptions ? Can you provide examples ? I will correct my answer then.Meow
I already explained this in comments on my answer, and a couple minutes ago in a comment. But here it is again since you can't be bothered to google major minor page fault and find en.wikipedia.org/wiki/Page_fault#Types. A minor page fault is a #PF that didn't have to wait for disk I/O, e.g. just a copy-on-write. A major page fault is one where the data for that page is on disk, not in memory anywhere, so the process has to block until it's paged in.Brandy
Why would this #PF exception be visible to programs then ? Will the current task be signaled or whatever ?Meow
No, valid page faults aren't visible to user-space, only invalid page faults result in delivery of SIGSEGV. Valid page faults are only visible to the kernel. (And software using time or perf to find out how many page faults occurred during execution.) How is that not obvious from reading the Wikipedia article?Brandy
Frankly speaking, I see no difference with my explanation of the minor page fault: "If one exists, it is written back to the TLB and the faulting instruction is restarted, this subsequent translation will then find a TLB hit and the memory access will continue.", and the Wikipedia one: "The page fault handler in the operating system merely needs to make the entry for that page in the memory management unit point to the page in memory and indicate that the page is loaded in memory". This is almost the same thing, apart from language sugar.Meow
If a hardware page walk finds a valid PTE, no page fault happens in the first place. x86 resolves TLB misses in hardware, only missing PTEs cause a software page fault. Out-of-order execution can continue while a page-walk is happening, but not around a #PF. See Can a page fault handler generate more page faults? and What happens after a L2 TLB miss?/Brandy
What do you mean by “copy-on-write” ? Copying a page to another location ? Or copying a PTE entry ? Btw, I edited my answer, you can review it and tell me if I am wrong, so I can correct mistakes. I already corrected the MMIO misinterpretation with actual I/O ports.Meow
en.wikipedia.org/wiki/Copy-on-write. Google terms you aren't familiar with.Brandy
So the only kind of minor page fault exceptions are when the copy-on-write mechanism is involved, as when forking a process ? Also, what are your thoughts on my edited answer, do I mention everything correctly and don’t mislead people reading ?Meow
No, COW isn't the only reason, the kernel's lazy allocation doesn't wire up new mappings into the hardware PTEs right away. That's what MAP_POPULATE is for. There's just too much to explain, and SO comments aren't the right place for writing a tutorial. Your answer is mostly ok, I think, but focuses on and emphasizes odd things and presents things in an odd way. I don't have the time or interest to read it in a lot of detail, sorry.Brandy