Is it possible to map a process into memory without mapping the kernel?

The OSDev wiki says that:

It is traditional and generally good to have your kernel mapped in every user process

Why is that though? Can't the process be mapped to memory solely? What are the advantages of mapping the kernel and wouldn't that be a waste of space?

Also, is it possible to access the kernel space from the user space and why would I do that?

Carboni answered 20/10, 2017 at 10:57 Comment(0)

It is traditional and generally good to have your kernel mapped in every user process

So when you make a system call, the kernel doesn't have to change the page tables to access its own memory. Having all physical memory mapped all the time makes it cheaper for a read system call to copy stuff from anywhere in the pagecache, for example.

The GDT and IDT base addresses are virtual (lidt / lgdt) so interrupt handling requires that at least the page containing the IDT, and the interrupt-handler code it points to, are mapped while user-space is executing.

But as mitigation for Meltdown on Intel CPUs where user-space speculative reads can bypass the user/supervisor page-table permission bit, Linux does actually unmap most of the kernel while user-space executes. It needs to keep a "trampoline" mapped that swaps page tables to remap the kernel proper before jumping to the regular entry points, so interrupt handlers and system calls can work.

is it possible to access the kernel space from the user space and why would I do that?

Usually the kernel would disable this. Page table entries have a user/supervisor bit which controls whether it can be used when not in kernel mode (i.e. ring 3, I think). The kernel can thus leave its memory mapped while still protecting it from read/write by user-space. (See also this for a diagram of nesting of page directories.)

CPUs have a performance feature to support this use-case: there's a "global" bit in each PTE that (if set) means the CPU can keep it cached in the TLB even when CR3 changes (i.e. across context switches, when the kernel installs a new page table). The kernel sets this for the kernel mappings that it includes in every process.
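
To make the two bits above concrete, here is a minimal sketch of how they sit in an x86-64 page-table entry. The bit positions are the architectural ones; the helper names (make_kernel_pte, user_can_access) are made up for illustration and are not from any real kernel. It is also a simplification: real hardware checks the user/supervisor bit at every level of the page-table walk, not only in the final entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits of an x86-64 page-table entry (architectural bit positions). */
#define PTE_PRESENT  (1ULL << 0)  /* P: the translation is valid */
#define PTE_WRITABLE (1ULL << 1)  /* R/W: writes are allowed */
#define PTE_USER     (1ULL << 2)  /* U/S: 1 = ring 3 may use this mapping */
#define PTE_GLOBAL   (1ULL << 8)  /* G: TLB entry may survive a CR3 reload */

/* Hypothetical helper: build a leaf entry for a kernel page.
 * Present, writable and global, but with U/S clear so ring 3 cannot use it.
 * phys is assumed to be page-aligned, so its low 12 bits are zero. */
static uint64_t make_kernel_pte(uint64_t phys)
{
    return phys | PTE_PRESENT | PTE_WRITABLE | PTE_GLOBAL;
}

/* Hypothetical helper: would a ring-3 access through this entry be allowed?
 * Assumption: only the leaf entry is inspected (see the note above). */
static bool user_can_access(uint64_t pte, bool is_write)
{
    if (!(pte & PTE_PRESENT))
        return false;              /* not mapped at all */
    if (!(pte & PTE_USER))
        return false;              /* supervisor-only: kernel memory */
    if (is_write && !(pte & PTE_WRITABLE))
        return false;              /* read-only mapping */
    return true;
}
```

With these definitions, user_can_access(make_kernel_pte(0x200000), false) is false: the page stays mapped (and cached in the TLB thanks to G), but user-space still can't touch it.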

And BTW, there's probably only one physical copy of the tables for those kernel mappings, with the top-level Page Map Level 4 Table (PML4) for each different tree of user-space page tables simply pointing to the same kernel PDPTE structures (most/all of which are actually 1GiB hugepage mappings, rather than pointers to further levels of entries). See the diagram linked above.
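
A rough sketch of that sharing, modelled in plain C rather than real kernel code: when a new process's top-level table is created, the user half starts empty and the kernel half is copied from a template, so every process ends up pointing at the same lower-level kernel tables. The type and function names (pml4_t, pml4_init_for_process) are hypothetical, and the 256/256 slot split assumes 4-level paging with the kernel in the upper half.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PML4_ENTRIES      512  /* 512 eight-byte entries fill one 4 KiB page */
#define KERNEL_FIRST_SLOT 256  /* slots 256..511 cover the upper (kernel) half */

typedef struct {
    uint64_t entry[PML4_ENTRIES];
} pml4_t;

static_assert(sizeof(pml4_t) == 4096, "a PML4 must be exactly one page");

/* Hypothetical helper: initialise the top-level table of a new process.
 * User slots start empty. Kernel slots are copied from the template, so the
 * entries in every process refer to the same physical kernel tables and
 * nothing below the top level is duplicated. */
static void pml4_init_for_process(pml4_t *new_pml4,
                                  const pml4_t *kernel_template)
{
    memset(new_pml4->entry, 0, KERNEL_FIRST_SLOT * sizeof(uint64_t));
    memcpy(&new_pml4->entry[KERNEL_FIRST_SLOT],
           &kernel_template->entry[KERNEL_FIRST_SLOT],
           (PML4_ENTRIES - KERNEL_FIRST_SLOT) * sizeof(uint64_t));
}
```

The cost per process is therefore just the copied top-level entries (2 KiB of a page that has to exist anyway), not a second copy of the kernel's page tables.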


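There is actually a small amount of memory that the kernel allows user-space to read (and execute): the kernel maps a few 4k pages called the vDSO into the address space of every process. (On current Linux the vDSO sits at a randomized user-space address; it was the older vsyscall page that lived at a fixed address at the very top of virtual memory.)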

For a few simple but common system calls like gettimeofday() and clock_gettime(), user-space can call functions in these pages (which for example run rdtsc and scale the result by constants exported by the kernel) instead of using the syscall instruction to enter kernel mode and do the same thing there. This saves maybe 50 to 100 clock cycles for a round-trip to kernel mode on a modern x86 CPU, and more from not needing all the save/restore of stuff inside the kernel before dispatching to the right system call.
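
If you want to see those pages in your own process, here is a small Linux-only sketch. getauxval() and AT_SYSINFO_EHDR are real glibc/Linux interfaces; the wrapper names (vdso_base, looks_like_elf) are just for illustration. It only locates the vDSO and checks that it starts with an ELF header; it doesn't look up or call any function inside it.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR (Linux) */

/* Base address of the vDSO as reported by the kernel in the auxiliary
 * vector, or 0 if this process has no vDSO mapped. */
static unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* The vDSO is an ordinary ELF shared object, so it should begin with the
 * four magic bytes 0x7f 'E' 'L' 'F'. The string literal is split so that
 * the 'E' is not swallowed by the \x7f hex escape. */
static bool looks_like_elf(unsigned long base)
{
    return base != 0 && memcmp((const void *)base, "\x7f" "ELF", 4) == 0;
}
```

Reading from that address works from plain user-space code, which is the point: this is kernel-provided memory that is deliberately mapped with user access allowed.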


Is it possible to map a process into memory without mapping the kernel?

With a 32-bit process on a 64-bit kernel, the entire 4GiB virtual address space is available for user-space. (Except for 3 or so 4k VDSO pages.)

Otherwise (when user-space virtual addresses are as wide as kernel-space virtual addresses) Linux uses the upper half for kernel mapping of all physical memory (with 1G hugepages on x86).
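
As an illustration of the upper-half / lower-half split on x86-64 with 48-bit virtual addresses, here is a sketch that classifies an address. It relies on the architecture's canonical-address rule (bits 63..47 must all be equal), which isn't spelled out above; the enum and function names are made up for the example.

```c
#include <stdint.h>

enum va_half {
    VA_USER_HALF,      /* 0x0000000000000000 .. 0x00007fffffffffff */
    VA_KERNEL_HALF,    /* 0xffff800000000000 .. 0xffffffffffffffff */
    VA_NON_CANONICAL   /* anything in between: faults if used */
};

/* Classify a 64-bit virtual address, assuming 48-bit virtual addressing
 * (4-level paging). Bits 63..47 are 17 bits that must be all 0 or all 1. */
static enum va_half classify_va48(uint64_t va)
{
    uint64_t top = va >> 47;

    if (top == 0)
        return VA_USER_HALF;
    if (top == 0x1FFFF)
        return VA_KERNEL_HALF;
    return VA_NON_CANONICAL;
}
```

For example, classify_va48(0xffff800000000000) gives VA_KERNEL_HALF, while 0x0000800000000000 (one past the top of the user half) is non-canonical.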

i386 Linux has config options to make the split 1:3 (kernel:user), IIRC, further cramping the kernel but allowing more virtual address space for user-space processes. IDK if this is common for 32-bit kernels on other architectures, or only x86.

wouldn't that be a waste of space?

It takes up some virtual address space, but you're supposed to have more of that than you do physical memory. If you don't, you have to pay the speed cost of remapping memory more often.

This is why we have x86-64, so virtual address space is huge. 48 bits is 256 TiB, so half of that is 128 TiB of address space. Future CPUs could implement hardware support for wider virtual addresses if it becomes necessary / useful. (The page table format supports up to 52-bit physical addresses.) Maybe this will become more of an issue with non-volatile DIMMs providing memory-mapped storage with higher density than DRAM, and a reason to use a lot of both kinds of address space.
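
The numbers above are easy to double-check. A tiny sketch (the helper name is made up) that converts an address width in bits to TiB:

```c
#include <stdint.h>

#define TIB (1ULL << 40)   /* one tebibyte in bytes */

/* Size in TiB of an address space that is `bits` wide.
 * Only meaningful for 40 <= bits < 64 in this simple form. */
static uint64_t addr_space_tib(unsigned bits)
{
    return (1ULL << bits) / TIB;
}
```

addr_space_tib(48) is 256 and addr_space_tib(47) is 128, matching the figures above; for the 52-bit physical limit it gives 4096 TiB, i.e. 4 PiB.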

If you need more than 2GiB of virtual address space in a single process, use a 64-bit system. (Or if you need a zillion processes / threads, use a 64-bit kernel at least. A 32-bit kernel with PAE runs into memory-allocation problems sometimes. See some https://serverfault.com/ questions.)

Someone reposted on their blog some of Linus Torvalds' comments about PAE (Physical Address Extensions) which allows having more than 4GB of physical memory on a 32-bit-only x86 system. Summary: yuck, even with a good kernel-side implementation, it's definitely slower than a 64-bit kernel. Except with more amusing insults at the Intel engineers who thought it would be a good idea and solve the problem for 32-bit OSes.

Yorke answered 20/10, 2017 at 11:34 Comment(3)
" i.e. that every process has this same page-table entry", can you elaborate? - Carboni
@Trey: It means the CPU can keep the TLB entry cached even when CR3 changes. If you don't understand that, read about it on wiki.osdev.org or other stuff (like Intel's manuals) in the [x86 tag wiki](stackoverflow.com/tags/x86/info). The implications for how an OS could use that should be obvious. - Yorke
@Trey: I forgot something: VDSO pages are actually exported from the kernel for user space to read/execute. Updated my answer again. (I also reworded that paragraph about the global bit to make it clearer it's just a performance boost for this use-case.) - Yorke
