How does remap_pfn_range remap kernel memory to user space?

Asked 9/1, 2012 at 12:15 Answered 13/6, 2024 at 13:48

linux-kernel linux-device-driver kernel-module virtual-address-space

The remap_pfn_range function (used in a driver's mmap call) can be used to map kernel memory to user space. How is it done? Can anyone explain the precise steps? Kernel mode is a privileged mode (PM), while user space is non-privileged (NPM). In PM the CPU can access all memory, while in NPM some memory is restricted and cannot be accessed by the CPU. When remap_pfn_range is called, how does a range of memory that was restricted to PM become accessible to user space?

Looking at the remap_pfn_range code, there is a pgprot_t struct. This is a struct related to protection mapping. What is protection mapping? Is it the answer to the question above?

Perdure answered 9/1, 2012 at 12:15 Comment(0)

It's simple really: kernel memory (usually) simply has a page table entry with the architecture-specific bit that says "this page table entry is only valid while the CPU is in kernel mode".

What remap_pfn_range does is create another page table entry, at a different virtual address, for the same physical memory page, and that entry doesn't have that bit set.

Usually, it's a bad idea btw :-)
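The "two entries, one physical page" idea can be sketched with a toy model in plain C. This is not kernel code: the struct, the flag names and the helper functions are all hypothetical, and the flag polarity follows the x86 convention (a bit that grants user access), which may be inverted on other architectures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy page table entry (not a real layout). */
#define TOY_PTE_PRESENT (1u << 0)
#define TOY_PTE_WRITE   (1u << 1)
#define TOY_PTE_USER    (1u << 2)  /* set: user mode may use this entry */

struct toy_pte {
    uint64_t pfn;    /* physical page frame number */
    uint32_t flags;  /* TOY_PTE_* bits */
};

/* Models the check the MMU applies on every access through an entry. */
bool toy_access_ok(struct toy_pte pte, bool user_mode, bool is_write)
{
    if (!(pte.flags & TOY_PTE_PRESENT))
        return false;
    if (user_mode && !(pte.flags & TOY_PTE_USER))
        return false;
    if (is_write && !(pte.flags & TOY_PTE_WRITE))
        return false;
    return true;
}

/* Derives a second entry for the same frame that user mode may use. */
struct toy_pte toy_make_user_alias(struct toy_pte kernel_pte)
{
    struct toy_pte user_pte = kernel_pte;  /* same pfn, same base flags */
    user_pte.flags |= TOY_PTE_USER;
    return user_pte;
}
```

The kernel's own entry is left untouched; the user-accessible entry is a separate one that happens to name the same frame.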

Arris answered 9/1, 2012 at 12:41 Comment(2)

Does the kernel remove this pte once the map is no longer required? How is this cleaned up? – Evanthe 1/12, 2022 at 15:37

@sham1810: mmap returns a raw pointer. Therefore I would say either when the process calls munmap() or when it exits, aborts or gets killed. – Insoluble 13/6, 2024 at 13:14

The core of the mechanism is the MMU's page tables:

[Two figures of x86 MMU page tables]

Both pictures above show characteristics of the x86 hardware MMU and have nothing to do with the Linux kernel.

The figure below shows how the VMAs are linked to the process's task_struct:

[Figure; source: slideplayer.com]

And looking into the function itself here:

http://lxr.free-electrons.com/source/mm/memory.c#L1756

The data in physical memory can be accessed by the kernel through the kernel's PTE, as shown below:

[Figure; source: tldp.org]

But after remap_pfn_range() is called, a PTE is derived (with different page protection flags) for existing kernel memory so that user space can access it. The process's VMA is updated to use this PTE to access the same memory, which avoids wasting memory on a copy. The kernel and user-space PTEs have different attributes, which control access to the physical memory, and the VMA also specifies attributes at the process level:

vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
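As a rough illustration of what "deriving PTEs for a range" means, here is a toy user-space model in C. Everything in it is an assumption made for the sketch: a single flat table instead of the real multi-level page tables, and the hypothetical names toy_mm, toy_entry and toy_remap_range. What it does share with remap_pfn_range is the overall shape: it walks the virtual range page by page and points each entry at the next consecutive page frame, filling in every entry at call time.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1ul << TOY_PAGE_SHIFT)
#define TOY_NPAGES     16           /* size of the toy address space, in pages */
#define TOY_PRESENT    (1u << 0)

struct toy_entry {
    uint64_t pfn;   /* physical page frame number */
    uint32_t prot;  /* protection bits, plus TOY_PRESENT once mapped */
};

/* One flat table indexed by virtual page number; real tables are multi-level. */
struct toy_mm {
    struct toy_entry table[TOY_NPAGES];
};

/*
 * Toy counterpart of remap_pfn_range(): map `size` bytes starting at virtual
 * address `addr` onto consecutive frames starting at `pfn`.
 * Returns 0 on success, -1 on bad alignment or an out-of-range request.
 */
int toy_remap_range(struct toy_mm *mm, unsigned long addr,
                    uint64_t pfn, unsigned long size, uint32_t prot)
{
    if (addr % TOY_PAGE_SIZE || size % TOY_PAGE_SIZE || size == 0)
        return -1;

    unsigned long first = addr >> TOY_PAGE_SHIFT;
    unsigned long count = size >> TOY_PAGE_SHIFT;
    if (first >= TOY_NPAGES || count > TOY_NPAGES - first)
        return -1;

    /* Every entry in the range is filled in now, not on first access. */
    for (unsigned long i = 0; i < count; i++) {
        mm->table[first + i].pfn = pfn + i;
        mm->table[first + i].prot = prot | TOY_PRESENT;
    }
    return 0;
}
```

For example, toy_remap_range(&mm, 2 * TOY_PAGE_SIZE, 100, 3 * TOY_PAGE_SIZE, 0) fills slots 2, 3 and 4 with frames 100, 101 and 102.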

Proton answered 31/1, 2012 at 7:23 Comment(4)

"part of that coincide with that of kernel's page table, which is NOT duplicated for each process" when you say that do you mean there is only one page-table copy for the kernel mapping that is used by all processes? Could you please elaborate more on how that could be done? – Winer 2/9, 2015 at 2:9

Perhaps read this: turkeyland.net/projects/overflow/intro.php. From the picture you can see that one process has ONE set of page tables, whose base address is loaded into the CR3 register. All the virtual addresses (kernel ones specifically) that are to be shared among different processes have entries with the same value, pointing to the same physical page. Hope that clears it up. – Proton 3/9, 2015 at 2:47

How does one hold the "mm semaphore"? – Nineteenth 29/12, 2015 at 20:3

The semaphore is per-process (it lives in the process's mm_struct), but multiple concurrent threads inside the process may contend for it, so it has to be acquired with down_read() and released with up_read(). – Proton 20/11, 2016 at 0:20

The internal bookkeeping with PTE objects explained by Peter Teoh in 2012 is just how it is, so that the Linux kernel runs on various current hardware. But the OP asked specifically about (1) character device drivers and (2) memory protection and pgprot_t objects.

When a userland process accesses memory it does not own, the hardware raises a page fault. The kernel catches it and, in do_page_fault, either makes the page available to the process or kills the process to prevent damage, because there clearly is a programming error.

We can prevent that by remapping kernel memory (e.g. known physical addresses, static memory, or memory allocated by kmalloc, get_free_page and friends) into the user-space process. But when and where exactly? The OP mentioned a character driver that implements mmap, which is a function pointer in the file_operations object that the driver fills in when it loads. When a userland process calls mmap from the standard C library, passing it a file descriptor obtained by opening the driver's device node, the kernel calls the registered mmap function.

It is within that function that remap_pfn_range is called. And the OP's question was how exactly this works: "Can anyone explain precise steps?"
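For reference, the user-space side of that call sequence looks roughly like the sketch below. The function names map_node and demo_roundtrip are made up for this example, and the demo deliberately uses a regular temporary file under /tmp as a stand-in for a device node, so that it runs without any driver loaded. With a real character device, the same open() plus mmap() pair is what triggers the driver's mmap handler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Open `path` and map `len` bytes of it read/write and shared.
 * Note that mmap() takes a file descriptor, not a path.
 * Returns MAP_FAILED on error.
 */
void *map_node(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p;
}

/*
 * Stand-in demo: a temporary file plays the role of the device node.
 * Returns 0 if a store through the mapping is visible via pread().
 */
int demo_roundtrip(void)
{
    char path[] = "/tmp/remap_demo_XXXXXX";
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char buf[4];
    char *p;
    int rc = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0)
        goto out;

    p = map_node(path, len);
    if (p == MAP_FAILED)
        goto out;
    memcpy(p, "ping", 4);        /* store through the user mapping */
    if (munmap(p, len) != 0)     /* removes this process's mapping */
        goto out;

    if (pread(fd, buf, 4, 0) == 4 && memcmp(buf, "ping", 4) == 0)
        rc = 0;
out:
    close(fd);
    unlink(path);
    return rc;
}
```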

Well, this is where words start to fail and we have to turn to source code. Here is the mmap implementation of /dev/mem by Torvalds himself. As you can see it's basically a wrapper around remap_pfn_range. Unfortunately, if you look into the code you don't see any magic happening, just more bookkeeping. It reserves the page for the user process and modifies the pgprot_t value.

As far as I can tell from the code, no page fault is involved for such a mapping: remap_pfn_range fills in the page table entries for the whole range when the driver's mmap function runs, so the process can access the mapped memory as soon as mmap returns.

To give up the mapping, the process calls munmap; the underlying pages remain owned by the kernel.

This is a complex topic and so it would be good if someone would confirm/criticize/expand on my statements.

Insoluble answered 13/6, 2024 at 13:48 Comment(0)
