First some specific answers to your questions:
- Yes, on many OSes programs have significant control over their virtual memory, e.g.,
mmap
on UNIX-like OSes and similar APIs on Windows. Linux in particular has recently added several methods to allow advanced manipulation of user-visible buffers from the kernel without copying — but one of the interesting ones is no longer for this world (at least performance-wise).
- Usually, there are no specific limits to the number of page table entries per-process. Of course, you might run into other limits, such as per-process memory limits, physical memory limits and so on. Access to memory generally does not get slower with more entries. Of course, access to more pages in total may imply slower access (e.g., because you exceed the size of the TLB) - but that's not directly a function of more pages. The page table entries themselves are just sitting in RAM so you can have millions of them without an issue.
- Changing page table entries is reasonably fast on moderns OSes. For example, on my laptop, changing page table entries seems to take around ~120 ns per page (plus some fixed overhead for a system call).
- Yes, you can find examples out there, but they generally target fairly narrow scenarios. In fact, you can see that the mach's libc tries to use use MMU tricks for no less an important routine than memcpy!
Discussion
The core problem with using MMU tricks is that (a) you can only "zero copy" whole pages, which pretty much means at a 4K granularity or larger, along with similarly restrictive alignment (b) even if mmap
-type calls are fast, so are efficient memory copy routines!
Let's look first at (a). If I understand you correctly, you want to accelerate insertion into something like std::vector
by using MMU tricks to shift the elements that need to be moved when an insert happens. The problem is you can only shift by 0, 4096, 8192, etc bytes on typical systems! So if you insert one 4-byte int
into a vector<int>
how does that help? You could perhaps "crack apart" the underlying storage of the vector
into two parts at the insertion point and track that with the hope of merging them again some point (e.g, if you insert 4096 bytes worth of stuff) - but you end up with a different data structure, with different properties, and the MMU tricks aren't really fundamental here anyway.
That brings us to (b). Take for granted that on my machine a page can be remapped in ~120 ns (via mmap
). That seems fast (it's not bad when you consider it involves taking various kernel locks, messing with the page tables, adding a VMA, etc) — but copying memory is also very fast. On this same box, I can copy in-memory (i.e., to/from RAM of any cache level) at about 12 GB/s, while copies in L1 or L2 proceed at perhaps 80-100 GB/s. So copying a 4K page takes somewhere between 41 ns (cached) and 340 ns (uncached, to RAM). So messing with page tables isn't a clear win even if it would be possible, especially in the cached case (and the cached case is probably the dominant one, averaging over most workloads).
So these types of tricks may make sense, but only in specific scenarios, such as the following:
- You have some way to handle the fact that page mapping can only move/copy/shift things around in chunks of the page granularity, e.g., because your structures happen to be a multiple of page granularity, or you use batch inserts which are a multiple of page granularity, etc.
- You have some way to map pages around more quickly: e.g., using 2MB pages rather than 4K pages, or by writing some kernel-side code that accelerates your use case.
- You want to use even fancier tricks than simply moving memory, e.g,. making the same data appear in two places at once, implementing COW structures, or something like that.
Realloc
The most common and useful example of MMU tricks is probably realloc
. On Linux and Windows (it seems?), realloc
may be implemented by remapping and extending the mapped pages in memory (aka MMU tricks), which both avoids the physical copy and the need to temporarily have both the old allocated region and the new region "live" at once (which may be hard if their sum approaches the size of physical memory).
In particular, recent version of Linux will likely use mremap
to realloc
heap regions that were mmap
ed in the first place (by default, this occurs for allocation requests larger than 128K, but it may also occur when the space available to sbrk
is exhausted).
realloc
- which, for large allocations, uses the MMU to effectively allow the allocated region to be extended without actually copying the underlying elements. – Unfoldmemcpy
routine will reach, and further uses mostly CPU resources rather than the shared (with other cores) DRAM bus. Of course, it has several other downsides. That's with 4K pages - if you use 2 MB pages, you cut the manipulation needed by a factor of 500. – Unfold