Using MMU to implement resizable arrays

Usually, lists are implemented either as linked lists, which are slow to traverse, or as array lists, which are slow when inserting elements.

I was wondering if it would be possible to use the processor's MMU to implement lists more efficiently, by remapping instead of copying memory whenever an element is inserted or deleted. This would mean that both indexing and insertion/deletion anywhere in the array would cost O(1), better than any other list implementation.

My questions are:

  • Are programs actually able to control their own virtual memory, or would changes to the OS need to be made?
  • Is there a limit to the number of page table entries per process? Does memory access get slower with more entries?
  • Is changing page table entries so slow that this would be more efficient only for very large lists?
  • Are there any existing implementations of this type of list? If yes, what stops people from using them more?
Unhinge answered 6/1, 2017 at 3:23 Comment(7)
See this link as a good entry point: msdn.microsoft.com/en-us/library/windows/desktop/… The bottom line is that virtual memory management can be controlled by applications, but the pages being dealt out are not of arbitrary size. So, if you want to give your idea a go, go ahead and try it. But you will waste a lot of (virtual) memory if your list items are smaller than the page size... Anyway, I doubt you will end up with something useful/performant. But kudos for being a creative spirit!Fadil
Most modern CPUs use page sizes of moderate but substantial size; 4 KB and 8 KB are typical. An MMU can only be used to map memory chunks that are an even multiple of the page size. Unless your objects have precise alignment, and their size is an even multiple of the page size, the MMU is useless. And even if you have those ducks lined up in a row: good luck writing your own operating system!Pellegrino
In case you are interested in the latest and greatest state of the art in data structures: look up "persistent data structures". The ideas come from functional programming, but they could just as well be useful in C++/Rust or similar systems code.Fadil
uic.pure.elsevier.com/en/publications/…Mastoid
@SamVarshavchik - even if objects are much smaller than the page size, MMU tricks could be useful in principle - e.g., concatenation of two large buffers, or the circular buffer trick I linked in my answer. The best known example of an MMU trick is actually realloc - which, for large allocations, uses the MMU to effectively allow the allocated region to be extended without actually copying the underlying elements.Unfold
You shouldn't take O() notation too seriously. The difference between $O(1)$ and $O(\log n)$ can be less than the difference in the multiplicative constant for any reasonable $n$. Using the MMU means calling the OS which has a heavy cost.Lavadalavage
@StigHemmer - yes, although on some OSes the cost is actually not too bad, and makes the technique worthy of consideration. On Linux, for example, the cost seems to be a few hundred ns per call, plus about 100 ns per 4K page mapped. So 100 ns to handle a 4K page equates to about 40 GB/s, which is generally quite a bit higher than what a memcpy routine will reach, and further uses mostly CPU resources rather than the shared (with other cores) DRAM bus. Of course, it has several other downsides. That's with 4K pages - if you use 2 MB pages, you cut the manipulation needed by a factor of about 500.Unfold

First some specific answers to your questions:

  • Yes, on many OSes programs have significant control over their virtual memory, e.g., mmap on UNIX-like OSes and similar APIs on Windows (see the sketch after this list). Linux in particular has recently added several methods to allow advanced manipulation of user-visible buffers from the kernel without copying, although one of the interesting ones is not long for this world (at least performance-wise).
  • Usually, there are no specific limits on the number of page table entries per process. Of course, you might run into other limits, such as per-process memory limits, physical memory limits and so on. Access to memory generally does not get slower with more entries. Of course, access to more pages in total may imply slower access (e.g., because you exceed the size of the TLB), but that's not directly a function of having more pages. The page table entries themselves just sit in RAM, so you can have millions of them without an issue.
  • Changing page table entries is reasonably fast on modern OSes. For example, on my laptop, changing page table entries seems to take around ~120 ns per page (plus some fixed overhead for a system call).
  • Yes, you can find examples out there, but they generally target fairly narrow scenarios. In fact, you can see that Mach's libc tries to use MMU tricks for no less important a routine than memcpy!
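
To make the first bullet concrete, here is a minimal sketch (Linux-specific, since mremap is a Linux extension; most error handling elided) of a process remapping its own memory, which is the primitive everything below builds on:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4 * 4096; /* four pages */

        /* Ask the kernel for fresh, zeroed, anonymous pages. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        strcpy(p, "hello");

        /* Double the region: the kernel remaps the existing pages
         * (moving the mapping if it can't grow in place) - no data
         * is copied. */
        char *q = mremap(p, len, 2 * len, MREMAP_MAYMOVE);
        if (q == MAP_FAILED) { perror("mremap"); return 1; }

        printf("%s (mapping %s)\n", q,
               q == p ? "grew in place" : "moved");
        munmap(q, 2 * len);
        return 0;
    }

The contents survive even when the mapping moves because only page table entries changed, not the underlying physical pages.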

Discussion

The core problem with using MMU tricks is that (a) you can only "zero copy" whole pages, which pretty much means at a 4K granularity or larger, along with similarly restrictive alignment (b) even if mmap-type calls are fast, so are efficient memory copy routines!

Let's look first at (a). If I understand you correctly, you want to accelerate insertion into something like std::vector by using MMU tricks to shift the elements that need to be moved when an insert happens. The problem is that you can only shift by 0, 4096, 8192, etc. bytes on typical systems! So if you insert one 4-byte int into a vector<int>, how does that help? You could perhaps "crack apart" the underlying storage of the vector into two parts at the insertion point and track that, with the hope of merging them again at some point (e.g., if you insert 4096 bytes worth of stuff) - but you end up with a different data structure, with different properties, and the MMU tricks aren't really fundamental here anyway.

That brings us to (b). Take it as given that on my machine a page can be remapped in ~120 ns (via mmap). That seems fast (and it's not bad when you consider it involves taking various kernel locks, messing with the page tables, adding a VMA, etc.), but copying memory is also very fast. On this same box, I can copy in-memory (i.e., to/from RAM, outside of any cache level) at about 12 GB/s, while copies in L1 or L2 proceed at perhaps 80-100 GB/s. So copying a 4K page takes somewhere between 41 ns (cached) and 340 ns (uncached, to RAM). So messing with page tables isn't a clear win even when it is possible, especially in the cached case (and the cached case is probably the dominant one, averaged over most workloads).
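
If you want to reproduce that comparison on your own hardware, a rough micro-benchmark along these lines is enough to get ballpark numbers (a sketch only: error handling is elided, the memcpy side measures the hot-cache case, and you'd want to pin the CPU frequency and average many runs for anything rigorous):

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <sys/mman.h>

    #define PAGE  4096
    #define ITERS 100000

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void) {
        static char src[PAGE], dst[PAGE];

        double t0 = now_ns();
        for (int i = 0; i < ITERS; i++) {
            src[i % PAGE] = (char)i; /* defeat dead-store elimination */
            memcpy(dst, src, PAGE);
        }
        double t1 = now_ns();
        for (int i = 0; i < ITERS; i++) {
            void *p = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            munmap(p, PAGE);
        }
        double t2 = now_ns();

        volatile char sink = dst[0]; /* keep the copies observable */
        (void)sink;
        printf("memcpy of a page: ~%.0f ns; mmap+munmap: ~%.0f ns\n",
               (t1 - t0) / ITERS, (t2 - t1) / ITERS);
        return 0;
    }

Note also that the mmap side above never touches the mapped page; actually faulting it in would add further cost to the remapping approach.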

So these types of tricks may make sense, but only in specific scenarios, such as the following:

  • You have some way to handle the fact that page mapping can only move/copy/shift things around in chunks of the page granularity, e.g., because your structures happen to be a multiple of page granularity, or you use batch inserts which are a multiple of page granularity, etc.
  • You have some way to map pages around more quickly: e.g., using 2MB pages rather than 4K pages, or by writing some kernel-side code that accelerates your use case.
  • You want to use even fancier tricks than simply moving memory, e.g., making the same data appear in two places at once, implementing COW structures, or something like that (see the ring buffer sketch after this list).
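
As an illustration of that last bullet, here is the classic "magic ring buffer" (the circular buffer trick mentioned in the comments above), which maps the same physical page at two adjacent virtual addresses so that reads and writes wrap around the end of the buffer for free. This is a sketch, Linux-specific (memfd_create needs kernel 3.17+ and glibc 2.27+) and with error handling elided:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4096; /* one page of actual backing storage */

        /* An anonymous in-memory file provides the shared backing. */
        int fd = memfd_create("ring", 0);
        ftruncate(fd, len);

        /* Reserve 2*len of contiguous address space, then map the
         * same file into both halves: two views, one physical page. */
        char *base = mmap(NULL, 2 * len, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mmap(base, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + len, len, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);

        /* A write that runs off the end of the first view lands at
         * the start of the buffer, and can be read straight across
         * the "seam" with no modular arithmetic. */
        strcpy(base + len - 3, "abcdef");
        printf("across the seam: %.6s\n", base + len - 3); /* abcdef */
        printf("wrapped to front: %.3s\n", base);          /* def */

        munmap(base, 2 * len);
        close(fd);
        return 0;
    }

The same two-views idea is one way to approach COW structures: map a view read-only, catch the fault on write, and remap.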

Realloc

The most common and useful example of MMU tricks is probably realloc. On Linux and Windows (it seems?), realloc may be implemented by remapping and extending the mapped pages in memory (aka MMU tricks), which both avoids the physical copy and the need to temporarily have both the old allocated region and the new region "live" at once (which may be hard if their sum approaches the size of physical memory).

In particular, recent versions of glibc on Linux will likely use mremap to realloc heap regions that were mmaped in the first place (by default, this happens for allocation requests larger than 128K, but it may also occur when the space available to sbrk is exhausted).
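
A sketch of how you can observe this (the 1 MB size is just chosen to exceed glibc's default 128K M_MMAP_THRESHOLD, so the behavior described is glibc-specific; error handling elided):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        /* Above glibc's mmap threshold, malloc services this request
         * with a dedicated mmap rather than the sbrk heap. */
        size_t len = 1 << 20; /* 1 MB */
        char *p = malloc(len);
        memset(p, 'x', len);
        uintptr_t old = (uintptr_t)p; /* save the address before realloc */

        /* glibc implements this growth with mremap: the kernel relinks
         * the existing page table entries; none of the 1 MB is copied. */
        char *q = realloc(p, 2 * len);

        printf("old 0x%lx, new %p, data intact: %d\n",
               (unsigned long)old, (void *)q, q[len - 1] == 'x');
        free(q);
        return 0;
    }

Running it under strace -e trace=mmap,mremap should make the remap visible: a single mremap call where a copying implementation would instead show a fresh mmap followed by a megabyte of copying.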

Unfold answered 6/1, 2017 at 3:38 Comment(4)
Maybe in combination with persistent data structures all that could indeed be a good idea. Not for managing single array items but for managing the chunks... en.wikipedia.org/wiki/Persistent_data_structureFadil
Also, invalidating the TLB could be even slower on a multi-processor machine running a multi-threaded application.Seminal
@Fadil - for sure: the many "logical copy" operations that occur in persistent structures and the need to "trap writes" if you want to do some type of COW are both potentially good fits for MMU help. Of course, the big bucks are made in the details, and there are many hurdles to cross (e.g., trapping writes and mapping them back to a user structure efficiently in a multi-threaded environment, etc.).Unfold
@Non-maskableInterrupt - in principle, yes. Keep in mind that mmap doesn't normally invalidate the TLB, since it is only adding entries to the process' page table, not removing them (and I'm not aware of any modern TLB that stores "negative" entries). On the other hand, munmap or mremap normally does involve some type of TLB flush, including cross-processor shootdowns if applicable. Still, in my tests, I measured munmap to be only about half the cost of mmap, even with the implied TLB misses.Unfold
