I need a way to copy pages from one virtual address range to another without actually copying the data. The ranges are massive and latency is important. mremap can do this, but the problem is that it also deletes the old mapping. Since I need to do this in a multithreaded environment, the old mapping must remain simultaneously usable; I will free it later, once I am certain no other threads can be using it. Is this possible, however hacky, without modifying the kernel? The solution only needs to work with recent Linux kernels.
It is possible, although there are architecture-specific cache consistency issues you may need to consider. Some architectures simply do not allow the same page to be accessed through multiple virtual addresses simultaneously without losing coherency: some architectures will manage this fine, others will not.
Edited to add: AMD64 Architecture Programmer's Manual vol. 2, System Programming, section 7.8.7 Changing Memory Type, states:
A physical page should not have differing cacheability types assigned to it through different virtual mappings; they should be either all of a cacheable type (WB, WT, WP) or all of a non-cacheable type (UC, WC, CD). Otherwise, this may result in a loss of cache coherency, leading to stale data and unpredictable behavior.
Thus, on AMD64, it should be safe to mmap() the same file or shared memory region again, as long as the same prot and flags are used; the kernel will then apply the same cacheable type to each of the mappings.
The first step is to always use a file backing for the memory maps. Use mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0) so that the mappings do not reserve swap. (If you forget this, you will run into swap limits much sooner than you hit actual real-life limits for many workloads.) The extra overhead caused by having a file backing is absolutely negligible.
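A minimal sketch of that first step; the path under /dev/shm, the size, and the terse error handling are just placeholders, not something prescribed above:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Hypothetical backing file on a tmpfs; the path is just an example. */
        const char  *path   = "/dev/shm/myapp-backing";
        const size_t length = 64UL << 20;           /* 64 MiB */

        int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd == -1) { perror("open"); return EXIT_FAILURE; }
        unlink(path);                               /* file lives on only via fd */

        if (ftruncate(fd, (off_t)length) == -1) { perror("ftruncate"); return EXIT_FAILURE; }

        /* MAP_NORESERVE: do not reserve swap for the whole mapping up front. */
        void *map = mmap(NULL, length, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_NORESERVE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

        printf("mapped %zu bytes at %p\n", length, map);

        munmap(map, length);
        close(fd);
        return EXIT_SUCCESS;
    }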
Edited to add: User strcmp pointed out that current kernels do not apply address space randomization to these addresses. Fortunately, this is easy to fix by supplying randomly generated addresses to mmap() instead of NULL. On x86-64, the user address space is 47-bit and the address should be page aligned; you could use e.g. Xorshift* to generate the addresses, then mask out the unwanted bits: & 0x00007FFFFFE00000 would give 2097152-byte-aligned 47-bit addresses, for example.
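For illustration, a sketch of generating such hint addresses with xorshift64* and passing them to mmap(); the helper names are made up, the generator state is not thread-safe as written, and without MAP_FIXED the kernel simply picks another address if the hint happens to collide:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* xorshift64* pseudo-random generator (any non-zero seed will do).
     * Not thread-safe; use per-thread state or a lock in real code. */
    static uint64_t prng_state = 0x9E3779B97F4A7C15ULL;

    static uint64_t xorshift64star(void)
    {
        uint64_t x = prng_state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        prng_state = x;
        return x * 0x2545F4914F6CDD1DULL;
    }

    /* 47-bit user address, aligned to 2 MiB (keep bits 21..46 only). */
    static void *random_hint(void)
    {
        return (void *)(xorshift64star() & 0x00007FFFFFE00000ULL);
    }

    /* Pass the hint instead of NULL; without MAP_FIXED a colliding hint
     * just makes the kernel choose some other address. */
    void *map_randomized(size_t length, int fd)
    {
        return mmap(random_hint(), length, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);
    }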
Because the backing is a file, you can create a second mapping to the same file after enlarging the backing file using ftruncate(). Only after a suitable grace period -- when you know no thread is using the old mapping anymore (perhaps use an atomic counter to keep track of that?) -- do you unmap the original mapping; a sketch of such a counter follows below.
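One possible shape for that grace period, sketched with a C11 atomic reader counter. The struct and function names are invented for illustration, and a real implementation must also ensure no thread can still pick up the old pointer before it is retired:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <sys/mman.h>

    struct region {
        void          *addr;
        size_t         size;
        atomic_size_t  readers;   /* threads currently using this mapping */
    };

    /* Each thread brackets its accesses with these two calls. */
    static inline void region_enter(struct region *r)
    {
        atomic_fetch_add_explicit(&r->readers, 1, memory_order_acquire);
    }

    static inline void region_leave(struct region *r)
    {
        atomic_fetch_sub_explicit(&r->readers, 1, memory_order_release);
    }

    /* After the new mapping has been published, tear the old one down
     * once the last reader has left. */
    static void region_retire(struct region *old)
    {
        while (atomic_load_explicit(&old->readers, memory_order_acquire) != 0)
            ;   /* or sleep/back off instead of spinning */
        munmap(old->addr, old->size);
    }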
In practice, when a mapping needs to be enlarged, you first enlarge the backing file, then try mremap(mapping, oldsize, newsize, 0) to see whether the mapping can be grown in place, without moving it. Only if the in-place remapping fails do you need to switch to a new mapping.
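Putting the two previous paragraphs together, the grow path could be sketched like this (grow_mapping is a made-up helper, and error handling is abbreviated):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Try to grow a file-backed mapping in place; fall back to a second
     * mapping of the same file. Returns the address to use from now on,
     * or MAP_FAILED. In the fallback case the old mapping stays valid and
     * must be munmap()ed later, after the grace period. */
    static void *grow_mapping(int fd, void *old, size_t oldsize, size_t newsize)
    {
        if (ftruncate(fd, (off_t)newsize) == -1)
            return MAP_FAILED;                 /* enlarge the backing file first */

        /* Flags 0: no MREMAP_MAYMOVE, so the mapping may only grow in place. */
        void *p = mremap(old, oldsize, newsize, 0);
        if (p != MAP_FAILED)
            return p;                          /* grown in place, same address */

        /* In-place growth collided with another mapping: map the file again. */
        return mmap(NULL, newsize, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);
    }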
Edited to add: You definitely do want to use mremap() instead of just using mmap() with MAP_FIXED to create a larger mapping, because mmap() atomically unmaps any existing mappings it overlaps, including those belonging to other files or shared memory regions. With mremap(), you get an error if the enlarged mapping would overlap existing mappings; with mmap() and MAP_FIXED, any existing mappings that the new mapping overlaps are silently unmapped.
Unfortunately, I must admit I have not verified whether the kernel detects collisions between existing mappings, or whether it just assumes the programmer knows about such collisions -- after all, the programmer must know the address and length of every mapping, and therefore should know if a mapping would collide with another existing one. Edited to add: The 3.8 series kernels do detect this, returning MAP_FAILED with errno == ENOMEM if the enlarged mapping would collide with existing maps. I expect all Linux kernels to behave the same way, but have no proof aside from testing on 3.8.0-30-generic on x86_64.
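If you want to check this on your own kernel, a small test like the following should do; it deliberately places a second mapping right behind the first, so the in-place growth has to collide:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);

        /* Reserve two pages as one mapping. */
        char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Replace the second page with a separate mapping; PROT_READ only,
         * so the kernel cannot merge it back into the first VMA. */
        if (mmap(base + page, page, PROT_READ,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
            perror("mmap fixed"); return 1;
        }

        /* Growing the first page in place must now collide. */
        void *p = mremap(base, page, 2 * page, 0);
        if (p == MAP_FAILED)
            printf("mremap failed as expected: %s\n", strerror(errno));
        else
            printf("mremap unexpectedly succeeded\n");
        return 0;
    }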
Also note that in Linux, POSIX shared memory is implemented using a special filesystem, typically a tmpfs mounted at /dev/shm (or /run/shm, with /dev/shm being a symlink). shm_open() et al. are implemented by the C library. Instead of relying on a large POSIX shared memory capability, I would personally use a specially mounted tmpfs for a custom application. If for nothing else, the security controls (which users and groups are able to create new "files" in there) are much easier and clearer to manage.
If the mapping is, and has to be, anonymous, you can still use mremap(mapping, oldsize, newsize, 0) to try to resize it; it just may fail.
Even with hundreds of thousands of mappings, the 64-bit address space is vast, and the failure case rare. So, although you must handle the failure case too, it does not necessarily have to be fast.
Edited to modify: On x86-64, the address space is 47-bit, and mappings must start at a page boundary (12 bits for normal pages, 21 bits for 2M hugepages, and 30 bits for 1G hugepages), so there are only 35, 26, or 17 bits available in the address space for the mappings. So, collisions are more frequent, even if random addresses are suggested. (For 2M mappings, 1024 maps had only an occasional collision, but at 65536 maps the probability of a collision (resize failure) was about 2.3%.)
Edited to add: User strcmp pointed out in a comment that by default Linux mmap() will return consecutive addresses, in which case growing the mapping will always fail unless it is the last one, or unless a map was unmapped just there.
The approach I know works in Linux is complicated and very architecture-specific. You can remap the original mapping read-only, create a new anonymous map, and copy the old contents there. You need a SIGSEGV handler (the SIGSEGV signal being raised for the particular thread that tries to write to the now read-only mapping, this being one of the few recoverable SIGSEGV situations in Linux even if POSIX disagrees) that examines the instruction that caused the problem, simulates it (modifying the contents of the new mapping instead), and then skips the problematic instruction. After a grace period, when there are no more threads accessing the old, now read-only mapping, you can tear that mapping down.
All of the nastiness is in the SIGSEGV handler, of course. Not only must it be able to decode all machine instructions and simulate them (or at least those that write to memory), but it must also busy-wait if the new mapping has not been completely copied yet. It is complicated, absolutely unportable, and very architecture-specific... but possible.
POSIX shared memory (shm_open() et al.) is implemented via /dev/shm or /run/shm, typically a tmpfs filesystem. If I were you, I'd use a dedicated tmpfs for the file "backing", not POSIX shared memory. The mappings use the page cache pages, so there is no duplication anyway. – Quinsy

Regarding mremap(): mmap() with the MAP_FIXED flag will happily overwrite ALL existing maps, even those belonging to other files (or shared memory regions, which are basically the same thing in Linux). So, if you use mmap(addr, newsize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE|MAP_FIXED, fd, 0) to "grow" the region, you won't get an error if the new region would overlap with another mapping; it will just succeed. On the other hand, with mremap() you get an error if the enlarged mapping collides with another mapping. So, you do need to use mremap(). – Quinsy

You can control a tmpfs through its entries in /etc/fstab, via the size attribute for example. For an app-specific tmpfs, you can also set uid, gid, and mode. This way only specific users can utilize the tmpfs at all. When your service runs as a specific user, you can easily control the resources dedicated to it. As a sysadmin, I find such controls very useful, that's all. – Quinsy

mremap without MREMAP_MAYMOVE will almost always fail, because everything is tightly packed and growing upwards will hit the old mappings. It can only succeed when another mapping was unmapped there, fragmenting the virtual address space. The useful thing about mremap is that it can perform moves by copying the page tables rather than copying data, which is much faster. – Trammel

mmap: it randomizes the base. Spreading out mappings over the address space will cause a significant performance hit, along with pathologically fragmenting the address space into ever smaller gaps, which could easily lead to OOM even on 64-bit when a large mapping is requested. – Trammel

I timed a memset() of each mapping after all mappings were established, including the initial page hit, and compared the times for 1024 mappings, 2M each, spread around the 47-bit address space. No large latencies were detected, and the variance in timing was in the 4-8% range. – Quinsy
Quinsy Yes, you can do this.
mremap(old_address, old_size, new_size, flags) deletes the old mapping only up to the size old_size. So if you pass 0 as old_size, it will not unmap anything at all.
Caution: this works as expected only with shared mappings, so such an mremap() should be used on a region previously mapped with MAP_SHARED. That is actually all you need: you don't even need a file-backed mapping, the MAP_SHARED | MAP_ANONYMOUS combination for the mmap() flags works fine. Some very old OSes may not support MAP_SHARED | MAP_ANONYMOUS, but on Linux you are safe.
If you try that on a MAP_PRIVATE region, the result is roughly similar to memcpy(), i.e. no memory alias is created, although it will still use the CoW machinery. It is not clear from your initial question whether you need an alias, or whether a CoW copy is fine too.
UPDATE: for this to work, you also need to specify the MREMAP_MAYMOVE flag obviously.
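A minimal sketch of that call sequence, using MAP_SHARED | MAP_ANONYMOUS as described above:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const size_t size = 1UL << 20;            /* 1 MiB */

        /* The aliasing trick requires a shared mapping. */
        char *orig = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (orig == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(orig, "hello");

        /* old_size == 0: create a second mapping of the same pages,
         * leaving the original mapping intact. MREMAP_MAYMOVE is required. */
        char *alias = mremap(orig, 0, size, MREMAP_MAYMOVE);
        if (alias == MAP_FAILED) { perror("mremap"); return 1; }

        alias[0] = 'H';
        printf("orig:  %s\n", orig);   /* prints "Hello": both map the same pages */
        printf("alias: %s\n", alias);
        return 0;
    }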
… making mremap useless and leaving OP's problem unsolved. Pages are indeed moved from [old_address, old_address + min(old_size, new_size)] to [new_address, new_address + min(old_size, new_size)]. – Publicspirited
This was added in the 5.7 kernel as a new flag to mremap(2), called MREMAP_DONTUNMAP. It leaves the existing mapping in place after moving the page table entries.
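A hedged sketch of using the flag, assuming a 5.7+ kernel (early kernels accept it only for private anonymous mappings; the #ifndef fallback is only there for older libc headers that do not define the constant yet):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MREMAP_DONTUNMAP
    #define MREMAP_DONTUNMAP 4     /* value from newer kernel headers */
    #endif

    int main(void)
    {
        const size_t size = 1UL << 20;

        char *old = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (old == MAP_FAILED) { perror("mmap"); return 1; }
        old[0] = 'x';

        /* Move the page table entries to a new range but keep the old VMA
         * mapped; MREMAP_MAYMOVE is mandatory and old/new size must match. */
        char *moved = mremap(old, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
        if (moved == MAP_FAILED) { perror("mremap"); return 1; }

        printf("old VMA still mapped at %p, pages now at %p\n",
               (void *)old, (void *)moved);
        return 0;
    }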
You can simply mmap the same file again and get a different set of addresses. – Joellyn