I need a way to copy pages from one virtual address range to another without actually copying the data. The ranges are massive and latency is important. mremap can do this, but the problem is that it also deletes the old mapping. Since I need to do this in a multithreaded environment, the old mapping must remain simultaneously usable; I will free it later, once I am certain no other threads can be using it. Is this possible, however hacky, without modifying the kernel? The solution only needs to work with recent Linux kernels.
It is possible, although there are architecture-specific cache consistency issues you may need to consider. Some architectures simply do not allow the same page to be accessed through multiple virtual addresses simultaneously without losing coherency: some architectures will manage this fine, others will not.
Edited to add: AMD64 Architecture Programmer's Manual vol. 2, System Programming, section 7.8.7 Changing Memory Type, states:
A physical page should not have differing cacheability types assigned to it through different virtual mappings; they should be either all of a cacheable type (WB, WT, WP) or all of a non-cacheable type (UC, WC, CD). Otherwise, this may result in a loss of cache coherency, leading to stale data and unpredictable behavior.
Thus, on AMD64, it should be safe to mmap() the same file or shared memory region again, as long as the same prot and flags are used; the kernel will then apply the same cacheable type to each of the mappings.
The first step is to always use a file backing for the memory maps. Use mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0) so that the mappings do not reserve swap. (If you forget this, you will run into swap limits much sooner than you hit actual real-life limits for many workloads.) The extra overhead caused by having a file backing is absolutely negligible.
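A minimal sketch of that first step; the path under /dev/shm, the size, and the terse error handling are just placeholders, not something prescribed above:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Hypothetical backing file on a tmpfs; the path is just an example. */
        const char  *path   = "/dev/shm/myapp-backing";
        const size_t length = 64UL << 20;           /* 64 MiB */

        int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd == -1) { perror("open"); return EXIT_FAILURE; }
        unlink(path);                               /* file lives on only via fd */

        if (ftruncate(fd, (off_t)length) == -1) { perror("ftruncate"); return EXIT_FAILURE; }

        /* MAP_NORESERVE: do not reserve swap for the whole mapping up front. */
        void *map = mmap(NULL, length, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_NORESERVE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

        printf("mapped %zu bytes at %p\n", length, map);

        munmap(map, length);
        close(fd);
        return EXIT_SUCCESS;
    }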
Edited to add: User strcmp pointed out that current kernels do not apply address space randomization to these addresses. Fortunately, this is easy to fix by supplying randomly generated addresses to mmap() instead of NULL. On x86-64, the user address space is 47-bit and the address should be page aligned; you could use e.g. Xorshift* to generate the addresses, then mask out the unwanted bits: & 0x00007FFFFFE00000 would give 2097152-byte-aligned 47-bit addresses, for example.
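For illustration, a sketch of generating such hint addresses with xorshift64* and passing them to mmap(); the helper names are made up, the generator state is not thread-safe as written, and without MAP_FIXED the kernel simply picks another address if the hint happens to collide:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* xorshift64* pseudo-random generator (any non-zero seed will do).
     * Not thread-safe; use per-thread state or a lock in real code. */
    static uint64_t prng_state = 0x9E3779B97F4A7C15ULL;

    static uint64_t xorshift64star(void)
    {
        uint64_t x = prng_state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        prng_state = x;
        return x * 0x2545F4914F6CDD1DULL;
    }

    /* 47-bit user address, aligned to 2 MiB (keep bits 21..46 only). */
    static void *random_hint(void)
    {
        return (void *)(xorshift64star() & 0x00007FFFFFE00000ULL);
    }

    /* Pass the hint instead of NULL; without MAP_FIXED a colliding hint
     * just makes the kernel choose some other address. */
    void *map_randomized(size_t length, int fd)
    {
        return mmap(random_hint(), length, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);
    }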
Because the backing is a file, you can create a second mapping to the same file after enlarging the backing file using ftruncate(). Only after a suitable grace period -- when you know no thread is using the old mapping anymore (perhaps use an atomic counter to keep track of that?) -- do you unmap the original mapping; a sketch of such a counter follows below.
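One possible shape for that grace period, sketched with a C11 atomic reader counter. The struct and function names are invented for illustration, and a real implementation must also ensure no thread can still pick up the old pointer before it is retired:

    #include <stdatomic.h>
    #include <stddef.h>
    #include <sys/mman.h>

    struct region {
        void          *addr;
        size_t         size;
        atomic_size_t  readers;   /* threads currently using this mapping */
    };

    /* Each thread brackets its accesses with these two calls. */
    static inline void region_enter(struct region *r)
    {
        atomic_fetch_add_explicit(&r->readers, 1, memory_order_acquire);
    }

    static inline void region_leave(struct region *r)
    {
        atomic_fetch_sub_explicit(&r->readers, 1, memory_order_release);
    }

    /* After the new mapping has been published, tear the old one down
     * once the last reader has left. */
    static void region_retire(struct region *old)
    {
        while (atomic_load_explicit(&old->readers, memory_order_acquire) != 0)
            ;   /* or sleep/back off instead of spinning */
        munmap(old->addr, old->size);
    }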
In practice, when a mapping needs to be enlarged, you first enlarge the backing file, then try mremap(mapping, oldsize, newsize, 0) to see whether the mapping can be grown in place, without moving it. Only if the in-place remapping fails do you need to switch to a new mapping.
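Putting the two previous paragraphs together, the grow path could be sketched like this (grow_mapping is a made-up helper, and error handling is abbreviated):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Try to grow a file-backed mapping in place; fall back to a second
     * mapping of the same file. Returns the address to use from now on,
     * or MAP_FAILED. In the fallback case the old mapping stays valid and
     * must be munmap()ed later, after the grace period. */
    static void *grow_mapping(int fd, void *old, size_t oldsize, size_t newsize)
    {
        if (ftruncate(fd, (off_t)newsize) == -1)
            return MAP_FAILED;                 /* enlarge the backing file first */

        /* Flags 0: no MREMAP_MAYMOVE, so the mapping may only grow in place. */
        void *p = mremap(old, oldsize, newsize, 0);
        if (p != MAP_FAILED)
            return p;                          /* grown in place, same address */

        /* In-place growth collided with another mapping: map the file again. */
        return mmap(NULL, newsize, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_NORESERVE, fd, 0);
    }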
Edited to add: You definitely do want to use mremap() instead of just using mmap() with MAP_FIXED to create a larger mapping, because mmap() atomically unmaps any existing mappings it overlaps, including those belonging to other files or shared memory regions. With mremap(), you get an error if the enlarged mapping would overlap existing mappings; with mmap() and MAP_FIXED, any existing mappings that the new mapping overlaps are silently unmapped.
Unfortunately, I must admit I have not verified whether the kernel detects collisions between existing mappings, or whether it just assumes the programmer knows about such collisions -- after all, the programmer must know the address and length of every mapping, and therefore should know if a mapping would collide with another existing one. Edited to add: The 3.8 series kernels do detect this, returning MAP_FAILED with errno == ENOMEM if the enlarged mapping would collide with existing maps. I expect all Linux kernels to behave the same way, but have no proof aside from testing on 3.8.0-30-generic on x86_64.
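If you want to check this on your own kernel, a small test like the following should do; it deliberately places a second mapping right behind the first, so the in-place growth has to collide:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);

        /* Reserve two pages as one mapping. */
        char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Replace the second page with a separate mapping; PROT_READ only,
         * so the kernel cannot merge it back into the first VMA. */
        if (mmap(base + page, page, PROT_READ,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
            perror("mmap fixed"); return 1;
        }

        /* Growing the first page in place must now collide. */
        void *p = mremap(base, page, 2 * page, 0);
        if (p == MAP_FAILED)
            printf("mremap failed as expected: %s\n", strerror(errno));
        else
            printf("mremap unexpectedly succeeded\n");
        return 0;
    }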
Also note that in Linux, POSIX shared memory is implemented using a special filesystem, typically a tmpfs mounted at /dev/shm (or /run/shm, with /dev/shm being a symlink). shm_open() et al. are implemented by the C library. Instead of relying on a large POSIX shared memory capability, I would personally use a specially mounted tmpfs for a custom application. If for nothing else, the security controls (which users and groups are able to create new "files" in there) are much easier and clearer to manage.
If the mapping is, and has to be, anonymous, you can still use mremap(mapping, oldsize, newsize, 0) to try to resize it; it just may fail.
Even with hundreds of thousands of mappings, the 64-bit address space is vast, and the failure case rare. So, although you must handle the failure case too, it does not necessarily have to be fast.
Edited to modify: On x86-64, the address space is 47-bit, and mappings must start at a page boundary (12 bits for normal pages, 21 bits for 2M hugepages, and 30 bits for 1G hugepages), so there are only 35, 26, or 17 bits available in the address space for the mappings. So, collisions are more frequent, even if random addresses are suggested. (For 2M mappings, 1024 maps had only an occasional collision, but at 65536 maps the probability of a collision (resize failure) was about 2.3%.)
Edited to add: User strcmp pointed out in a comment that by default Linux mmap() will return consecutive addresses, in which case growing the mapping will always fail unless it is the last one, or unless a map was unmapped just there.
The approach I know works in Linux is complicated and very architecture-specific. You can remap the original mapping read-only, create a new anonymous map, and copy the old contents there. You need a SIGSEGV handler (the SIGSEGV signal being raised for the particular thread that tries to write to the now read-only mapping, this being one of the few recoverable SIGSEGV situations in Linux even if POSIX disagrees) that examines the instruction that caused the problem, simulates it (modifying the contents of the new mapping instead), and then skips the problematic instruction. After a grace period, when there are no more threads accessing the old, now read-only mapping, you can tear that mapping down.
All of the nastiness is in the SIGSEGV handler, of course. Not only must it be able to decode all machine instructions and simulate them (or at least those that write to memory), but it must also busy-wait if the new mapping has not been completely copied yet. It is complicated, absolutely unportable, and very architecture-specific... but possible.
POSIX shared memory (shm_open() et al.) is implemented via /dev/shm or /run/shm, typically a tmpfs filesystem. If I were you, I'd use a dedicated tmpfs for the file "backing", not POSIX shared memory. The mappings use the page cache pages, so there is no duplication anyway. – Quinsy

Regarding mremap(): mmap() with the MAP_FIXED flag will happily overwrite ALL existing maps, even those belonging to other files (or shared memory regions, which are basically the same thing in Linux). So, if you use mmap(addr, newsize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE|MAP_FIXED, fd, 0) to "grow" the region, you won't get an error if the new region would overlap with another mapping; it will just succeed. On the other hand, with mremap() you get an error if the enlarged mapping collides with another mapping. So, you do need to use mremap(). – Quinsy

You can control a tmpfs through its entries in /etc/fstab, via the size attribute for example. For an app-specific tmpfs, you can also set uid, gid, and mode. This way only specific users can utilize the tmpfs at all. When your service runs as a specific user, you can easily control the resources dedicated to it. As a sysadmin, I find such controls very useful, that's all. – Quinsy

mremap without MREMAP_MAYMOVE will almost always fail, because everything is tightly packed and growing upwards will hit the old mappings. It can only succeed when another mapping was unmapped there, fragmenting the virtual address space. The useful thing about mremap is that it can perform moves by copying the page tables rather than copying data, which is much faster. – Trammel

mmap: it randomizes the base. Spreading out mappings over the address space will cause a significant performance hit, along with pathologically fragmenting the address space into ever smaller gaps, which could easily lead to OOM even on 64-bit when a large mapping is requested. – Trammel

I timed a memset() of each mapping after all mappings were established, including the initial page hit, and compared the times for 1024 mappings, 2M each, spread around the 47-bit address space. No large latencies were detected, and the variance in timing was in the 4-8% range. – Quinsy
Quinsy Yes, you can do this.
mremap(old_address, old_size, new_size, flags) deletes the old mapping only up to the size old_size. So if you pass 0 as old_size, it will not unmap anything at all.
Caution: this works as expected only with shared mappings, so such an mremap() should be used on a region previously mapped with MAP_SHARED. That is actually all you need: you don't even need a file-backed mapping, the MAP_SHARED | MAP_ANONYMOUS combination for the mmap() flags works fine. Some very old OSes may not support MAP_SHARED | MAP_ANONYMOUS, but on Linux you are safe.
If you try that on a MAP_PRIVATE region, the result is roughly similar to memcpy(), i.e. no memory alias is created, although it will still use the CoW machinery. It is not clear from your initial question whether you need an alias, or whether a CoW copy is fine too.
UPDATE: for this to work, you also need to specify the MREMAP_MAYMOVE flag obviously.
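A minimal sketch of that call sequence, using MAP_SHARED | MAP_ANONYMOUS as described above:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const size_t size = 1UL << 20;            /* 1 MiB */

        /* The aliasing trick requires a shared mapping. */
        char *orig = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (orig == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(orig, "hello");

        /* old_size == 0: create a second mapping of the same pages,
         * leaving the original mapping intact. MREMAP_MAYMOVE is required. */
        char *alias = mremap(orig, 0, size, MREMAP_MAYMOVE);
        if (alias == MAP_FAILED) { perror("mremap"); return 1; }

        alias[0] = 'H';
        printf("orig:  %s\n", orig);   /* prints "Hello": both map the same pages */
        printf("alias: %s\n", alias);
        return 0;
    }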
… making mremap useless and leaving OP's problem unsolved. Pages are indeed moved from [old_address, old_address + min(old_size, new_size)] to [new_address, new_address + min(old_size, new_size)]. – Publicspirited
This was added in the 5.7 kernel as a new flag to mremap(2), called MREMAP_DONTUNMAP. It leaves the existing mapping in place after moving the page table entries.
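A hedged sketch of using the flag, assuming a 5.7+ kernel (early kernels accept it only for private anonymous mappings; the #ifndef fallback is only there for older libc headers that do not define the constant yet):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MREMAP_DONTUNMAP
    #define MREMAP_DONTUNMAP 4     /* value from newer kernel headers */
    #endif

    int main(void)
    {
        const size_t size = 1UL << 20;

        char *old = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (old == MAP_FAILED) { perror("mmap"); return 1; }
        old[0] = 'x';

        /* Move the page table entries to a new range but keep the old VMA
         * mapped; MREMAP_MAYMOVE is mandatory and old/new size must match. */
        char *moved = mremap(old, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
        if (moved == MAP_FAILED) { perror("mremap"); return 1; }

        printf("old VMA still mapped at %p, pages now at %p\n",
               (void *)old, (void *)moved);
        return 0;
    }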
You can simply mmap the same file again and get a different set of addresses. – Joellyn