Is it true that modern OSes may skip the copy when realloc is called?

While reading https://mcmap.net/q/76169/-is-it-better-to-allocate-memory-in-the-power-of-two I had a question. The Qt authors say in Inside the Qt 4 Containers:

... QVector uses realloc() to grow by increments of 4096 bytes. This makes sense because modern operating systems don't copy the entire data when reallocating a buffer; the physical memory pages are simply reordered, and only the data on the first and last pages need to be copied.

My questions are:

1) Is it true that modern OSes (Linux - the most interesting for me; FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?

2) What is the system call used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that flag was flawed and is a no-op now (?))

3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping requires flushing (invalidating) the changed page table entries from the TLBs of all tens of CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)

Update for Q3: some tests of mremap vs memcpy
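
For reference, a minimal sketch of that kind of comparison (not the linked tests; assumes Linux, and the 64/128 MiB sizes and the crude timing are only illustrative):

/* Hypothetical micro-benchmark sketch: grow a mapping with mremap() vs.
 * allocating a new one and memcpy()ing the old contents.
 * Error handling is omitted for brevity. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define OLD_SZ (64UL * 1024 * 1024)    /* 64 MiB */
#define NEW_SZ (128UL * 1024 * 1024)   /* grow to 128 MiB */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    /* mremap() path: the kernel moves page mappings, no byte-by-byte copy. */
    char *a = mmap(NULL, OLD_SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(a, 1, OLD_SZ);              /* touch the pages so they are committed */
    double t0 = now();
    a = mremap(a, OLD_SZ, NEW_SZ, MREMAP_MAYMOVE);
    double t1 = now();

    /* copy path: a second mapping plus a full memcpy() of the old data. */
    char *b = mmap(NULL, OLD_SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(b, 1, OLD_SZ);
    double t2 = now();
    char *c = mmap(NULL, NEW_SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(c, b, OLD_SZ);
    double t3 = now();

    printf("mremap: %.3f ms, mmap+memcpy: %.3f ms\n",
           (t1 - t0) * 1e3, (t3 - t2) * 1e3);
    return 0;
}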

Innovate answered 27/5, 2013 at 1:46 Comment(4)
realloc() is implemented in the C library. On Linux, libc will typically be the eglibc/glibc version of Doug Lea's malloc. This is a binning allocator, and it uses the special HAVE_MREMAP, which is defined by default for Linux. splice() is a completely different concept. A TLB entry invalidate is usually 4 bytes. So unless the realloc is 1024*4k/10 cores, or ~512KB, the mremap() is better. It is still probably better, as a copy will blow up the d-cache as well.Tarantella
"TLB invalidate is usually 4-bytes." - is it a typo? TLB invalidate is IPI and writing to CR3 to reset all TLB lines.Innovate
artless noise, the size of a PTE entry is small; but we must not only update the page table in memory, we must also update the TLB entry. Usually there is no direct access to individual TLB lines, so a full TLB flush is still needed. If I am asking for realloc, I have touched the memory.Innovate
Not on ARM; there you can invalidate individual TLB entries. But as noted, this is the worst case. mremap() may just expand the virtual range to map an additional physical page (from a random address in the free pool). If the realloc() memory is sparse, then half or more of the pages may not even have been touched, and many virtual pages may map to the zero page. Copying would increase the memory used for this sparse use case.Tarantella
  1. Is it true that modern OSes (Linux - the most interesting for me; FreeBSD, OSX, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?
  2. What is the system call used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that flag was flawed and is a no-op now (?))

See thejh's answer.

Who are the actors?

You have at least three actors with your Qt example.

  1. Qt Vector class
  2. glibc's realloc()
  3. Linux's mremap

QVector::capacity() shows that Qt allocates more elements than required. This means that a typical addition of an element will not realloc() anything. The glibc allocator is based on Doug Lea's allocator. This is a binning allocator which supports the use of Linux's mremap. A binning allocator groups similar-sized allocations in bins, so a typical random-sized allocation will still have some room to grow without needing to call into the system. I.e., the free pool or slack is located at the end of the allocated memory.
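
As a rough sketch (assuming a glibc where allocations above the mmap threshold are mmap()-backed and can therefore be grown with mremap(); the mallopt() threshold and sizes below are arbitrary), you can watch which path is taken by running something like this under strace -e trace=mmap,mremap:

/* Sketch: make allocations large enough to be mmap()-backed so that glibc's
 * realloc() can use mremap() instead of copying.  Behaviour depends on the
 * glibc version and its thresholds. */
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    mallopt(M_MMAP_THRESHOLD, 128 * 1024); /* allocations above this use mmap() */

    size_t size = 1 << 20;                 /* 1 MiB, well above the threshold */
    char *p = malloc(size);
    if (!p)
        return 1;
    memset(p, 0xAA, size);                 /* commit the pages */

    for (int i = 0; i < 8; i++) {
        size *= 2;                         /* glibc may grow the chunk with mremap() */
        p = realloc(p, size);
        if (!p)
            return 1;
    }
    free(p);
    return 0;
}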

An answer

  3. Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping requires flushing (invalidating) the changed page table entries from the TLBs of all tens of CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)

First off, the linked test (faster ... than mremap) is mis-using mremap(), as R. notes there.

There are several things that make mremap() valuable as a primitive for realloc().

  1. Reduced memory consumption.
  2. Preserving page mappings.
  3. Avoiding moving data.

Everything in this answer is based upon Linux's implementation, but the semantics can be transferred to other OSes.

Reduce memory consumption

Consider a naive realloc().

void *realloc(void *ptr, size_t size)
{
    size_t old_size = get_sz(ptr);    /* from bin, address, map table, etc. (allocator internal) */
    if (size <= old_size) {
        resize(ptr);                  /* shrink in place (allocator internal) */
        return ptr;
    }
    void *new_p = malloc(size);
    if (new_p) {
        memcpy(new_p, ptr, old_size); /* both the old and the new allocation are fully committed here */
        free(ptr);
    }
    return new_p;
}

In order to support this, you may need double the memory of the realloc() size before the system resorts to swap, or the reallocation may simply fail.

Preserve page mappings

Linux will by default map new allocations to the zero page: a 4k page full of zero data. This is useful for sparsely populated data structures. If no one writes to a data page, then no physical memory is allocated besides a possible PTE table. These mappings are copy-on-write, or COW. With the naive realloc(), these mappings are not preserved, and full physical memory is allocated for all of the zero pages.
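
A Linux-specific sketch of that difference (the 256 MiB size is arbitrary and error handling is omitted): a mostly untouched mapping stays sparse when grown with mremap(), but copying it commits physical memory for every destination page.

/* Sketch: compare resident set size (RSS) when a sparse mapping is grown
 * with mremap() versus copied with memcpy(). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static long rss_pages(void)
{
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");
    fscanf(f, "%ld %ld", &size, &resident);
    fclose(f);
    return resident;
}

int main(void)
{
    size_t len = 256UL * 1024 * 1024;          /* 256 MiB, mostly untouched */
    char *sparse = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    sparse[0] = 1;                             /* touch a single page */
    printf("RSS after sparse alloc: %ld pages\n", rss_pages());

    /* mremap keeps the zero-page mappings of the untouched region ... */
    sparse = mremap(sparse, len, 2 * len, MREMAP_MAYMOVE);
    printf("RSS after mremap grow:  %ld pages\n", rss_pages());

    /* ... while a naive copy commits physical memory for every written page. */
    char *copy = mmap(NULL, 2 * len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(copy, sparse, len);
    printf("RSS after memcpy copy:  %ld pages\n", rss_pages());
    return 0;
}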

If the task is involved in a fork(), the initial realloc() data may be shared between parent and child. Again, COW will cause physical allocation of pages only when they are written. The naive implementation will disregard this and require separate physical memory per process.

If the system is under memory pressure, the existing realloc() pages may not be in physical memory but in swap. The naive realloc will cause disk reads to bring the swapped pages into memory, copy them to the new location, and then likely write the data back out to disk.

Avoid moving data

The issue you raise of updating TLBs is minimal compared to moving the data. A single TLB entry is typically 4 bytes and represents a page (4K) of physical data. Even if you flush the entire TLB for a 4GB system, that is only 4MB of page table data that needs to be restored. Copying large amounts of data will blow out the L1 and L2 caches. TLB fetches naturally pipeline better than d-cache and i-cache fetches. It is rare that code will take two TLB misses in a row, as most access is sequential.

CPU caches come in two variants: VIVT (common on non-x86) and VIPT (as on x86). The VIVT versions usually have mechanisms to invalidate single TLB entries. For a VIPT system, the caches should not need to be invalidated, as they are physically tagged.

On a multi-core system, it is atypical to run one process on all cores. Only cores running the process that performs the mremap() need page table updates. As a process is migrated to a core (a typical context switch), the page table needs to be migrated anyway.

Conclusion

You can construct some pathological cases where a naive copy will work better. As Linux (and most OSes) are multi-tasking, more than one process will be running. Also, the worst case is when swapping, and the naive implementation will always be worse there (unless you have a disk faster than memory). For small realloc() sizes, dlmalloc or QVector should have fallow space available to avoid a system-level mremap(). A typical mremap() may just expand the virtual address range by growing the region with a random page from the free pool. It is only when the virtual address range must move that mremap() may need a TLB flush, and only with all of the following being true:

  1. The realloc() memory should not be shared with a parent or child process.
  2. The memory should not be sparse (mostly zero or untouched).
  3. The system should not be under memory pressure using swap.

The TLB flush and IPI need to take place only if the same process is current on other cores. L1-cache loading is not needed for the mremap(), but it is for the naive version. The L2 is usually shared between cores and will be current in all cases. The naive version will force the L2 to reload. The mremap() might leave some unused data out of the L2 cache; this is normally a good thing, but could be a disadvantage under some workloads. There are probably better ways to handle this, such as pre-fetching data.

Tarantella answered 27/5, 2013 at 18:32 Comment(7)
I still don't understand your 4 bytes per TLB entry. This is an inter-core interrupt from the core where mremap was started to all other cores, asking them to partially flush their TLBs. The TLB is not the PTE; the TLB is a cache of PTEs. And this cache will be out of coherency if we change a PTE in memory but don't flush the TLB's copy of that PTE.Innovate
And really your answer for Q1 is "true" -- there are some systems capable of doing so. Thanks!Innovate
A single TLB entry maps one Linux page. A Linux page is 4k of memory. Each TLB entry is like a pointer; it is only 4 bytes on average. So even if you flush the entire TLB cache on a full 4GB system, the TLB entries amount to only 4MB ((4GB/4k)*4). Each TLB entry is a cached PTE; for a full 4GB system, there are 4MB worth of PTEs. The TLB is an MMU cache.Tarantella
And actually, the entire TLB cache is usually limited to 1k to 16k in size; it doesn't usually need to be as big as the i-cache and d-cache, as a single entry covers much more address space. It is a performance hit to reload it, but not as expensive as the other caches; and that is the worst case for mremap().Tarantella
jemalloc has also used mremap, but it was removed as of version 4.0 (2015), so apparently things have changed on modern systems. It is listed as being problematic for coherency with multi-threaded applications. However, I didn't find specifics.Tarantella
@artlessnoise I don't know why it should have been removed, because it is supported by the kernel. On even more modern Linux kernels, mremap() on PMD_SIZE=2MB-aligned data can give a huge speedup - see lwn.net/Articles/833208 (and even better with PUD_SIZE=1GB, but I guess this is specific to some tasks, e.g. a VM resize, not a normal malloc). This article gives some hints about the internals of TLB mapping.Gwendolyngweneth
@ArnaudBouchez Yes, that benchmark shows improvement for some use cases. However, I think jemalloc is balancing in-use versus 'to be used' memory, and I think it definitely depends on the cache type (VIVT, PIPT, etc.). It was interesting to note in the context of this question, though.Tarantella

For Linux, see man 2 mremap:

   void *mremap(void *old_address, size_t old_size,
                size_t new_size, int flags, ... /* void *new_address */);

mremap() expands (or shrinks) an existing memory mapping, potentially moving it at the same time (controlled by the flags argument and the available virtual address space).
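
A minimal usage sketch (Linux-specific; _GNU_SOURCE is needed for MREMAP_MAYMOVE, and the sizes are arbitrary):

/* Sketch: grow an anonymous mapping in place if possible, or let the kernel
 * move it; either way the data is preserved without a byte-by-byte copy. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 42;
    /* New size 1 MiB; MREMAP_MAYMOVE permits a new virtual address. */
    char *q = mremap(p, 4096, 1 << 20, MREMAP_MAYMOVE);
    if (q == MAP_FAILED)
        return 1;
    printf("data preserved: %d, moved: %s\n", q[0], q == p ? "no" : "yes");
    return 0;
}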

Maladminister answered 27/5, 2013 at 1:52 Comment(2)
And is it used by realloc? Also, some people say that there is no mremap on OSX: #3521803Innovate
And according to ptgmedia.pearsoncmg.com/images/0131453483/samplechapter/… chapter 3, page 44, table 3.2, mremap does call flush_tlb_range.Innovate
