How to portably extend a file accessed using mmap()

Asked 28/3, 2013 at 14:43 Answered 18/7, 2018 at 1:34

We're experimenting with changing SQLite, an embedded database system, to use mmap() instead of the usual read() and write() calls to access the database file on disk, using a single large mapping for the entire file. Assume that the file is small enough that we have no trouble finding space for this in virtual memory.

So far so good. In many cases using mmap() seems to be a little faster than read() and write(), and in some cases much faster.

Resizing the mapping in order to commit a write-transaction that extends the database file seems to be a problem. In order to extend the database file, the code could do something like this:

  ftruncate();    // extend the database file on disk 
  munmap();       // unmap the current mapping (it's now too small)
  mmap();         // create a new, larger, mapping

then copy the new data into the end of the new memory mapping. However, the munmap/mmap is undesirable as it means the next time each page of the database file is accessed a minor page fault occurs and the system has to search the OS page cache for the correct frame to associate with the virtual memory address. In other words, it slows down subsequent database reads.

On Linux, we can use the non-standard mremap() system call instead of munmap()/mmap() to resize the mapping. This seems to avoid the minor page faults.
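
For readers who have not used it, the Linux-only path described above might look roughly like the following sketch. The helper names grow_mapping_linux() and demo_grow_mapping() are hypothetical, not part of SQLite, and the sketch assumes a MAP_SHARED mapping that starts at file offset 0.

```c
#define _GNU_SOURCE           /* mremap() is a Linux/glibc extension */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: extend the file to new_len bytes, then resize the
 * shared mapping. MREMAP_MAYMOVE lets the kernel relocate the mapping if it
 * cannot grow in place, so callers must switch to the returned address.
 * Returns the (possibly new) base address, or MAP_FAILED on error. */
static void *grow_mapping_linux(int fd, void *base, size_t old_len, size_t new_len)
{
    assert(new_len >= old_len);
    if (ftruncate(fd, (off_t)new_len) != 0)
        return MAP_FAILED;
    return mremap(base, old_len, new_len, MREMAP_MAYMOVE);
}

/* Self-check on a temporary file: returns 0 if data written before the
 * resize is still visible afterwards and the new tail is writable. */
static int demo_grow_mapping(void)
{
    char path[] = "/tmp/mremap-demo-XXXXXX";
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)page) != 0) { close(fd); return -1; }

    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    memcpy(p, "header", 6);

    char *q = grow_mapping_linux(fd, p, page, 2 * page);
    if (q == MAP_FAILED) { munmap(p, page); close(fd); return -1; }

    q[page] = 'x';                      /* first byte of the new region */
    int ok = memcmp(q, "header", 6) == 0 && q[page] == 'x';

    munmap(q, 2 * page);
    close(fd);
    return ok ? 0 : -1;
}
```

Because the mapping may move, any pointers into the old mapping have to be recomputed from the returned base.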

QUESTION: How should this be dealt with on other systems, like OSX, that do not have mremap()?

We have two ideas at present, and a question regarding each:

1) Create mappings larger than the database file. Then, when extending the database file, simply call ftruncate() to extend the file on disk and continue using the same mapping.

This would be ideal, and seems to work in practice. However, we're worried about this warning in the man page:

"The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified."

QUESTION: Is this something we should be worried about? Or an anachronism at this point?
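
To make idea 1 concrete, here is a minimal sketch of what it amounts to. It deliberately relies on the behaviour the man page calls unspecified, so it only shows what happens on the system it is run on and proves nothing about portability. The function name demo_oversized_mapping() and the temporary file path are made up for illustration.

```c
#define _DEFAULT_SOURCE       /* for mkstemp(), ftruncate(), pread() under strict -std */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Maps four pages over a one-page file, extends the file to two pages with
 * ftruncate(), then stores into the second page through the ORIGINAL mapping.
 * Returns 0 if that byte can be read back with pread(), -1 otherwise. */
static int demo_oversized_mapping(void)
{
    char path[] = "/tmp/oversize-demo-XXXXXX";
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;
    size_t map_len = 4 * page;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)page) != 0) { close(fd); return -1; }

    char *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    /* Only page 0 is backed by the file at this point. Touching p[page]
     * now would raise SIGBUS, because that page lies wholly past EOF. */
    if (ftruncate(fd, (off_t)(2 * page)) != 0) {
        munmap(p, map_len);
        close(fd);
        return -1;
    }

    p[page] = 'x';                 /* the access POSIX leaves unspecified */
    msync(p, 2 * page, MS_SYNC);   /* flush before reading through the fd */

    char c = 0;
    ssize_t n = pread(fd, &c, 1, (off_t)page);

    munmap(p, map_len);
    close(fd);
    return (n == 1 && c == 'x') ? 0 : -1;
}
```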

2) When extending the database file, use the first argument to mmap() to request a mapping corresponding to the new pages of the database file, located immediately after the current mapping in virtual memory, effectively extending the initial mapping. If the system can't honour the request to place the new mapping immediately after the first, fall back to munmap/mmap.

In practice, we've found that OSX is pretty good about positioning mappings in this way, so this trick works there.

QUESTION: if the system does allocate the second mapping immediately following the first in virtual memory, is it then safe to eventually unmap them both using a single big call to munmap()?
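
A rough sketch of idea 2, for reference. The helper extend_mapping_with_hint() and the demo around it are hypothetical names. The address is passed only as a hint (no MAP_FIXED), so the code has to check where the kernel actually placed the new mapping and undo it if it landed elsewhere.

```c
#define _DEFAULT_SOURCE       /* for mkstemp(), ftruncate() under strict -std */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to map add_len bytes of fd, starting at file
 * offset old_len, directly after the existing mapping [base, base+old_len).
 * old_len must be a multiple of the page size.
 * Returns 0 on success. Returns -1 if mmap() failed or placed the mapping
 * elsewhere; in that case nothing new stays mapped and the caller should
 * fall back to munmap()/mmap(). */
static int extend_mapping_with_hint(char *base, size_t old_len, int fd, size_t add_len)
{
    assert(old_len % (size_t)sysconf(_SC_PAGESIZE) == 0);
    char *want = base + old_len;

    /* No MAP_FIXED: an existing mapping at 'want' is never clobbered. */
    char *got = mmap(want, add_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd, (off_t)old_len);
    if (got == MAP_FAILED)
        return -1;
    if (got != want) {
        munmap(got, add_len);
        return -1;
    }
    return 0;
}

/* Returns 1 if the mapping was extended in place, 0 if the fallback would
 * be needed, -1 on an unexpected error. */
static int demo_extend_with_hint(void)
{
    char path[] = "/tmp/hint-demo-XXXXXX";
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)(2 * page)) != 0) { close(fd); return -1; }

    char *base = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { close(fd); return -1; }

    int rc;
    if (extend_mapping_with_hint(base, page, fd, page) == 0) {
        base[page] = 'x';                          /* second page is usable */
        rc = munmap(base, 2 * page) == 0 ? 1 : -1; /* one call, both mappings */
    } else {
        rc = munmap(base, page) == 0 ? 0 : -1;
    }
    close(fd);
    return rc;
}
```

Whether the hint is honoured depends on the platform and on what already occupies the neighbouring addresses, which is why the demo treats both outcomes as valid.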

Subsonic answered 28/3, 2013 at 14:43 Comment(2)

I've been doing exactly the same thing. On Solaris 10 munmap does a synchronous msync if I remember correctly. In fact msync was always synchronous on Solaris 10 even when MS_ASYNC was specified. These were a couple of the last nails in Solaris coffin. – Nagel 28/3, 2013 at 15:27

I don't think #1 is feasible. Creating a mapping larger than the file results in the tail end of the file not being accessible (although it may be "mapped"), and ftruncate() won't update the mapping. – Barque 28/3, 2013 at 18:24

#2 will work, but you don't have to rely on the OS happening to have space available: you can reserve your address space beforehand so that your fixed mappings will always succeed.

For instance, to reserve one gigabyte of address space, call:

mmap(NULL, 1U << 30, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

This reserves one gigabyte of contiguous address space without actually allocating any memory or resources. You can then perform future mappings over this space and they will succeed. So mmap the file at the beginning of the returned range, then mmap further sections of the file as needed, using the MAP_FIXED flag so that each mapping lands at the requested address. The mmaps will succeed because your address space is already allocated and reserved by you.

Note: Linux also has the MAP_NORESERVE flag, which is the behavior you would want for the initial mapping if you were allocating RAM, but in my testing it is ignored, as PROT_NONE is sufficient to say you don't want any resources allocated yet.
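
Put together, the approach could look something like the sketch below. The helper names reserve_address_space(), map_file_chunk() and demo_reserved_growth() are invented for illustration, and the demo reserves only a small range rather than a gigabyte.

```c
#define _DEFAULT_SOURCE       /* for MAP_ANONYMOUS, mkstemp() under strict -std */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve len bytes of contiguous address space with no access rights.
 * Returns NULL on failure. */
static char *reserve_address_space(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Map len bytes of fd, starting at the page-aligned file offset, at the
 * same offset inside the reserved range. MAP_FIXED replaces the PROT_NONE
 * placeholder pages there. Returns 0 on success, -1 on error. */
static int map_file_chunk(char *reserved, int fd, size_t offset, size_t len)
{
    assert(offset % (size_t)sysconf(_SC_PAGESIZE) == 0);
    void *p = mmap(reserved + offset, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, (off_t)offset);
    return p == MAP_FAILED ? -1 : 0;
}

/* Self-check: grow a temporary file from one page to two and map the new
 * page right behind the first one. Returns 0 on success. */
static int demo_reserved_growth(size_t reserve_len)
{
    char path[] = "/tmp/reserve-demo-XXXXXX";
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    char *base = reserve_address_space(reserve_len);
    if (base == NULL) { close(fd); return -1; }

    int ok = ftruncate(fd, (off_t)page) == 0
          && map_file_chunk(base, fd, 0, page) == 0;
    if (ok) {
        base[0] = 'a';
        /* Extend the file, then map only the new page. */
        ok = ftruncate(fd, (off_t)(2 * page)) == 0
          && map_file_chunk(base, fd, page, page) == 0;
    }
    if (ok) {
        base[page] = 'b';
        ok = base[0] == 'a' && base[page] == 'b';
    }

    munmap(base, reserve_len);   /* drops the file chunks and the rest */
    close(fd);
    return ok ? 0 : -1;
}
```

The existing first-page mapping is never touched when the file grows, so its pages stay resident.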

Bandmaster answered 18/7, 2018 at 1:34 Comment(2)

How should we handle the last page in case the file size is not aligned to the page size? In case we need to grow the mapping, should we remap the last page and start from there? – Slowworm 10/9, 2021 at 5:38

You can do that if you want. You would map your whole file in anyway, including the last page and whatever excess space you want. If you map it writable, you can change just part of the page and it should just work. You can call ftruncate to chop off anything extra if the OS rounded your file size up. – Bandmaster 14/9, 2021 at 11:10

Use fallocate() instead of ftruncate() where available. If it is not available, just open the file in O_APPEND mode and grow it by writing some amount of zeroes. This greatly reduces fragmentation.
Use "huge pages" if available; this greatly reduces overhead on big mappings.
pread()/pwrite()/pwritev()/preadv() with a not-so-small block size are not really slow: they run much faster than the I/O can actually be performed.
I/O errors when using mmap() will just raise a signal (SIGBUS, which kills the process much like a segfault) instead of returning EIO or so.
Most of SQLite's WRITE performance problems come down to good transactional use (i.e. you should check when COMMIT is actually performed).
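
As a rough illustration of the first point, here is a sketch that preallocates where it can and otherwise falls back to ftruncate(). Note the swap: it uses posix_fallocate(), the POSIX counterpart, rather than the Linux-specific fallocate(). The helper name extend_file() is made up, and the __linux__ guard is a simplifying assumption about where posix_fallocate() is available.

```c
#define _DEFAULT_SOURCE       /* for posix_fallocate(), mkstemp() under strict -std */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make sure the file is at least new_size bytes long.
 * Intended for growing only; the ftruncate() fallback would also shrink.
 * Returns 0 on success, -1 with errno set on failure. */
static int extend_file(int fd, off_t new_size)
{
    assert(new_size >= 0);
#if defined(__linux__)
    /* posix_fallocate() returns an error number instead of setting errno. */
    int err = posix_fallocate(fd, 0, new_size);
    if (err == 0)
        return 0;
    if (err != EINVAL && err != EOPNOTSUPP) {
        errno = err;
        return -1;
    }
    /* Not supported for this file or size: fall through to ftruncate(). */
#endif
    return ftruncate(fd, new_size);
}

/* Self-check: returns 0 if a fresh temporary file ends up 8192 bytes long. */
static int demo_extend_file(void)
{
    char path[] = "/tmp/extend-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    struct stat st;
    int ok = extend_file(fd, 8192) == 0
          && fstat(fd, &st) == 0
          && st.st_size == 8192;
    close(fd);
    return ok ? 0 : -1;
}
```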

Haggis answered 12/5, 2015 at 4:50 Comment(6)

Using fallocate() defeats delayed allocation, forcing disk seeks and metadata updates to allocate physical blocks for the new file region immediately, rather than allowing allocation to occur when the dirtied pages are later flushed. In fact, using fallocate() can worsen fragmentation if multiple files are being extended concurrently: you'll end up with their blocks interleaved on disk. Generally, you should only use fallocate() to preallocate a large file whose size you know in advance (such as a file to be copied or downloaded). – Droop 29/10, 2016 at 6:47

@Matt Whitlock: your comment is basically wrong on all counts - fallocate does not defeat delayed allocation in any way, does not force disk seeks or metadata updates, and does not do anything more immediately than other forms of I/O. All it does is allocate space in advance, which typically reduces fragmentation, and never increases it unless fallocate is called more often than writing to the map. fallocate also doesn't worsen fragmentation over writes - it's basically always a win. – Murdock 15/10, 2018 at 15:32

@MarcLehmann: Unfortunately you're incorrect on all accounts. Try it if you don't believe me.

rm -f {del,pre}alloc && dd if=/dev/zero of=delalloc bs=16M count=1 && fallocate -l$((16<<20)) prealloc && dd if=/dev/zero of=prealloc bs=16M count=1 conv=notrunc && filefrag -v {del,pre}alloc

You will see that the delalloc file shows unknown_loc,delalloc on its extent whereas the prealloc file immediately has a physical offset on disk. Finding a physical location for the extent requires accessing the free-space B-tree on disk. For fun, do a sync and then run the filefrag command again. ;) – Droop 20/3, 2019 at 20:8

@MarcLehmann: fallocate also does worsen external fragmentation. If you write several files of various sizes and you fallocate each before writing it, then the file system must immediately choose where to place each file individually, and it will often put each into a space that is just big enough to hold it. If, however, you omit the fallocate calls, then the file system is free to allocate and flush all of the files in one big contiguous chunk, which is, of course, beneficial later when you're reading those files as a group. – Droop 20/3, 2019 at 20:18

@MattWhitlock Your test is invalid: you can't compare the fallocate binary with the fallocate syscall. The fallocate binary allocates, then fsyncs and closes the file, forcing an immediate write of metadata and data! Also, dd used to fsync the file at the end but apparently no longer does. Try adding conv=fsync to your first dd and you'll actually see different results - both will be allocated on disk. – Talkfest 15/9, 2022 at 14:4

@MattWhitlock Also, delalloc and fallocate both achieve the same role, so it doesn't matter that delalloc isn't compatible with fallocate; either one serves the same purpose. Delayed allocation lets you write more data before blocks are allocated, allowing the file system to find a suitable extent to allocate the file and prevent fragmentation. fallocate does the same thing, but upfront, so that you have no risk of getting ENOSPC during the write. It will also allow allocating blocks beyond the dirty data high watermark / 30s sync delay. – Talkfest 15/9, 2022 at 14:9

I think #2 is the best currently available solution. In addition to this, on 64-bit systems you may create your mapping explicitly at an address that the OS would never choose for a mapping (for example 0x6000 0000 0000 0000 in Linux) to avoid the case where the OS cannot place the new mapping immediately after the first one.
It is always safe to unmap multiple mappings with a single munmap call. You can even unmap a part of a mapping if you wish to do so.
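
A small sketch of both munmap() claims, using anonymous memory so that it needs no file. The function name demo_munmap_ranges() is hypothetical, and the two mappings are forced to be adjacent by placing them with MAP_FIXED inside a range reserved first.

```c
#define _DEFAULT_SOURCE       /* for MAP_ANONYMOUS under strict -std */
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Creates two adjacent two-page mappings, unmaps the last page of the
 * second one on its own, then removes everything that is left with one
 * munmap() call. Returns 0 on success, -1 on failure. */
static int demo_munmap_ranges(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;

    /* Reserve four pages so the two mappings below are guaranteed adjacent. */
    char *base = mmap(NULL, 4 * page, PROT_NONE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    char *a = mmap(base, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
    char *b = mmap(base + 2 * page, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) {
        munmap(base, 4 * page);
        return -1;
    }
    a[0] = 1;
    b[0] = 2;

    /* Partial unmap: drop only the last page of the second mapping. */
    if (munmap(b + page, page) != 0) {
        munmap(base, 4 * page);
        return -1;
    }
    int ok = a[0] == 1 && b[0] == 2;   /* the remaining pages are intact */

    /* One call covers all of 'a' plus the remaining first page of 'b'. */
    if (munmap(base, 3 * page) != 0)
        return -1;
    return ok ? 0 : -1;
}
```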

Osana answered 23/5, 2013 at 5:6 Comment(1)

Most real-world 64-bit implementations (i.e. actual CPUs) do not support 64-bit address spaces. For example, none of the existing amd64 CPUs support the 0x6000 0000 0000 0000 address. – Murdock 14/8, 2014 at 10:8
