I would like to be able to zero out a range of a file memory-mapping without invoking any I/O (in order to efficiently sequentially overwrite huge files without incurring any disk read I/O).
Doing `std::memset(ptr, 0, length)` will cause pages to be read from disk if they are not already in memory, even when the entire page is overwritten, thus totally trashing disk performance.
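For concreteness, a minimal sketch of the pattern I mean, in plain C (the file name and size are placeholders, error handling omitted):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t length = 1UL << 30;          /* e.g. a 1 GiB range */
    int fd = open("huge.bin", O_RDWR);  /* placeholder path */
    void *ptr = mmap(NULL, length, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    /* Every page touched here takes a read fault that pulls the old
     * contents in from disk first, even though we overwrite all of it. */
    memset(ptr, 0, length);
    munmap(ptr, length);
    close(fd);
    return 0;
}
```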
I would like to be able to do something like `madvise(ptr, length, MADV_ZERO)`, which would zero out the range (similar to `FALLOC_FL_ZERO_RANGE`) in order to cause zero-fill page faults instead of regular I/O page faults when accessing the specified range.
Unfortunately `MADV_ZERO` does not exist, even though the corresponding flag `FALLOC_FL_ZERO_RANGE` does exist in `fallocate` and can be used with `fwrite` to achieve a similar effect, though without instant cross-process coherency.
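A sketch of that `fallocate` route, assuming Linux >= 3.15 and a filesystem that supports the flag (e.g. ext4, XFS); `zero_range` is just a name of mine:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

int zero_range(int fd, off_t offset, off_t length)
{
    /* Zeroes the byte range at the filesystem level without reading
     * it in; but, as said above, existing mappings do not see the
     * zeros coherently the way a MADV_ZERO presumably would. */
    return fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length);
}
```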
One possible alternative, I would guess, is to use `MADV_REMOVE`. However, from my understanding that can cause file fragmentation, and it also blocks other operations while completing, which makes me unsure of its long-term performance implications. My experience with Windows is that the similar `FSCTL_SET_ZERO_DATA` command can incur significant performance spikes when invoked.
My question is how one could implement or emulate `MADV_ZERO` for shared mappings, preferably in user mode?
1. `/dev/zero`
I have read it being suggested to simply read `/dev/zero` into the selected range. Though I am not quite sure what "reading into the range" means and how to do it. Is it like an `fread` from `/dev/zero` into the memory range? And I am not sure how that would avoid a regular page fault on access?
> For Linux, simply read `/dev/zero` into the selected range. The kernel already optimises this case for anonymous mappings. If doing it in general turns out to be too hard to implement, I propose `MADV_ZERO` should have this effect: exactly like reading `/dev/zero` into the range, but always efficient.
EDIT: Following the thread a bit further, it turns out that this will actually not work: "It does not do tricks when you are dealing with a shared mapping."
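For reference, this is what I understand "reading `/dev/zero` into the range" to mean: a plain `read(2)` with the mapping as the destination buffer. Per the thread above it is only optimised for anonymous mappings, so for a `MAP_SHARED` file mapping it would presumably still fault the pages in (error handling omitted):

```c
#include <fcntl.h>
#include <unistd.h>

static void read_zeros(void *ptr, size_t length)
{
    int zfd = open("/dev/zero", O_RDONLY);
    char *p = ptr;
    size_t left = length;
    while (left > 0) {
        /* read(2) fills the destination buffer, i.e. the mapped
         * range, with zeros */
        ssize_t n = read(zfd, p, left);
        if (n <= 0)
            break;
        p += n;
        left -= (size_t)n;
    }
    close(zfd);
}
```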
2. MADV_REMOVE
One guess at implementing it in Linux (i.e. not in the user application, which is what I would prefer) could be to simply copy and modify `MADV_REMOVE`, i.e. `madvise_remove`, to use `FALLOC_FL_ZERO_RANGE` instead of `FALLOC_FL_PUNCH_HOLE`. Though I am a bit over my head in guessing this, especially as I don't quite understand what the code around the `vfs_fallocate` call is doing:
```c
// mm/madvise.c
static long madvise_remove(...)
{
        ...
        /*
         * Filesystem's fallocate may need to take i_mutex. We need to
         * explicitly grab a reference because the vma (and hence the
         * vma's reference to the file) can go away as soon as we drop
         * mmap_sem.
         */
        get_file(f); // Increment ref count.
        up_read(&current->mm->mmap_sem); // Release a read lock? Why?
        error = vfs_fallocate(f,
                        FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, // FALLOC_FL_ZERO_RANGE?
                        offset, end - start);
        fput(f); // Decrement ref count.
        down_read(&current->mm->mmap_sem); // Acquire read lock. Why?
        return error;
}
```
The suggestion to "read `/dev/zero` into the selected range" refers to the technique seen in `shmem_zero_setup()` within the Linux kernel. – Encaustic

Call `do_mmap_pgoff()` with `flags = MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE` and follow the code path till `shmem_zero_setup()` for a complete picture. For an anonymous mapping, in the absence of a file backing, a zero-page is used initially to optimise reads. Of course this is NOT a solution to your problem. It's just an example of a proper implementation that you can refer to if you want to implement the suggestion yourself within the kernel. – Encaustic

You could also open the file with `O_DIRECT | O_WRONLY`, `lseek()` to the proper offset and simply dump large blocks (multiples of the disk block size) until `len` number of bytes are zeroed out. Apparently this works with a couple of alignment and offset restrictions on `mmap()`. What do you think?... – Encaustic

Lacking a `sync()` that I could call to trigger an "inotification", I wrote one. :P In my design, the writer thread updates all the relevant fields in the memory-mapped shared file and finally triggers an `msync()`, upon which the reader thread(s) waiting using inotify would be unblocked. I hope it helps... – Encaustic
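A minimal sketch of what the `O_DIRECT` technique from that comment could look like, assuming the offset and length are already multiples of the block size; `zero_direct` is a hypothetical helper and error handling is omitted:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK (1 << 20) /* assumed multiple of the disk block size */

static void zero_direct(const char *path, off_t offset, off_t len)
{
    int fd = open(path, O_DIRECT | O_WRONLY);
    void *buf;
    posix_memalign(&buf, 4096, BLOCK); /* O_DIRECT needs aligned buffers */
    memset(buf, 0, BLOCK);
    lseek(fd, offset, SEEK_SET);       /* offset assumed block-aligned */
    for (off_t done = 0; done < len; done += BLOCK)
        if (write(fd, buf, BLOCK) != BLOCK) /* bypasses the page cache */
            break;
    free(buf);
    close(fd);
}
```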