How to implement or emulate MADV_ZERO?

I would like to be able to zero out a range of a file memory-mapping without invoking any io (in order to efficiently sequentially overwrite huge files without incurring any disk read io).

Doing std::memset(ptr, 0, length) will cause pages to be read from disk if they are not already in memory, even when entire pages are overwritten, which totally trashes disk performance.

I would like to be able to do something like madvise(ptr, length, MADV_ZERO) which would zero out the range (similar to FALLOC_FL_ZERO_RANGE) in order to cause zero fill page faults instead of regular io page faults when accessing the specified range.

Unfortunately, MADV_ZERO does not exist. The corresponding flag FALLOC_FL_ZERO_RANGE does exist in fallocate and can be used with fwrite to achieve a similar effect, though without instant cross-process coherency.
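For reference, the fallocate route can be sketched roughly as follows. This is only a minimal sketch: the helper name zero_file_range and the pwrite() fallback are assumptions made for illustration, and the range is assumed to lie inside the current file size. It operates on the file descriptor, not on the mapping.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for fallocate() and the FALLOC_FL_* flags */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an existing API): zero [offset, offset + length)
 * of an open file. Tries FALLOC_FL_ZERO_RANGE first; if that call fails
 * (for example because the filesystem does not support it), falls back to
 * writing zeros with pwrite(). Returns 0 on success, -1 on error. */
static int zero_file_range(int fd, off_t offset, off_t length)
{
    assert(offset >= 0 && length > 0);

#ifdef FALLOC_FL_ZERO_RANGE
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
                  offset, length) == 0)
        return 0;
#endif

    /* Fallback: explicit zero writes, one 4 KiB chunk at a time. */
    char zeros[4096];
    memset(zeros, 0, sizeof zeros);
    while (length > 0) {
        size_t chunk = length < (off_t)sizeof zeros ? (size_t)length
                                                    : sizeof zeros;
        ssize_t n = pwrite(fd, zeros, chunk, offset);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, retry the same chunk */
        if (n <= 0)
            return -1;
        offset += n;
        length -= n;
    }
    return 0;
}
```

A call such as zero_file_range(fd, 4096, 4096) would zero the second 4 KiB of the file while leaving the rest untouched.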

One possible alternative, I would guess, is MADV_REMOVE. However, from my understanding it can cause file fragmentation and it also blocks other operations until it completes, which makes me unsure of its long-term performance implications. My experience with Windows is that the similar FSCTL_SET_ZERO_DATA command can incur significant performance spikes when invoked.
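To make the MADV_REMOVE alternative concrete, here is a minimal sketch. The helper name zero_mapped_range and the memset() fallback are assumptions for illustration only. It assumes a writable MAP_SHARED mapping and a page-aligned range; per the madvise(2) man page, later accesses to a range removed this way see zero bytes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* makes sure MADV_REMOVE is visible */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not an existing API): zero a page-aligned range of a
 * writable MAP_SHARED mapping.
 * Returns 1 if the range was dropped with MADV_REMOVE (a hole is punched in
 * the backing store), or 0 if that failed and the range was zeroed with a
 * plain memset(), which may fault the old pages in first. */
static int zero_mapped_range(void *ptr, size_t length)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    assert((uintptr_t)ptr % (uintptr_t)page == 0);
    assert(length % (size_t)page == 0);

#ifdef MADV_REMOVE
    if (madvise(ptr, length, MADV_REMOVE) == 0)
        return 1;
#endif

    memset(ptr, 0, length);
    return 0;
}
```

The return value lets a caller measure how often the slow path is taken. The fragmentation and blocking concerns raised above are not addressed by this sketch.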

My question is how one could implement or emulate MADV_ZERO for shared mappings, preferably in user mode?

1. /dev/zero

I have seen it suggested to simply read /dev/zero into the selected range, though I am not quite sure what "reading into the range" means or how to do it. Is it like an fread from /dev/zero into the memory range? I am not sure how that would avoid a regular page fault on access.
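My best reading of "reading into the range" is a plain read(2) from /dev/zero with the mapped address as the destination buffer. A minimal sketch under that assumption (the helper name zero_by_reading_dev_zero is made up for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an existing API): fill [ptr, ptr + length) with
 * zeros by read()ing from /dev/zero directly into that memory.
 * Returns 0 on success, -1 on error. */
static int zero_by_reading_dev_zero(void *ptr, size_t length)
{
    assert(ptr != NULL);

    int zfd = open("/dev/zero", O_RDONLY);
    if (zfd < 0)
        return -1;

    char *dst = ptr;
    while (length > 0) {
        ssize_t n = read(zfd, dst, length);
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, retry */
        if (n <= 0) {           /* error or unexpected end of file */
            close(zfd);
            return -1;
        }
        dst += n;
        length -= (size_t)n;
    }
    close(zfd);
    return 0;
}
```

The kernel still has to write into the destination pages, so nothing in this sketch by itself avoids the page fault on a file-backed mapping.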

For Linux, simply read /dev/zero into the selected range. The kernel already optimises this case for anonymous mappings.

If doing it in general turns out to be too hard to implement, I
propose MADV_ZERO should have this effect: exactly like reading
/dev/zero into the range, but always efficient.

EDIT: Following the thread a bit further it turns out that it will actually not work.

It does not do tricks when you are dealing with a shared mapping.

2. MADV_REMOVE

One guess at implementing it in Linux itself (i.e. not in the user application, which is what I would prefer) is to copy MADV_REMOVE, i.e. madvise_remove, and modify it to use FALLOC_FL_ZERO_RANGE instead of FALLOC_FL_PUNCH_HOLE. I am a bit in over my head in guessing this though, especially as I don't quite understand what the code around the vfs_fallocate call is doing:

// madvise.c
static long madvise_remove(...)
{
  ...
  /*
   * Filesystem's fallocate may need to take i_mutex.  We need to
   * explicitly grab a reference because the vma (and hence the
   * vma's reference to the file) can go away as soon as we drop
   * mmap_sem.
   */
  get_file(f); // Increment ref count.
  up_read(&current->mm->mmap_sem); // Release a read lock? Why?
  error = vfs_fallocate(f,
            FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, // FALLOC_FL_ZERO_RANGE?
            offset, end - start);
  fput(f); // Decrement ref count.
  down_read(&current->mm->mmap_sem); // Acquire read lock. Why?
  return error;
}
Erepsin answered 31/8, 2015 at 23:56 Comment(9)
Quite possibly the "simply read /dev/zero into the selected range" refers to the following technique seen in shmem_zero_setup() within the Linux kernel.Encaustic
@TheCodeArtist: Not quite sure what to make out of that...Erepsin
Start from do_mmap_pgoff() with flags = MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE and follow the code path to shmem_zero_setup() for a complete picture. For an anonymous mapping, in the absence of a backing file, a zero page is used initially to optimise reads. Of course this is NOT a solution to your problem. It's just an example of a proper implementation that you can refer to if you want to implement the suggestion yourself within the kernel.Encaustic
On the other hand, if you are going to write a fixed pattern of blocks to the disk (i.e. a long sequence of zeroes), why not open the file with O_DIRECT | O_WRONLY, lseek() to the proper offset and simply dump large blocks (multiples of the disk block size) until len bytes are zeroed out? Apparently this works with a couple of alignment and offset restrictions on mmap(). What do you think?...Encaustic
@TheCodeArtist: That would work if I were only writing to the file. But other processes are reading from the memory just a few seconds later, so I need the data in cache. Basically I need to combine IPC with persistence.Erepsin
@TheCodeArtist: Thanks for the code path, though that is a bit too advanced for me right now. I might get back to it later. I just switched to Linux after 10 years of Windows and Visual Studio.Erepsin
IPC with persistence brings back some golden memories of yesteryear. I was playing around with mmap-ed files with the exact same intention. I was using inotify to monitor the file-backed "shared memory". As there was no equivalent to sync() that I could call to trigger an "inotification", I wrote one. :P In my design, the writer thread would update all the relevant fields in the memory-mapped shared file and finally trigger an msync(), upon which the reader thread(s) waiting on inotify would be unblocked. I hope it helps...Encaustic
@TheCodeArtist: That does help. Currently I am using atomic write/reads to "tags" in the mapped memory and then simply poll in the reader. Your solution sounds safer and more efficient. However, it still doesn't solve this problem with the significantly reduced throughput caused by the unnecessary page-faults when overwriting.Erepsin
Glad to know that. However, I cannot claim to be the original source of the idea. IIRC, I managed to fish some half-baked code off the "internets" and ironed it out to fit my use case. Googling "inotify IN_SYNC" should point you to a few LKML discussions on the rationale/implementation.Encaustic

You probably cannot do what you want (in user space, without hacking the kernel). Notice that writing zero pages might not incur physical disk IO because of the page cache.

You might want to replace a file segment with a file hole in a sparse file (though this is not exactly what you want), but some file systems (e.g. VFAT) don't support holes or sparse files. See lseek(2) with SEEK_HOLE and ftruncate(2).
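As a rough sketch of the hole approach (the helper names punch_hole and first_hole_from are made up for illustration, and whether punching succeeds depends on the file system):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for fallocate(), FALLOC_FL_* and SEEK_HOLE */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace [offset, offset + length) with a hole while
 * keeping the file size. Returns 0 on success, -1 if the call fails or the
 * flags are not available at compile time. */
static int punch_hole(int fd, off_t offset, off_t length)
{
    assert(offset >= 0 && length > 0);
#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, length);
#else
    (void)fd;
    return -1;
#endif
}

/* Hypothetical helper: offset of the first hole at or after `start`, or -1
 * on error. Without hole support the only hole reported is the implicit one
 * at end of file. Note that this moves the file offset of fd. */
static off_t first_hole_from(int fd, off_t start)
{
    assert(start >= 0);
#ifdef SEEK_HOLE
    return lseek(fd, start, SEEK_HOLE);
#else
    return lseek(fd, 0, SEEK_END);
#endif
}
```

Reading a punched range back returns zeros, which is the effect being asked for, at the cost of making the file sparse.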

Costar answered 1/9, 2015 at 8:59 Comment(2)
Yes, that is what MADV_REMOVE accomplishes. However, as we both noted, it causes the file to become sparse, and punching the hole also locks the file against all other operations until it completes (which e.g. on Windows is very slow; I haven't tested on Linux yet). That makes me sceptical about its long-term performance implications. The use case I need this for is 24/7 writes without any time for defragmentation.Erepsin
I believe you actually have the same problem with the file/page cache, since the page cache itself uses memory-mapped sections, only on larger boundaries. Not sure though.Erepsin
