How to prevent changes to the underlying file after mmap()-ing a file from being visible to my program?

Asked 7/6, 2019 at 17:42 Answered 11/1, 2024 at 21:12

According to the mmap() man page:

MAP_PRIVATE

Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.

Question: How to prevent changes to the underlying file after mmap()-ing a file from being visible to my program?

Background: I am designing a data structure for a text editor that allows editing huge text files efficiently. The data structure is akin to an on-disk rope, but with the actual strings being pointers to mmap()-ed ranges of the original file.

Since the file could be very large, there are a few restrictions around the design:

Must not load the entire file into RAM as the file may be larger than available physical RAM
Must not copy files on opening as this will make opening new files really slow
Must work on filesystems like ext4 that do not support copy-on-write (cp --reflink/ioctl_ficlone)
Must not rely on mandatory file locking, as this is deprecated and requires the specific mount option -o mand on the filesystem
As long as the changes aren't visible in my mmap(), it's OK for the underlying file to change on the filesystem
Only needs to support recent Linux, and using Linux-specific system APIs is OK

The data structure I'm designing would keep track of a list of unedited and edited ranges in the file by storing the start and end index of each range in the mmap()-ed buffer. While the user is browsing through the file, ranges of text that have never been modified by the user would be read directly from a mmap() of the original file, while a swap file would store the ranges of text that have been edited by the user but not yet saved.

When the user saves a file, the data structure would use copy_file_range to splice the swap file and the original file to assemble the new file. For this splicing to work, the original file as seen by my program must remain unchanged throughout the entire editing session.
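A minimal sketch of that splicing step, assuming a hypothetical `struct piece` that records which descriptor each run of bytes comes from. The names `piece`, `assemble_file`, `demo_assemble`, and the paths under /tmp are made up for illustration; only `copy_file_range` itself is from the design above.

```c
#define _GNU_SOURCE            /* for copy_file_range() */
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t on every architecture */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor for one contiguous run of bytes, taken either
 * from the original file or from the swap file. */
struct piece {
    int    src_fd;  /* descriptor of the original file or the swap file */
    off_t  offset;  /* where the run starts inside src_fd */
    size_t length;  /* number of bytes in the run */
};

/* Append every piece, in order, at out_fd's current file position.
 * Returns 0 on success and -1 on failure with errno set. */
int assemble_file(int out_fd, const struct piece *pieces, size_t count)
{
    assert(pieces != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        off_t  in_off = pieces[i].offset;
        size_t left   = pieces[i].length;
        while (left > 0) {
            /* A NULL output offset means: write at, and advance, out_fd's position. */
            ssize_t n = copy_file_range(pieces[i].src_fd, &in_off,
                                        out_fd, NULL, left, 0);
            if (n < 0 && errno == EINTR)
                continue;
            if (n < 0)
                return -1;
            if (n == 0) {          /* the source ended before the run did */
                errno = EIO;
                return -1;
            }
            left -= (size_t)n;
        }
    }
    return 0;
}

/* Self-check with made-up paths: splice "hello world" and "EDIT" into
 * "hello EDITd". Returns 0 if the assembled file matches. */
int demo_assemble(void)
{
    const char *orig = "/tmp/demo_orig.txt", *swap = "/tmp/demo_swap.txt",
               *out  = "/tmp/demo_out.txt";
    int fo = open(orig, O_RDWR | O_CREAT | O_TRUNC, 0600);
    int fs = open(swap, O_RDWR | O_CREAT | O_TRUNC, 0600);
    int fn = open(out,  O_RDWR | O_CREAT | O_TRUNC, 0600);
    char buf[16] = { 0 };
    int ok = fo >= 0 && fs >= 0 && fn >= 0
          && write(fo, "hello world", 11) == 11
          && write(fs, "EDIT", 4) == 4;
    if (ok) {
        struct piece pieces[] = { { fo, 0, 6 }, { fs, 0, 4 }, { fo, 10, 1 } };
        ok = assemble_file(fn, pieces, 3) == 0
          && pread(fn, buf, sizeof buf - 1, 0) == 11
          && strcmp(buf, "hello EDITd") == 0;
    }
    if (fo >= 0) close(fo);
    if (fs >= 0) close(fs);
    if (fn >= 0) close(fn);
    unlink(orig); unlink(swap); unlink(out);
    return ok ? 0 : -1;
}
```

The sketch treats a source that ends early as an error, which is exactly the failure that appears if the original file is changed or truncated between recording a range and saving.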

Problem: After making unsaved changes in my text editor, the user may concurrently have other programs modify the same file, possibly other text editors or other programs that modify the text file in place.

In such a situation, the editor can detect the external change using inotify, and I then want to give the user two options on how to continue:

discard all unsaved changes and re-read the file from disk, implementing this option is fairly straightforward
allow the user to continue editing the file and later on the user should be able to save the unsaved changes in a new location or to overwrite the changes that had been made by the other program, but implementing this seems tricky
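The detection step could look roughly like this. It is only a sketch: `watch_file`, `drain_events`, `demo_watch`, and the path under /tmp are hypothetical names, and the choice of event mask is an assumption about what an editor would care about.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Start watching one file. Returns a non-blocking inotify descriptor, or -1. */
int watch_file(const char *path)
{
    assert(path != NULL);
    int ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
    if (ifd < 0)
        return -1;
    uint32_t mask = IN_MODIFY | IN_ATTRIB | IN_MOVE_SELF | IN_DELETE_SELF;
    if (inotify_add_watch(ifd, path, mask) < 0) {
        close(ifd);
        return -1;
    }
    return ifd;
}

/* Drain queued events without blocking. Returns the OR of all event masks
 * seen, or 0 when nothing happened. Read errors are ignored in this sketch. */
uint32_t drain_events(int ifd)
{
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    uint32_t seen = 0;
    for (;;) {
        ssize_t n = read(ifd, buf, sizeof buf);
        if (n <= 0)
            break;                      /* EAGAIN: the queue is empty */
        for (char *p = buf; p < buf + n; ) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            seen |= ev->mask;
            p += sizeof *ev + ev->len;  /* events are variable-length */
        }
    }
    return seen;
}

/* Self-check: an in-place write by "another program" must raise IN_MODIFY. */
int demo_watch(void)
{
    const char *path = "/tmp/demo_watched.txt";
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int ifd = watch_file(path);
    if (ifd < 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    uint32_t before = drain_events(ifd);
    ssize_t  w      = write(fd, "x", 1);   /* stands in for the external edit */
    uint32_t after  = drain_events(ifd);
    close(fd);
    close(ifd);
    unlink(path);
    return (w == 1 && before == 0 && (after & IN_MODIFY)) ? 0 : -1;
}
```

An inotify watch follows the inode rather than the name, so noticing a replacement done by rename would probably also need a watch on the parent directory. Note too that the notification arrives after the write has happened, so it detects the change but does not prevent it.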

Since my editor did not make a copy of the file when it opened it, when the other program overwrites the file, the text ranges that my data structure is tracking may become invalid, because the data on disk has changed and these changes are now visible through my mmap(). This means that if my editor tried to write unsaved changes after the file had been modified by another process, it could be splicing text ranges computed against the old file using data from the new file, and so could produce a corrupt file when saving.

I don't think advisory locks would save the situation in all cases, as other programs may not honor advisory locks.

My ideal solution would be that, when other programs overwrite the file, the system transparently copies the file so that my program continues to see the old version while the other program finishes its write to disk and makes its version visible in the filesystem. I think ioctl_ficlone could have made this possible, but to my understanding it only works on a copy-on-write filesystem like btrfs.
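For reference, a sketch of what attempting such a clone looks like, so that the editor could at least take a cheap snapshot where the filesystem allows it. `try_reflink`, `demo_reflink`, and the /tmp paths are hypothetical; on filesystems without reflink support the ioctl is expected to fail, typically with EOPNOTSUPP, and the sketch simply reports that.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>     /* FICLONE */
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Try to make dst_path a reflink clone of src_path.
 * Returns 1 if cloned, 0 if the filesystem refused (errno says why),
 * and -1 if either file could not be opened. */
int try_reflink(const char *src_path, const char *dst_path)
{
    assert(src_path != NULL && dst_path != NULL);
    int src = open(src_path, O_RDONLY);
    if (src < 0)
        return -1;
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (dst < 0) {
        close(src);
        return -1;
    }
    int cloned = ioctl(dst, FICLONE, src) == 0;
    int saved  = errno;          /* keep the ioctl's errno across cleanup */
    close(src);
    close(dst);
    if (!cloned)
        unlink(dst_path);        /* do not leave an empty snapshot behind */
    errno = saved;
    return cloned;
}

/* Self-check: either the clone is refused, or it succeeds and the
 * snapshot has the same contents as the source. Returns 0 in both cases. */
int demo_reflink(void)
{
    const char *src = "/tmp/demo_reflink_src.txt";
    const char *dst = "/tmp/demo_reflink_dst.txt";
    int fd = open(src, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int ok = write(fd, "snapshot", 8) == 8;
    close(fd);
    int r = ok ? try_reflink(src, dst) : -1;
    if (r == 1) {
        char buf[16] = { 0 };
        int rd = open(dst, O_RDONLY);
        ok = rd >= 0 && read(rd, buf, sizeof buf - 1) == 8
          && memcmp(buf, "snapshot", 8) == 0;
        if (rd >= 0)
            close(rd);
        unlink(dst);
    } else {
        ok = ok && r == 0;
    }
    unlink(src);
    return ok ? 0 : -1;
}
```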

Is such a thing possible?

Any other suggestions to solve this problem would also be welcome.

Bullyrag asked 7/6, 2019 at 17:42 Comment(3)

If there's no option to mmap() to do this, I don't think there's a good solution. – Relator 7/6, 2019 at 18:16

It's an interesting question, but my suggestion would be to not jump through hoops to try to protect the user from themself. Advisory locks and a warning on modification is plenty. If a user wants to destroy a file, Unix offers a hundred more convenient footguns than to ignore locks and warnings in a large file editor. – Infuscate 7/6, 2019 at 18:38

Maybe not what you are looking for, but why not use the Linux user/group mechanism to forbid other processes from accessing or writing the file: just create a new user for your editor and set the appropriate access rights, maybe 744? That way, only root could mess with your file. – Marsupium 11/1, 2024 at 8:37

What you want to do isn't possible with mmap, and I'm not sure if it's possible at all with your constraints.

When you map a region, the kernel may or may not actually load all of it into memory. The region of memory that lacks data will actually contain an invalid page, so when you access it, the kernel takes a page fault and maps that region into memory. That region will likely contain whatever is in that portion of the file at the time the page fault occurs. There is an option, MAP_LOCKED, which tries to prefault all of the pages in, but doesn't guarantee it, so you can't rely on it working.
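A small experiment can make this concrete. It is a sketch with hypothetical names (`struct view`, `private_mapping_experiment`, the path argument): it maps a two-page file with MAP_PRIVATE, forces a private copy of the second page only, then overwrites both pages in the file and reports what the mapping shows.

```c
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* First byte of each page, as seen through the mapping after the file changed. */
struct view {
    char untouched_page;  /* page the process never wrote to */
    char cow_page;        /* page the process wrote to before the file changed */
};

/* Returns 0 and fills *out on success, -1 on any system call failure. */
int private_mapping_experiment(const char *path, struct view *out)
{
    assert(path != NULL && out != NULL);
    long   page = sysconf(_SC_PAGESIZE);
    size_t len  = 2 * (size_t)page;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Old contents: 'A' at the start of each page, zeros elsewhere. */
    int ok = ftruncate(fd, (off_t)len) == 0
          && pwrite(fd, "A", 1, 0) == 1
          && pwrite(fd, "A", 1, (off_t)page) == 1;
    char *map = ok ? mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0)
                   : MAP_FAILED;
    if (map == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }

    map[page + 1] = 'x';   /* write fault: second page becomes a private copy */

    /* "Another program" now overwrites both pages in place. */
    ok = pwrite(fd, "B", 1, 0) == 1 && pwrite(fd, "B", 1, (off_t)page) == 1;

    out->untouched_page = map[0];     /* unspecified; typically 'B' on Linux */
    out->cow_page       = map[page];  /* stays 'A': the page is already private */

    munmap(map, len);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Only pages the process has already written to are isolated from later changes to the file. Touching every page of a huge file to force that would amount to copying it into memory, which the question rules out.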

In general, you cannot prevent other processes from changing a file out from under you. Some tools (including editors) will write a new file to the side, calling rename to overwrite the file, and some will rewrite the file in place. The former is what you want, but many editors choose to do the latter, since it preserves characteristics such as ACLs and permissions you can't restore.
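The write-aside pattern can be sketched as follows; `replace_file`, `demo_replace`, the ".tmp" suffix, and the /tmp path are illustrative assumptions, not how any particular editor does it. The self-check shows why this pattern is harmless to a reader that already has the file open: the old descriptor keeps referring to the old inode.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace path with data by writing a sibling temp file and renaming it over.
 * Returns 0 on success, -1 on failure. */
int replace_file(const char *path, const void *data, size_t len)
{
    assert(path != NULL && (data != NULL || len == 0));
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);  /* hypothetical temp name */
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    const char *p = data;
    int ok = 1;
    while (ok && len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0 && errno == EINTR)
            continue;
        if (w < 0) {
            ok = 0;
        } else {
            p   += w;
            len -= (size_t)w;
        }
    }
    if (ok && fsync(fd) < 0)          /* flush data before it becomes visible */
        ok = 0;
    if (close(fd) < 0)
        ok = 0;
    if (ok && rename(tmp, path) < 0)  /* atomically swaps the directory entry */
        ok = 0;
    if (!ok)
        unlink(tmp);
    return ok ? 0 : -1;
}

/* Self-check: a descriptor opened before the replacement still reads "old". */
int demo_replace(void)
{
    const char *path = "/tmp/demo_replace.txt";
    char old_view[8] = { 0 }, new_view[8] = { 0 };
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int ok = write(fd, "old", 3) == 3 && replace_file(path, "new!", 4) == 0;
    ok = ok && pread(fd, old_view, sizeof old_view - 1, 0) == 3;
    int fd2 = open(path, O_RDONLY);
    ok = ok && fd2 >= 0 && read(fd2, new_view, sizeof new_view - 1) == 4;
    if (fd2 >= 0)
        close(fd2);
    close(fd);
    unlink(path);
    return (ok && strcmp(old_view, "old") == 0
               && strcmp(new_view, "new!") == 0) ? 0 : -1;
}
```

The replacement file is created with fresh ownership and mode 0600 here, which illustrates the drawback mentioned above: the original file's permissions and ACLs are not carried over unless the writer restores them itself.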

Furthermore, you really don't want to use mmap on any file you can't totally control, because if another process truncates the file and you try to access that portion of the buffer, your process will die with SIGBUS. Returning normally from a handler for this signal is undefined behavior, and the only sane thing to do is die. (Also, it can be sent in other situations, such as unaligned access, and you'll have a hard time distinguishing between them.)
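One way to sidestep SIGBUS, offered here only as an assumption about what might suit the editor, is to read ranges with pread instead of dereferencing a mapping. A truncated file then shows up as a short read that the program can handle. `read_range`, `demo_truncated_read`, and the /tmp path are hypothetical names.

```c
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to len bytes starting at offset. Returns the number of bytes
 * actually read (less than len if the file ends early), or -1 on error. */
ssize_t read_range(int fd, off_t offset, void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t done = 0;
    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          offset + (off_t)done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        if (n == 0)
            break;               /* end of file: the range no longer fully exists */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Self-check: after a truncation to 3 bytes, an 8-byte read returns 3
 * instead of killing the process. */
int demo_truncated_read(void)
{
    const char *path = "/tmp/demo_truncate.txt";
    char buf[8];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    int ok = write(fd, "abcdefgh", 8) == 8
          && read_range(fd, 0, buf, sizeof buf) == 8
          && ftruncate(fd, 3) == 0          /* stands in for another process */
          && read_range(fd, 0, buf, sizeof buf) == 3;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

This trades the zero-copy convenience of mmap for an explicit buffer, and it does nothing about contents changing in place. It only turns truncation into an error the editor can report.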

Ultimately, if you're not interested in copying the file, you can't guarantee that someone won't change it underneath you, and you'll need to be prepared for that to occur.

Unhallow answered 7/6, 2019 at 22:5 Comment(0)

-1

Any other suggestions to solve this problem would also be welcome.

The Linux way to deal with exclusive access to resources (files on disk, serial ports, audio DSPs, etc.) is to make use of the user/group mechanism. Make your process the owner of the file and forbid access to anyone but you.

In man 2 chown and man 2 chmod you can find information about how to do it programmatically. Also, The Linux Programming Interface (chapters 8 and 9) by Michael Kerrisk can give you a more comprehensive explanation of the relevant libraries and system calls.
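As a rough sketch of the chmod half of this, assuming the process already owns the file (changing the owner with chown would normally need extra privileges); `restrict_to_owner`, `demo_restrict`, and the /tmp path are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Make path readable and writable by its owner only (mode 0600).
 * Fails with EPERM if the calling process is not the owner. */
int restrict_to_owner(const char *path)
{
    assert(path != NULL);
    struct stat st;
    if (stat(path, &st) < 0)
        return -1;
    if (st.st_uid != geteuid()) {
        errno = EPERM;          /* taking ownership would need chown privileges */
        return -1;
    }
    return chmod(path, S_IRUSR | S_IWUSR);
}

/* Self-check: a file created with wider permissions ends up as 0600. */
int demo_restrict(void)
{
    const char *path = "/tmp/demo_restrict.txt";
    struct stat st;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;
    close(fd);
    int ok = restrict_to_owner(path) == 0
          && stat(path, &st) == 0
          && (st.st_mode & 0777) == 0600;
    unlink(path);
    return ok ? 0 : -1;
}
```

Permissions only keep out other users. Programs running as the same user, or as root, can still modify the file, and processes that already hold a writable descriptor are not affected by a later chmod.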

Marsupium answered 11/1, 2024 at 21:12 Comment(0)
