Atomic write to multiple files

Suppose I have a set of files. How do I ensure that writing to all of these files is atomic?

I thought about writing to temp files and, only after the writes succeed, performing an atomic rename of each file. However, renaming all the files at once isn't atomic. This also won't scale to very large files if we'd like to append to them.
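For a single file, the temp-file-and-rename trick is easy enough. Here's a sketch with a made-up `atomic_write` helper (`os.replace` is atomic on POSIX when source and destination are on the same filesystem); the multi-file case is exactly what this doesn't solve:

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path atomically via a temp file in the same directory."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # force the data to disk before the rename
        os.replace(tmp_path, path)  # atomic rename, even if path already exists
    except BaseException:
        os.unlink(tmp_path)         # clean up the temp file on failure
        raise

atomic_write('file1', b'hello world')
```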

Instead I thought about implementing transactions, but that quickly becomes a project of its own. I realize this is pretty much about implementing a mini database.

How would you do it in Python?

d = FileWriter.open(['file1', 'file2'], 'wb+')
d.write(b'add hello world to files')
d.close()

Ensure that d.write is atomic, or at least rolls back to the original files if unsuccessful.

Boom answered 18/8, 2012 at 22:25 Comment(7)
Doesn't file locking (which is implemented by nearly every filesystem) ensure atomic operations? d = FileWriter.open(['file1', 'file2'], 'wb+') — I think this statement will put a write lock on those files. The only thing you have to ensure is that a different process checks the lock status of a file.Conditioned
I just made FileWriter up. Naively it would sequentially write to two files. If something breaks in between, you are out of luck. Imagine these files represent a search engine index. Because you only wrote to one file and not the other, you now have a corrupted index.Boom
Ah OK, I missed a bit. You need an atomic operation over all files.Conditioned
Why don't you just use a database? This is what they're good at.Shayne
I think what I could do is have a repair mechanism. If something went wrong, fix the files.Boom
@AlexK: Does this have to be cross-platform? If not, which OS?Winzler
Yes it would have to be cross-platform.Boom

Here is what I'm thinking. First ensure open is synchronized, then perform the following:

  1. Write to temp files file1~ and file2~, plus a special marker file success~ (which must be written first).
  2. After all writes succeed, remove success~.
  3. Rename the temp files to file1 and file2.

If something breaks:

  1. Check whether success~ exists.
  2. If it does, don't bother repairing: a rollback was implicitly performed, because the original files were never updated (no renames happened).
  3. If success~ does not exist, things broke after the writes, during the renames. In that case repairing is as simple as renaming each remaining filex~ to filex.
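The steps above could be sketched like this (a sketch, not battle-tested: it assumes a single writer, hard-codes the marker name, and uses os.replace for the atomic renames):

```python
import os

MARKER = 'success~'

def write_all(files, data):
    """Write the same payload to several files using the success~ marker protocol."""
    open(MARKER, 'wb').close()              # marker first: writes are in progress
    for name in files:
        with open(name + '~', 'wb') as f:   # write temp copies
            f.write(data)
            f.flush()
            os.fsync(f.fileno())            # make sure the temps hit disk
    os.remove(MARKER)                       # all temps complete
    for name in files:
        os.replace(name + '~', name)        # commit by renaming

def repair(files):
    """Run on startup after a crash to restore a consistent state."""
    if os.path.exists(MARKER):
        # Writes never finished; the originals are intact. Roll back.
        for name in files:
            if os.path.exists(name + '~'):
                os.remove(name + '~')
        os.remove(MARKER)
    else:
        # Crash (if any) happened during the rename phase: finish the renames.
        for name in files:
            if os.path.exists(name + '~'):
                os.replace(name + '~', name)
```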
Boom answered 18/8, 2012 at 23:27 Comment(3)
Why go through all this trouble when databases come with this built-in?Corrigendum
Because I want those files to be plain text, human-readable. I also want the files to be easily shared across users. In fact I'd like anyone who knows the format of those files to be able to create them without using some particular database. It's for a mini search engine index.Boom
Plus, that's a bit like answering a question about ironing suit pants with the suggestion of just wearing blue jeans instead. It's not friendly, and certainly not helpful.Erkan

Your requirement, as currently stated, is so generic that the only plausible answer is that OS-level support is necessary, and at that point it's not even strictly a Python question anymore. See what is said about transactional file systems, for example here and here. A small excerpt regarding the solutions proposed so far:

Ensuring consistency across multiple file system operations is difficult, if not impossible, without file system transactions. File locking can be used as a concurrency control mechanism for individual files, but it typically does not protect the directory structure or file metadata. For instance, file locking cannot prevent TOCTTOU race conditions on symbolic links. File locking also cannot automatically roll back a failed operation, such as a software upgrade; this requires atomicity.

So I would suggest rethinking your problem (and maybe your question too?) and analyzing your requirements in greater detail. The bottom line is that the fewer guarantees you need, the simpler a solution you can find. Maybe you can get away with something as simple as file locking, or maybe you will find that a database is needed.

Speaking of which, if you're thinking about a filesystem because the components of your architecture need to access files, have you thought about FUSE as a way to build a filesystem-like facade on top of a regular database? Using python-fuse is a breeze.

Decca answered 20/8, 2012 at 5:38 Comment(1)
Actually in my case, it seems that a repair mechanism would be sufficient. Attempt to write to those files and if something went wrong repair them at a later stage.Boom

The reason that database management systems usually either use exactly one write-ahead log file, or alternatively a custom file system, is that the usual file systems are really bad at making any guarantees about atomicity or order of execution.

They are entirely optimized for consistency and performance.

Therefore, only the order of writes into a single file can be trusted, unless you create a partition or image file with a specialized file system.

You can, though, write transaction numbers with every write to multiple files.

And make it obvious whether a write was complete, e.g. with xml or json style blocks with begin / end markers like <element>...</element> or { ... }, &c.

Then, your code can easily detect any gaps across multiple files and determine a last consistent state after a crash.

To avoid the last consistent state after a crash being arbitrarily old because some write sat in the cache for minutes, any of these approaches can be combined with sync / fsync.

Using sync / fsync also makes transactional commits kind of possible, i.e. guaranteeing that everything until now was written at least from the file system's point of view.
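The transaction-number-plus-markers idea can be sketched with JSON lines, one self-delimiting record per write (the helper names here are made up): a record that fails to parse marks the point where a torn write happened, and everything before it is the last consistent state.

```python
import json
import os

def append_record(path, txn, payload):
    """Append one self-delimiting JSON record, then fsync so it survives a crash."""
    line = json.dumps({'txn': txn, 'data': payload}) + '\n'
    with open(path, 'a', encoding='utf-8') as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())

def last_complete_txn(path):
    """Scan a log file and return the highest txn whose record parses cleanly."""
    last = None
    try:
        with open(path, encoding='utf-8') as f:
            for line in f:
                try:
                    last = json.loads(line)['txn']
                except (ValueError, KeyError):
                    break  # torn write: everything from here on is suspect
    except FileNotFoundError:
        pass
    return last

# Across multiple files, the last consistent state is the minimum of
# last_complete_txn(f) over all files: every file has that txn complete.
```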

Your storage system might then still lose the last write because of a power outage, though, be it a hard disk or SSD with an internal cache, a NAS, &c. The guarantees these systems give can vary wildly, which is a challenge to be considered with all approaches, whether one is using the file system or a traditional RDBMS for storage.

Using a UPS is definitely a great addition if you're writing to a built-in hard disk or SSD, especially if you shut your system down in a controlled manner once the UPS signals that it's about time to do so ^^

Erkan answered 6/3, 2017 at 8:22 Comment(0)

You could try using flock to achieve your goal; in Python it's available as fcntl.flock. Do note that this is only an advisory measure. To really guarantee what you want, you should use a database, or a filesystem/kernel that supports mandatory locking.
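A minimal sketch of the advisory approach (Unix-only, since the fcntl module doesn't exist on Windows; the lock only helps if every cooperating process also takes it):

```python
import fcntl  # Unix-only advisory locking

def locked_write(path, data):
    """Hold an exclusive advisory lock on the file while appending to it."""
    with open(path, 'ab') as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            f.write(data)
            f.flush()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

Note that this serializes concurrent writers to one file, but it does nothing for the original problem of making writes to several files atomic as a group.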

Pedicle answered 18/8, 2012 at 23:10 Comment(0)
