How to ensure that data doesn't get corrupted when saving to a file?

I am relatively new to C#, so please bear with me.

I am writing a business application (in C#, .NET 4) that needs to be reliable. Data will be stored in files. Files will be modified (rewritten) regularly, so I am afraid that something could go wrong (power loss, the application getting killed, a system freeze, ...) while saving data, which would (I think) result in a corrupted file. I know that data which wasn't saved is lost, but I must not lose data which was already saved (because of corruption or ...).

My idea is to keep 2 versions of every file and each time rewrite the older one. Then, in case my application ends unexpectedly, at least one file should still be valid.

Is this a good approach? Is there anything else I could do? (Database is not an option)

Thank you for your time and answers.

Renascent answered 31/10, 2011 at 17:49 Comment(3)
Out of curiosity, why is database not an option?Ranchod
how do you update those files? add a line, add bytes? what is the frequency of updates?Joanjoana
Files will be completely rewritten, though only some parts will change. Files are relatively small (should be less than 1MB). Some files are rewritten only a couple of times per day, others every 5-10 minutes on average.Renascent

A lot of programs use this approach, but they usually keep more copies, to guard against human error as well.

For example, Cadsoft Eagle (a program used to design circuits and printed circuit boards) keeps up to 9 backup copies of the same file, calling them file.b#1 ... file.b#9.
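
A rotation like that could be sketched in C# as follows (a minimal illustration; the RotateBackups name and the details are mine, not Eagle's actual scheme):

    using System.IO;

    // Minimal sketch of a rotating-backup scheme (illustrative, not Eagle's code).
    // Before overwriting "path", shift the backups up by one and make the
    // current file the newest backup, keeping at most maxBackups copies.
    static void RotateBackups(string path, int maxBackups)
    {
        // Drop the oldest backup if the limit has been reached.
        string oldest = path + ".b#" + maxBackups;
        if (File.Exists(oldest))
            File.Delete(oldest);

        // Shift the rest up: b#8 -> b#9, ..., b#1 -> b#2.
        for (int i = maxBackups - 1; i >= 1; i--)
        {
            string source = path + ".b#" + i;
            if (File.Exists(source))
                File.Move(source, path + ".b#" + (i + 1));
        }

        // The current file becomes b#1, the newest backup.
        if (File.Exists(path))
            File.Copy(path, path + ".b#1");
    }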

Another thing you can do to improve integrity is hashing: append a hash such as a CRC32 or MD5 at the end of the file. When you open the file you check the CRC or MD5; if they don't match, the file is corrupted. This also protects you from people who accidentally or deliberately modify your file with another program, and it gives you a way to know if the hard drive or USB disk got corrupted.
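
A minimal sketch of the append-a-hash idea, using MD5 from the .NET base class library (the method names are illustrative):

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    // Save: write the payload followed by its 16-byte MD5 hash.
    static void SaveWithHash(string path, byte[] payload)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.Create(path))
        {
            byte[] hash = md5.ComputeHash(payload);
            stream.Write(payload, 0, payload.Length);
            stream.Write(hash, 0, hash.Length); // hash trails the data
        }
    }

    // Load: recompute the hash and compare it with the stored one.
    static byte[] LoadWithHash(string path)
    {
        byte[] all = File.ReadAllBytes(path);
        if (all.Length < 16)
            throw new InvalidDataException("File too short to contain a hash.");

        byte[] payload = new byte[all.Length - 16];
        Array.Copy(all, payload, payload.Length);
        byte[] stored = all.Skip(payload.Length).ToArray();

        using (var md5 = MD5.Create())
        {
            if (!md5.ComputeHash(payload).SequenceEqual(stored))
                throw new InvalidDataException("Hash mismatch: file is corrupted.");
        }
        return payload;
    }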

Of course, the faster the save operation is, the lower the risk of losing data, but you cannot be sure that nothing will happen during or after writing.

Consider that hard drives, USB drives and the Windows OS all use caches, which means that even once you have finished writing, the OS or the disk itself may still not have physically written the data to the disk.
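
You can at least ask the OS to push the data through its cache. A minimal sketch (FileStream.Flush(bool) exists from .NET 4; note that some drives buffer internally, so this narrows the window rather than closing it):

    using System.IO;

    static void WriteThroughToDisk(string path, byte[] data)
    {
        // WriteThrough hints the OS to bypass its cache for this handle.
        using (var stream = new FileStream(
            path, FileMode.Create, FileAccess.Write, FileShare.None,
            4096, FileOptions.WriteThrough))
        {
            stream.Write(data, 0, data.Length);
            stream.Flush(true); // true = flush intermediate OS buffers to the device
        }
    }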

Another thing you can do is save to a temporary file and, if everything went OK, move the file to the real destination folder; this will reduce the risk of ending up with half-written files.
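
A minimal sketch of that idea (the method name is mine; note there is still a brief window between the delete and the move, which the rename sequence in the other answer avoids):

    using System.IO;

    static void SaveViaTempFile(string path, byte[] data)
    {
        string temp = path + ".tmp";
        File.WriteAllBytes(temp, data); // if this dies halfway, the real file is untouched

        if (File.Exists(path))
            File.Delete(path);          // File.Move cannot overwrite an existing file
        File.Move(temp, path);          // a rename within one volume is quick
    }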

You can mix all these techniques together.

Bithynia answered 31/10, 2011 at 17:52 Comment(4)
I would generally follow this route, possibly with the added logic of renaming past revisions so that file.b#1 is always the newest copy.Fagin
Yes, this is exactly what Eagle does. It also deletes the oldest file, keeping at most 9 backup files. You can change this.Bithynia
@SalvatorePreviti But is there no more reliable way? For example, how does Microsoft Visual Studio save source code files without this approach, with changes immediately committed to disk? (At least I have never come across someone complaining that they got a corrupted source file when there was a sudden power failure in the middle of a Ctrl+S in Visual Studio.) I am particularly interested in immediately committing changes to disk every time a file is written. At least I know that even database engines save files to disk.Themistocles
I did come across people who got corrupted files working with MSVS2017.Grate

Rather than "always write to the oldest" you can use the "safe file write" technique of:

(Assuming you want to end up saving data to foo.data, and a file with that name contains the previous valid version.)

  • Write new data to foo.data.new
  • Rename foo.data to foo.data.old
  • Rename foo.data.new to foo.data
  • Delete foo.data.old

At any one time you've always got at least one valid file, and you can tell which is the one to read just from the filename. This is assuming your file system treats rename and delete operations atomically, of course. When loading, you can work out what happened from which files exist (a C# sketch of the save sequence follows the list below):

  • If foo.data and foo.data.new exist, load foo.data; foo.data.new may be broken (e.g. power off during write)
  • If foo.data.old and foo.data.new exist, both should be valid, but something died very shortly afterwards - you may want to load the foo.data.old version anyway
  • If foo.data and foo.data.old exist, then foo.data should be fine, but again something went wrong, or possibly the file couldn't be deleted.
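
A minimal sketch of the four-step sequence in C# (SafeSave is an illustrative name; handling of leftover .old/.new files from a previous crash, per the list above, is left out):

    using System.IO;

    static void SafeSave(string path, byte[] data) // path = "foo.data"
    {
        string newFile = path + ".new";
        string oldFile = path + ".old";

        File.WriteAllBytes(newFile, data); // 1. write new data to foo.data.new

        if (File.Exists(path))             //    (skip step 2 on the very first save)
            File.Move(path, oldFile);      // 2. rename foo.data -> foo.data.old
        File.Move(newFile, path);          // 3. rename foo.data.new -> foo.data
        if (File.Exists(oldFile))
            File.Delete(oldFile);          // 4. delete foo.data.old
    }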

Alternatively, simply always write to a new file, including some sort of monotonically increasing counter - that way you'll never lose any data due to bad writes. The best approach depends on what you're writing though.
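A minimal sketch of that alternative (the foo.000042.data naming scheme is my own illustration):

    using System;
    using System.IO;
    using System.Linq;

    static void SaveNewVersion(string dir, byte[] data)
    {
        // Find the highest counter used so far and write the next version.
        int next = Directory.GetFiles(dir, "foo.*.data")
                            .Select(f => int.Parse(Path.GetFileName(f).Split('.')[1]))
                            .DefaultIfEmpty(0)
                            .Max() + 1;
        File.WriteAllBytes(Path.Combine(dir, string.Format("foo.{0:D6}.data", next)), data);
        // On load, take the highest-numbered file that passes validation.
    }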

You could also use File.Replace for this, which basically performs the last three steps for you. (Pass in null for the backup name if you don't want to keep a backup.)
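
For example (File.Replace is a real .NET API; the wrapper name is mine; note that File.Replace requires the destination file to exist):

    using System.IO;

    static void SaveWithReplace(string path, byte[] data)
    {
        string newFile = path + ".new";
        File.WriteAllBytes(newFile, data);

        if (File.Exists(path))
            // Swaps newFile into place, keeping the old contents as path + ".old".
            // Pass null instead of the backup name to skip the backup.
            File.Replace(newFile, path, path + ".old");
        else
            File.Move(newFile, path); // first save: nothing to replace yet
    }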

Rocca answered 31/10, 2011 at 17:56 Comment(7)
In theory this should be enough, but in practice it is not, due to OS caches when it comes to file writes. I've implemented this method and seen both the .data and .data.old files become corrupt. More on the subject: #383824Contralto
Would it be noteworthy to add a section about File.Replace in this answer? It does a lot of this work for you, and as far as I understand from the ReplaceFile API docs, the actual data swap either works or it doesn't - your destination file data will never be half written or anything like that.Reddish
@MikeMarynowski: Gosh, I'd never seen that before. Yes, will add a note.Rocca
It works for any kind of content including binary, not just text. The backup file name can be null if you don't care to keep an old copy :) You do the first step, write new data, and then ReplaceFile does the last 3 steps in one go and keeps the file system in a consistent state in failure conditions.Reddish
@MikeMarynowski: Doh, had misread the documentation. Trying to do too many things at once. Thanks, will edit.Rocca
This is fine for existing files, but when there is no data yet (initial start) it doesn't seem to work. Correct?Employer
@juFo: No, it should work fine - just skip the "rename foo.data" and "delete foo.data.old" parts, as the file doesn't exist. Yes, you need to detect that, but that's not difficult. (And if just foo.data.new exists when reading, then that may be broken so you should probably treat it as if it didn't exist at all.)Rocca

In principle there are two popular approaches to this:

  • Make your file format log-based, i.e. do not overwrite in the usual save case, just append changes or the latest versions at the end.

or

  • Write to a new file, rename the old file to a backup and rename the new file into its place.

The first requires (way) more development effort, but it also has the advantage of making saves faster when you save small changes to large files (Word used to do this, AFAIK).
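
A minimal sketch of the log-based idea (the record framing is my own: a length prefix plus a checksum lets the reader detect and discard a half-written final record, which addresses the concern in the comments below):

    using System.IO;

    // Each record is written as [length][payload][checksum] so a reader can
    // stop at the first record that is incomplete or fails its checksum.
    static void AppendRecord(string logPath, byte[] payload)
    {
        using (var stream = new FileStream(logPath, FileMode.Append, FileAccess.Write))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(payload.Length);          // 4-byte length prefix
            writer.Write(payload);
            writer.Write(SimpleChecksum(payload));
            writer.Flush();
            stream.Flush(true);                    // push through the OS cache (.NET 4+)
        }
    }

    // A toy checksum for illustration; a real format would use CRC32 or similar.
    static int SimpleChecksum(byte[] data)
    {
        int sum = 0;
        foreach (byte b in data)
            sum = unchecked(sum * 31 + b);
        return sum;
    }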

Bahia answered 31/10, 2011 at 17:56 Comment(2)
Log-based writes can still result in a corrupted file if there is only one file in play. Power outages etc. can interrupt the actual file-writing while appending the change, which will result in there being no valid EOF.Fagin
But you (or rather, the computer) won't know how much data is valid. The computer may not even know how much data is part of the file.Fagin
