Keeping my database and file system in sync
I'm working on a piece of software that stores files in a file system, as well as references to those files in a database. Querying the uploaded files can thus be done in the database without having to access the file system. From what I've read in other posts, most people say it's better to use the file system for file storage rather than storing binary data directly in the database as BLOBs.

So now I'm trying to understand the best way to set this up so that both the database and the file system stay in sync and I don't end up with references to files that don't exist, or files taking up space in the file system that aren't referenced. Here are a couple of options that I'm considering.

Option 1: Add File Reference First

//Adds a reference to a file in the database
database.AddFileRef("newfile.txt"); 

//Stores the file in the file system
fileStorage.SaveFile("newfile.txt",dataStream); 

This option would be problematic because the reference to the file is added before the actual file, so another user may end up trying to download a file before it is actually stored in the system. On the other hand, since the reference to the file is created beforehand, the primary key value could be used when storing the file.

Option 2: Store File First

//Stores the file
fileStorage.SaveFile("newfile.txt",dataStream); 

//Adds a reference to the file in the database
//fails if the referenced file does not exist in the file system
database.AddFileRef("newfile.txt"); 

This option is better, but makes it possible for someone to upload a file to the system that is never referenced. This could be remedied with a "Purge" or "CleanUpFileSystem" function that deletes any unreferenced files. This option also wouldn't allow the file to be stored using the primary key value from the database.
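A minimal sketch of such a purge function in Python (the names `purge_unreferenced`, `storage_dir`, and `referenced_names` are illustrative, not part of any real API; `referenced_names` stands in for the set of file names returned by a database query):

```python
import os

def purge_unreferenced(storage_dir, referenced_names):
    """Delete files on disk that have no database reference.

    `referenced_names` is assumed to be a set of file names
    currently referenced in the database.
    """
    removed = []
    for name in os.listdir(storage_dir):
        if name not in referenced_names:
            os.remove(os.path.join(storage_dir, name))
            removed.append(name)
    return removed
```

One caveat with this kind of cleanup under option 2: a purge running between `SaveFile` and `AddFileRef` could delete a file whose reference is about to be inserted, so in practice you would only purge files older than some grace period.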

Option 3: Pending Status

//Adds a pending file reference to database
//pending files would be ignored by others
database.AddFileRef("newfile.txt"); 

//Stores the file, fails if there is no 
//matching pending file reference in the database
fileStorage.SaveFile("newfile.txt",dataStream); 

//marks the file reference as committed after file is uploaded
database.CommitFileRef("newfile.txt"); 

This option allows the primary key to be created before the file is uploaded, and also prevents other users from obtaining a reference to a file before it is uploaded. However, a file might never be uploaded, leaving its reference stuck in pending. Still, it would be fairly trivial to purge stale pending references from the database.
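The pending/commit lifecycle can be sketched with an in-memory dict standing in for the database table (all names here are illustrative, not your actual schema or API):

```python
import time

class FileRefDb:
    """Toy stand-in for the file-reference table."""

    def __init__(self):
        self.refs = {}  # name -> {"status": ..., "created": ...}

    def add_pending(self, name):
        # step 1: create the reference in "pending" state
        self.refs[name] = {"status": "pending", "created": time.time()}

    def commit(self, name):
        # step 3: mark the reference committed once the upload finishes
        self.refs[name]["status"] = "committed"

    def visible_files(self):
        # readers only ever see committed references
        return [n for n, r in self.refs.items()
                if r["status"] == "committed"]

    def purge_stale_pending(self, max_age_seconds):
        # drop pending references older than the given age
        cutoff = time.time() - max_age_seconds
        stale = [n for n, r in self.refs.items()
                 if r["status"] == "pending" and r["created"] < cutoff]
        for n in stale:
            del self.refs[n]
        return stale
```

Step 2 (the actual `SaveFile` call) would happen between `add_pending` and `commit`; the purge handles uploads that die in between.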

I'm leaning toward option 2, because it's simple, and I don't have to worry about users trying to request files before they are uploaded. Storage is cheap, so it's not the end of the world if I end up with some unreferenced files taking up space. But this also seems like a common problem, and I'd like to hear how others have solved it or other considerations I should be making.

History answered 15/3, 2013 at 18:57 Comment(1)
Very intelligent question. Many people never think about consistency between their disparate data stores.Swithin

I want to propose another option. Make the filename always equal to the hash of its contents. Then you can safely write any content at all times provided that you do it before you add a reference to it elsewhere.

As contents never change there is never a synchronization problem.

This gives you deduplication for free. Deletes become harder though. I recommend a nightly garbage collection process.
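A minimal Python sketch of this content-addressed scheme (`save_by_hash` and `storage_dir` are illustrative names; SHA-256 is used here as one collision-resistant choice of hash function):

```python
import hashlib
import os

def save_by_hash(storage_dir, data):
    """Store `data` under a name derived from its SHA-256 hash.

    Identical content always maps to the same path, so the write
    is idempotent and re-uploads are deduplicated for free.
    """
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(storage_dir, digest + ".dat")
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return digest  # store this digest in the database as the reference
```

The database row then stores the returned digest instead of a user-supplied file name, and the file is written before the reference, as described above.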

Swithin answered 15/3, 2013 at 19:0 Comment(2)
Could you elaborate? I would get the hash code from the file, and use that code to determine how the file is stored in the file system? Then the database would store the reference to the file as the hash code rather than the file name? Wouldn't I then have to deal with the potential of collisions?History
If you use a standard cryptographic hash function you don't have to deal with collisions at all (if you did you'd have won the lottery 10 times already). The old MD5 algorithm is good enough, built in everywhere and one of the fastest algorithms. You would first determine the hash, then derive a file name from it (tohex(hashbytes) + ".dat") and write it. Then you store the hash (or file name) in the database. Done.Swithin

What is the real use of the database? If it's just a list of files, I don't think you need it at all, and not having it saves you the hassle of synchronising.

If you are convinced you need it, then options 1 and 2 are completely identical from a technical point of view - the 2 resources can be out of sync and you need a regular process to consolidate them again. So here you should choose the options that suits the application best.

Option 3 has no advantage whatsoever, but uses more resources.

Note that using hashes, as suggested by usr, bears a theoretical risk of collision. And you'd also need a periodical consolidation process, as for options 1 and 2.

Another question is how you deal with partial uploads and uploads in progress. Here option 2 could be of use, but you could also use a second "flag" file that is created before the upload starts and deleted when the upload is done. This would help you determine which uploads have been aborted.
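The flag-file idea can be sketched like this (the helper names `start_upload`, `finish_upload`, and `aborted_uploads`, and the `.uploading` suffix, are all illustrative):

```python
import os

FLAG_SUFFIX = ".uploading"

def start_upload(storage_dir, name):
    # create a flag file marking an upload in progress
    open(os.path.join(storage_dir, name + FLAG_SUFFIX), "w").close()

def finish_upload(storage_dir, name):
    # remove the flag once the upload completed successfully
    os.remove(os.path.join(storage_dir, name + FLAG_SUFFIX))

def aborted_uploads(storage_dir):
    """Names whose flag file still exists were never finished."""
    return [f[:-len(FLAG_SUFFIX)] for f in os.listdir(storage_dir)
            if f.endswith(FLAG_SUFFIX)]
```

A periodic cleanup job could then delete both the leftover flag files and any partial data files they point to.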

Rompers answered 15/3, 2013 at 19:15 Comment(1)
There is other non-binary information stored in the database that is related to the files, so yes, I need it.History

To remedy the drawback you mentioned of option 1, I use something like fileStorage.FileExists("newfile.txt"); and filter out the references for which it returns false.

In Python lingo:

import os

# keep only the references whose file actually exists on disk
valid_refs = [ref for ref in database.AllRefs()
              if os.path.exists(ref.path())]
Radiotelegraph answered 7/8, 2013 at 10:6 Comment(0)