Storing Documents as Blobs in a Database - Any disadvantages?
56

The requirements for my document management system were:

  1. Must be secure from theft by simple copying of directories, files etc.
  2. Must be secure against traditional virus infection (infection of physical file)
  3. Must be fast to retrieve
  4. The repository must not be visible to casual (directory) browsing users etc.

I have decided to store all documents (and scanned images) as blobs in the database. So far my experience has been wonderful, and document retrieval is blindingly fast as well. It meets all the criteria above, and there are even a couple of additional advantages, such as automatically storing documents together with the entity they relate to, easy and fast searching of contents, and eliminating all sorts of user activity around opening and naming documents.

My question is - are there any serious risks or things that I overlooked with this design and implementation?

EDIT Note: the DB is PostgreSQL; it handles BLOBs very well and scales exceptionally well. The environment is multi-user.
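
For illustration, a minimal sketch of the store/retrieve pattern I described, using the PostgreSQL JDBC driver. The table and column names here are made up for this example, not my real schema:

```java
// Minimal sketch (names are illustrative, not the real schema).
// Assumes the PostgreSQL JDBC driver on the classpath and a table:
//   documents(id BIGSERIAL PRIMARY KEY, entity_id BIGINT,
//             name TEXT, content BYTEA)
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DocumentStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/docs", "app", "secret")) {

            // Store: stream the file into the bytea column rather than
            // reading it fully into memory first.
            try (InputStream in = new FileInputStream("invoice.pdf");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO documents (entity_id, name, content) VALUES (?, ?, ?)")) {
                ps.setLong(1, 42L);                 // the entity it relates to
                ps.setString(2, "invoice.pdf");
                ps.setBinaryStream(3, in);
                ps.executeUpdate();
            }

            // Retrieve: a plain primary-key lookup, which is why it is fast.
            try (PreparedStatement ps = conn.prepareStatement(
                     "SELECT content FROM documents WHERE id = ?")) {
                ps.setLong(1, 1L);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        byte[] content = rs.getBytes("content");
                        // hand the bytes to the viewer / HTTP response here
                    }
                }
            }
        }
    }
}
```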

Kurtz answered 17/10, 2008 at 12:9 Comment(0)
42

When your DB grows bigger and bigger, it becomes harder to back up. Restoring a backup of a table with over 100 GB of data is not something that makes you happy.

Another issue is that all the table-management functions get slower and slower as the dataset grows.
But this can be mitigated by making your data table contain just two fields, ID and BLOB, as sketched below.

Retrieving data (by primary key) will likely only become a problem long after you hit a wall with backing up the dataset.
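
A rough sketch of that two-field split, with illustrative names (the DDL is just one way to do it, shown here in PostgreSQL flavor since that is the asker's DB):

```java
// Illustrative sketch of the split described above: metadata in one
// table, the blob in a narrow two-column table keyed by the same ID.
// Table and column names are my own examples.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaSplit {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/docs", "app", "secret");
             Statement st = conn.createStatement()) {

            // Metadata table: small rows, cheap to scan and back up.
            st.execute("CREATE TABLE document_meta ("
                     + "  id BIGSERIAL PRIMARY KEY,"
                     + "  name TEXT NOT NULL,"
                     + "  created TIMESTAMP NOT NULL DEFAULT now())");

            // Blob table: just ID and BLOB, so metadata queries never
            // touch (or push into cache) the large values.
            st.execute("CREATE TABLE document_blob ("
                     + "  id BIGINT PRIMARY KEY REFERENCES document_meta(id),"
                     + "  content BYTEA NOT NULL)");
        }
    }
}
```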

Discourteous answered 18/10, 2008 at 15:3 Comment(3)
As with any large dataset, have a server that you put in and out of replication to take snapshots of the database for backup. How would this be any different with BLOBs? – Trypanosomiasis
There is no difference between images and any other BLOB data. Still, moving the BLOB data into its own table speeds up reading the other columns, because the blob data does not need to be referenced/loaded into memory. Also, most web applications do not have big BLOB data other than images. – Discourteous
@Discourteous Every Unicode string longer than 1000 characters requires a CLOB on Oracle, because Oracle stores Unicode as 4 bytes per character and every value must be smaller than 4 KB. It is very easy to exceed this limit. We need CLOBs for unparsed XML data and BLOBs for certificates. – Zygodactyl
32

The main disadvantage I often hear about using blobs is that, above a certain size, the file system is much more efficient at storing and retrieving large files. It sounds like you've already taken this into account in your list of requirements.

There's a good reference (PDF) here that covers the pros and cons of blobs.

Sartin answered 17/10, 2008 at 12:16 Comment(0)
13

In my experience, some issues were:

  1. Speed, versus having the files on the file system.

  2. Caching. IMO the web server will do a better job of caching static content. The DB will do a good job too, but if the DB is also handling all sorts of other queries, don't expect those large documents to stay cached for long. You also essentially have to transfer the files twice: once from the DB to the web server, and then from the web server to the client.

  3. Memory constraints. At my last job we had a 40 MB PDF in the database and kept getting Java OutOfMemoryErrors in the log file. We eventually realized that the entire 40 MB PDF was read into the heap not just once, but TWICE, thanks to a setting in Hibernate ORM (if an object is mutable, it makes a copy for editing in memory). Once the PDF was streamed back to the user the heap was cleaned up, but it was a big hit to suck 80 MB out of the heap at once just to stream a document. Know your code and how memory is being used! (A rough streaming sketch follows this list.)
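
The fix, roughly, was to stream the blob to the client in chunks instead of materializing it on the heap. A sketch in plain JDBC rather than Hibernate; table and column names are illustrative, and whether the driver truly streams or buffers the whole value still depends on the database and driver:

```java
// Rough sketch: copy a blob from the DB to the HTTP response in small
// chunks instead of pulling the whole document onto the heap.
// Plain JDBC, not Hibernate; names are illustrative.
import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public final class BlobStreamer {
    // 'out' would be e.g. HttpServletResponse#getOutputStream().
    public static void stream(Connection conn, long docId, OutputStream out)
            throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT content FROM documents WHERE id = ?")) {
            ps.setLong(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    try (InputStream in = rs.getBinaryStream("content")) {
                        byte[] buf = new byte[8192]; // 8 KB at a time,
                        int n;                       // never the whole PDF
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n);
                        }
                    }
                }
            }
        }
    }
}
```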

Your web server should be able to handle most of your security concerns, but if documents are small and the DB isn't already under a big load, then I don't really see a big issue with having them in the DB.

Fledge answered 18/10, 2008 at 15:46 Comment(1)
Documents will remain relatively small, but I will keep this in mind; maybe having two databases on separate servers or something like that. – Revet
4

I've just started researching SQL Server 2008's FILESTREAM support for BLOBs and have run across a HUGE limitation (IMO): it only works with integrated security. If you don't use Windows Authentication to connect to the DB server, you're unable to read/write the BLOBs. Many application environments can't use Windows Authentication, certainly not heterogeneous ones.

A better solution for storing BLOBs must exist. What are the best practices?

Ogre answered 17/10, 2008 at 12:9 Comment(0)
2

This article covers most of the issues. If you are using SQL Server 2008, check out the use of the new FILESTREAM type as discussed by Paul Randal here.

Hypotrachelium answered 17/10, 2008 at 12:12 Comment(0)
2

It depends on the database type: Oracle or SQL Server? Be aware of one disadvantage: restoring a single document.

Collier answered 17/10, 2008 at 12:22 Comment(0)
0

In my experience, storing content files as blobs works OK in both SQL Server and Oracle for a small database with a low number of logged-in users. ECM systems separate the content out and use dedicated services for streaming it. Depending on the size of the files, server resources can be strained by simultaneous retrieval of large files. Archiving databases with large sets of files becomes problematic because of the time needed to restore them and the inability to retrieve documents from the archive.

If these files are corporate records, and this is the authoritative copy of the records, you may have compliance and retention-management issues, especially if you archive the files. Search and version control may also become a huge issue going forward.

You may want to investigate an ECM system with an API of some sort, rather than re-inventing the wheel.

Solfatara answered 17/10, 2008 at 12:9 Comment(0)
-1

Sorry, the answer I offered was based on SQL Server, so the maintenance portion may not apply. But file I/O is accomplished at the hardware level, and any database adds extra processing steps.

The database will impose extra overhead when retrieving the document. When the file is on disk, you are only as slow or as fast as the I/O on the server. You certainly should manage your metadata in a database, but in the end you want the UNC path of the file: point the user at the source and get out of the way.
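
A sketch of that pattern, with invented names: the database holds only metadata, including the UNC path, and the application hands the path back rather than the bytes.

```java
// Sketch of the pattern above: keep metadata in the database, keep the
// file itself on a file share, and return the UNC path to the caller.
// Table and column names are invented for illustration.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public final class DocumentLocator {
    /** Returns the UNC path for a document, or null if unknown. */
    public static String uncPathFor(Connection conn, long docId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT unc_path FROM document_meta WHERE id = ?")) {
            ps.setLong(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                // e.g. \\fileserver\docs\2008\invoice-42.pdf
                return rs.next() ? rs.getString("unc_path") : null;
            }
        }
    }
}
```

From there the web tier can redirect or stream from the file share directly; the database never touches the document bytes.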

From a maintenance and administration perspective, you limit yourself to a SAN when dealing with MS SQL Server. Solutions like Documentum take a different approach, with simple storage on disk, and allow you to implement a storage solution as you see fit.

EDIT

Let me clarify my statement: with SQL Server you have limited options when you exceed the physical storage capacity of the box. This is in fact one of the big weaknesses of SharePoint: you are not able to simply attach any type of network storage.

Merganser answered 17/10, 2008 at 12:17 Comment(1)
Mitch: The database imposes extra network connections as opposed to the I/O calls for a local file. The time difference may be noticeable, particularly if you can use sendfile() for I/O. (sendfile() info: articles.techrepublic.com.com/5100-10878_11-1044112.html) – Stansbury
