What is the benefit of having varbinary field in a separate 1-1 table?

Asked 25/9, 2009 at 14:39 Answered 6/3, 2017 at 20:8

Solved sql-server sql-server-2005 performance database-design varbinary

I need to store binary files in a varbinary(max) column on SQL Server 2005 like this:

FileInfo

FileInfoId int, PK, identity
FileText varchar(max) (can be null)
FileCreatedDate datetime etc.

FileContent

FileInfoId int, PK, FK
FileContent varbinary(max)

FileInfo has a one to one relationship with FileContent. The FileText is meant to be used when there is no file to upload, and only text will be entered manually for an item. I'm not sure what percentage of items will have a binary file.

Should I create the second table. Would there be any performance improvements with the two table design? Are there any logical benefits?

I've found this page, but not sure if it applies in my case.

Frumentaceous answered 25/9, 2009 at 14:39 Comment(0)

There is no performance nor operational advantage. Since SQL 2005 the LOB types are already stored for you by the engine in a separate allocation unit, a separate b-tree. If you study the Table and Index Organization of SQL Server you'll see that every partition has up to 3 allocation units: data, LOB and row-overflow:

_{(source: s-msft.com)}

A LOB field (varchar(max), nvarchar(max), varbinary(max), XML, CLR UDTs as well as the deprecated types text, ntext and image) will have in the data record itself, in the clustered index, only a very small footprint: a pointer into the LOB allocation unit, see Anatomy of a Record.

By storing a LOB explicitly in a separate table you gain absolutely nothing. You just add unneeded complexity as former atomic updates have to distribute themselves now into two separate tables, complicating the application and the application transaction structure.

If the LOB content is an entire file then perhaps you should consider upgrade to SQL 2008 and using FILESTREAM.

Marcel answered 25/9, 2009 at 15:17 Comment(5)

Thanks! This is exactly the info I was looking for. It's clear now. But I wonder... Perhaps the 1-1 separation method goes back from the days of SQL Server 2000 when varbinary(max) didn't yet exist in the form it does in SQL Server 2005, and text, ntext and image were used for storing binary files? Why else would people have that idea? Or is it really only a principle to prevent people to make the SELECT * mistake? But they can still use a join :) – Ahmadahmar 25/9, 2009 at 15:54

+1 Remus, I forgot about this 2005 and up feature. Regarding FILESTREAM in 2008, I wouldn't rush using this for any respectably sized repository. At the cost of having to maintain the referential integrity of the "header data" in relation to a plain file-system repository, this type of approach does offer several operational advantages in terms of scaling, backup, mirroring, TCO... – Winstead 25/9, 2009 at 16:58

@mjv: FILESTREAM has some advantages if the file has to be accessed from a Win32 API (eg. document integration like SharePoint). Backup/Restore contain filestream content, but is true that it does not work with mirroring. As any new feature, the tricks of how to scale it up will take a while to percolate into common knowledge, but I know it can be scaled. On TCO, I'm note sure I understant what cost are you reffering to. Deployment setup cost and training perhaps? – Marcel 25/9, 2009 at 17:15

@Dragoljub: Frankly, I'm not sure how that advice about separation of LOBs got started. Pre 2005 the text/ntext/image fields were still stored outside the table (a mess of a hairball storage, but not part of the clustered leaf pages, so no part of a scan/seek). I can only think at DBAs trying to avoid the effect of * in projection lists. – Marcel 25/9, 2009 at 17:20

Your GIF image link looks dead. – Sledgehammer 15/6, 2012 at 0:42

There is no real logical advantage to this two-tables design, since the relationship is 1-1, you might have all the info bundled in the FileInfo table. However, there are serious operational and performance advantages, in particular if your binary data is more than a few hundred bytes in size, on average.

EDIT: As pointed out by Remus Rusanu, on some DBMS implementations such as SQL2005, the large object types are transparently stored to a separate table, effectively alleviating the practical drawback of having big records. The introduction of this feature implicitly confirms the the [true] single table approach's weakness.

I merely scanned the SO posting referenced in this question. I generally thing that while that other posting makes a few valid points, such as intrinsic data integrity (since all CRUD actions on a given item are atomic), but on the whole, and unless of relatively atypical use cases (such as using the item table as a repository mostly queried for single items at a time), the performance advantage is with the two tables approach (whereby indexes on "header" table will be more efficient, queries that do not require the binary data will return much more quickly etc. etc.)

And the two tables approach has further benefits in case the design evolves to supply different types of binary objects in differnt context. For example, say these items are images (GIFs, JPGs etc.). At a later date you want to also provide a small preview version of these images (and/or a hi-resolution version), the choice of this being driven by the context (user preference, low band-width clients, subscriber vs. visitor etc.). In such a case not only are the operational issues associated with the single table approach made more acute, the model becomes more versatile.

Winstead answered 25/9, 2009 at 14:43 Comment(5)

I can only see a problem if you tend to use SELECT * habitually instead of SELECTing only the columns you need. – Fraudulent 25/9, 2009 at 14:46

Which is bad practice in itself. Fix for that is to stop using SELECT *, not change the database design. – Dru 25/9, 2009 at 14:58

One can argue that the single table design is good because the slow queries associated with [unnecessary] SELECT * will eventually force the developer to avoid the * practice. :-) I think however that my opinion favoring the two-tables approach holds independently of the SELECT * practice. – Winstead 25/9, 2009 at 15:12

I agree about SELECT *, and I never use it. I guess what I was aiming at is exactly related to indexes and reading of the other columns, would that consume less cpu and memory if we had the varbinary column in a separate table? Let's assume I don't include the varbinary column in the select statement. – Ahmadahmar 25/9, 2009 at 15:50

Tablescans would have to scan over the lob descriptors which might be a minor impact, for index scans that’s less of an issue (only uses less cache space) – Bloodsucker 28/2, 2018 at 6:25

It can help to separate IMAGE, (N)TEXT, (N)VARCHAR(max) and VARBINARY(max) columns out of wider tables purely for some restrictions of SQL Server.

For example before 2012 it was not possible to online rebuild a clustered table if it contained LOBs. On the other hand you might not care about those restrictions, so setting up the table like your data is related is the better thing to do.

In case you physically want to keep the LOB data out of the table allocation unit you still can set the "large value types out of row" table option.

Bloodsucker answered 6/3, 2017 at 20:8 Comment(0)

FileInfo

FileContent

Recommended topics

Hot tags