SQL Server: Clustering by timestamp; pros/cons
Asked Answered
A

4

5

I have a table in SQL Server, where i want inserts to be added to the end of the table (as opposed to a clustering key that would cause them to be inserted in the middle). This means I want the table clustered by some column that will constantly increase.

This could be achieved by clustering on a datetime column:

CREATE TABLE Things (
    ...
    CreatedDate datetime DEFAULT getdate(),
    [timestamp] timestamp,        

    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (CreatedDate)
)

But I can't guaranteed that two Things won't have the same time. So my requirements can't really be achieved by a datetime column.

I could add a dummy identity int column, and cluster on that:

CREATE TABLE Things (
    ...
    RowID int IDENTITY(1,1),
    [timestamp] timestamp,        

    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (RowID)
)

But you'll notice that my table already constains a timestamp column; a column which is guaranteed to be a monotonically increasing. This is exactly the characteristic I want for a candidate cluster key.

So I cluster the table on the rowversion (aka timestamp) column:

CREATE TABLE Things (
    ...
    [timestamp] timestamp,        

    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (timestamp)
)

Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.

What I'm looking for are thoughts of why this is a bad idea; and what other ideas are better.

Note: Community wiki, since the answers are subjective.

Athamas answered 21/4, 2010 at 17:20 Comment(7)
A timestamp would be a really bad idea as it is updated whenever the row is updated!Lagerkvist
Is this MS SQL? If so, clustered indexes don't have to be unique so just drop that from the constraint.Decode
@Martin Smith: Oh damn, i forgot about that! You should put that in an answer, so i can accept it as a "oh god don't do this".Athamas
@Chris Hass: If the clutsered index isn't marked as unique, SQL Server will add another 4-byte "uniqueifier". There's no point in wasting another 4-bytes per row, when the 8-byte timestamp already is unique.Athamas
@Ian Boyd, that's only if you don't have a unique index declared elsewhere, right?Decode
Yes. Basically SQL Server needs a short primary key for references from secondary indices.Califate
@Chris Haas If the clustered index is not unique, SQL Server will make it unique by adding another 4-bytes to it. The clustered index uniquely points to a specific row. Other indexes use the clustered index to point to a specific row. One way or another the clustered index has to be unique.Athamas
T
10

So I cluster the table on the rowversion (aka timestamp) column: Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.

That might sound like a good idea at first - but it's really almost the worst option you have. Why?

The main requirements for a clustered key are (see Kim Tripp's blog post for more excellent details):

  • stable
  • narrow
  • unique
  • ever-increasing if possible

Your rowversion violates the stable requirement, and that's probably the most important one. The rowversion of a row changes with each modification to the row - and since your clustering key is being added to each and every non-clustered index in the table, your server will be constantly updating loads of non-clustered indices and wasting a lot of time doing so.

In the end, adding a dummy identity column probably is a much better alternative for your case. The second best choice would be the datetime column - but here, you do run the risk of SQL Server having to add "uniqueifiers" to your entries when duplicates occur - and with a 3.33ms accuracy, this could definitely be happening - not optimal, but definitely much better than the rowversion idea...

Thrift answered 21/4, 2010 at 17:20 Comment(1)
+1 for the answer. And your answer should have another +1 for the opening sentence; it's really perfectly worded to draw in the reader. And it's tactful and polite.Athamas
A
2

from the link: timestamp in the question:

The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.

and

Duplicate rowversion values can be generated by using the SELECT INTO statement in which a rowversion column is in the SELECT list. We do not recommend using rowversion in this manner.

so why on earth would you want to cluster by either, especially since their values alwsys change when the row is updated? just use an identity as the PK and cluster on it.

Arie answered 21/4, 2010 at 17:20 Comment(1)
rowversion is the new name for timestamp. i specifically decided to use the term timestamp since more people will know what i'm talking about. But the two are identical. And duplicate rowversions can only be inserted if you're inserting duplicate rowversions. Which in reality nobody ever does. In the end your final throw away statement is a valid reason: they change on updates, which would shuffle rows around.Athamas
J
1

You were on the right track already. You can use a DateTime column that holds the created date and create a CLUSTERED but non unique constraint.

CREATE TABLE Things (
    ...
    CreatedDate datetime DEFAULT getdate(),
    [timestamp] timestamp,        
)

CREATE CLUSTERED INDEX [IX_CreatedDate] ON .[Things] 
(
    [CreatedDate] ASC
)
Jacobinism answered 21/4, 2010 at 17:20 Comment(0)
C
0

If this table gets a lot of inserts, you might be creating a hot spot that interferes with updates, because all of the inserts will be happening on the same physical/index pages. Check your locking setup.

Centrist answered 21/4, 2010 at 17:20 Comment(3)
hotspots really aren't a big issue anymore, ever since SQL Server 7.0Thrift
Granted, it's not a big concern, but if he wants to force all of his inserts into one small part of the table it's something to be aware of.Centrist
The real point is that i don't want to cluster by a newid() uniqueidentifier. i don't want to cluster by other values for the same reason.Athamas

© 2022 - 2025 — McMap. All rights reserved.