CHECKSUM() collisions in SQL Server 2005

I've got a table of 5,651,744 rows, with a primary key made of 6 columns (int x 3, smallint, varchar(39), varchar(2)). I am looking to improve performance on this table and on another table which shares this primary key (plus one additional column) but has 37m rows.

In anticipation of adding a column to create the hash key, I did an analysis and found 18,733 collisions.

SELECT  SUM(CT)
FROM    (
         SELECT HASH_KEY
               ,COUNT(*) AS CT
         FROM   (
                 SELECT CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                                 GL_ACCT_NUM, ACCT_NUM, APPN_CD) AS HASH_KEY
                 FROM   CUST_ACCT_PRFTBLT
                ) AS X
         GROUP BY HASH_KEY
         HAVING COUNT(*) > 1
        ) AS Y

SELECT  COUNT(*)
FROM    CUST_ACCT_PRFTBLT

It's about twice as bad with BINARY_CHECKSUM()

Does this seem too high (.33%), given how small a fraction of the destination space I'm covering? And if the collisions are this high, is there a benefit in joining on this manufactured key first, at the cost of the extra 4 bytes per row, given that you still have to join on the regular columns to handle the occasional collision?
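
For comparison, here is a rough back-of-the-envelope estimate of what a uniformly distributed 32-bit hash would produce for this row count (a sketch only; the figures are approximate):

-- Birthday-style approximation for a uniform 32-bit hash over n rows:
-- expected colliding pairs ~ n * (n - 1) / (2 * 2^32), and roughly twice
-- that many rows end up in collision groups (4294967296.0 = 2^32).
DECLARE @n FLOAT
SET @n = 5651744

SELECT  @n * (@n - 1) / (2 * 4294967296.0)     AS EXPECTED_COLLIDING_PAIRS,   -- roughly 3,700
        2 * @n * (@n - 1) / (2 * 4294967296.0) AS APPROX_ROWS_IN_COLLISIONS   -- roughly 7,400

By that yardstick, the observed 18,733 colliding rows is roughly two to three times what a uniform 32-bit hash would be expected to produce.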

Clea answered 22/6, 2009 at 19:42 Comment(3)
How many records are you joining in one go? Does the detail table have a clustered index? How wide? If the clustered index is wide (ie, it includes all the FKs), can you drop it or replace it with an identity column?Leboeuf
Why is it a problem for you? What do you need to accomplish?Aerify
The problem is that I have 200m rows of derived statistics to produce from the 37m rows of statistics, and the PIVOT to do the calculations has to pivot on a very large key, which resulted in a nasty eager spool of all 37m rows to tempdb.Clea

I don't see where adding a checksum will get you anything with that level of collisions. Even one collision is too many, as it would cause you to join to the wrong data. If you can't guarantee that you are joining to the correct record, it is pointless to improve performance at the cost of data integrity. This appears to be financial data, so you had better be really sure that your queries won't return bad results. You could actually end up debiting or crediting the wrong accounts if there are any collisions.

If you do go this route, Marc is right that you should pre-compute if at all possible (adding a computation that has to happen on every record in multimillion-record tables is not likely to improve performance, in my experience). If you can use a precomputed column (and you'll need triggers to keep it up to date), then you may not need to join on all six of the other columns to ensure no collisions, and then you might possibly have improved performance. All you can do is test your theory. But be very sure you don't have any collisions.

Have you considered using a surrogate key and then a unique index on the six natural key fields instead? Then you could join on the surrogate key, which would likely improve performance a good bit. It can't be efficient to join on six columns (one a varchar) instead of one surrogate key. I realize from the size of the data that this might be harder to refactor than in a non-production system, but it really might be worth the downtime to permanently fix a persistent performance problem. Only you can say how complex a change this would be and how hard it would be to change all the sps or queries to a better join. However, it might be feasible to try.
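
A rough sketch of that refactor (untested; the surrogate column name and index name are illustrative only):

-- Add a narrow surrogate key to the header table; if the surrogate becomes the
-- new primary (clustering) key, the old six-column natural key can still be
-- enforced with a unique index.
ALTER TABLE CUST_ACCT_PRFTBLT
    ADD CUST_ACCT_PRFTBLT_ID INT IDENTITY(1, 1) NOT NULL

CREATE UNIQUE NONCLUSTERED INDEX UX_CUST_ACCT_PRFTBLT_NATURAL_KEY
    ON CUST_ACCT_PRFTBLT (DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                          GL_ACCT_NUM, ACCT_NUM, APPN_CD)

-- The 37m-row detail table would then carry CUST_ACCT_PRFTBLT_ID, and the join
-- becomes a single-column one:
-- ON detail.CUST_ACCT_PRFTBLT_ID = header.CUST_ACCT_PRFTBLT_ID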

Octans answered 22/6, 2009 at 20:52 Comment(4)
I would have to join on the surrogate AND all the PK columns too. The surrogate would need to be the first column in the index (which the optimizer would hopefully choose), but ALL columns would have to be joined. There is an example (just a seek, not a join) in this MSDN documentation: msdn.microsoft.com/en-us/library/ms189788(SQL.90).aspxClea
Why would you need to join on the surrogate key and the natural primary key columns? The surrogate key would need to be added to both tables, but you would use it instead of the 6 fields you are currently using in the join.Maladjustment
I see, a true unique surrogate instead of just a hash. Well, unfortunately, the legacy system I am re-engineering does not have RI, so there are actually entries in the 37m row stat table which have no entry in the 5m row PK table. I'll have to think on this.Clea
I ended up going with the true unique surrogate solution as a foreign key into a dimension-like table.Clea

What I've seen a lot of folks glossing over thus far is that CHECKSUM has a ton of collisions, by Microsoft's own admission. It's even worse than MD5, which has its fair share of meaningful collisions.

If you're looking to add a hash column, consider using HASHBYTES with SHA1 specified. SHA1 has far fewer meaningful collisions than MD5 or CHECKSUM. CHECKSUM should never be used to determine whether a row is unique; it is only a quick check on the fidelity of two values. With HASHBYTES, your collision rate should be 0%, unless you have duplicate rows (which, this being a PK, should never happen).

Keep in mind that HASHBYTES will truncate anything larger than 8000 bytes, but your PK is a lot less than that (all concatenated), so you shouldn't have any trouble.
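
A rough sketch of what that might look like as a persisted computed column (untested; the delimiter, the CONVERT lengths, the index name, and the assumption that the first four key columns are the numeric ones are all illustrative):

-- SHA1 over the concatenated key columns; the '|' delimiter keeps adjacent
-- values from running together. The CAST to VARBINARY(20) matches SHA1's
-- 20-byte output and keeps the index key small.
ALTER TABLE CUST_ACCT_PRFTBLT
    ADD HASH_KEY AS CAST(HASHBYTES('SHA1',
            CONVERT(VARCHAR(12), DATA_DT_ID)   + '|' +
            CONVERT(VARCHAR(12), BANK_NUM)     + '|' +
            CONVERT(VARCHAR(12), COST_CTR_NUM) + '|' +
            CONVERT(VARCHAR(12), GL_ACCT_NUM)  + '|' +
            ACCT_NUM + '|' + APPN_CD) AS VARBINARY(20)) PERSISTED
GO

CREATE NONCLUSTERED INDEX IX_CUST_ACCT_PRFTBLT_HASH_KEY
    ON CUST_ACCT_PRFTBLT (HASH_KEY)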

Hydrolytic answered 6/7, 2009 at 10:43 Comment(1)
I have refactored the schema to use a true unique surrogate in a dimension table and made this the primary key of the three tables. Performance is much improved.Clea

If your checksum gets it down to 0.33% of the data, then I'd argue that it is working fine... especially if you use this column in combination with other (indexed) columns.

Of course, to be effective as an index you probably want to compute and store this value when inserting/updating data, with a non-clustered index.

Of course, a regular spanning index over the columns in question may do just as well or better...
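
For example, a persisted, indexed checksum column might look something like this (untested; the index name is illustrative):

-- Persist the checksum so it is computed once at write time rather than per
-- query, then index it for seeks.
ALTER TABLE CUST_ACCT_PRFTBLT
    ADD HASH_KEY AS CHECKSUM(DATA_DT_ID, BANK_NUM, COST_CTR_NUM,
                             GL_ACCT_NUM, ACCT_NUM, APPN_CD) PERSISTED
GO

CREATE NONCLUSTERED INDEX IX_CUST_ACCT_PRFTBLT_HASH_KEY
    ON CUST_ACCT_PRFTBLT (HASH_KEY)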

Extravasate answered 22/6, 2009 at 19:45 Comment(1)
Yes, I was planning on using a persisted computed column.Clea

If your queries are selective and the line table clustered index is narrow or non-existent, then a non-clustered index on checksum in the line table should provide good performance.

After applying whatever criteria are present to the header table, it will use the checksum to perform an index seek on the non-clustered index. You still need to include the FKs in the join, but the non-checksum join criteria will be applied post-index seek, post-bookmark lookup. Very efficient.

You want to optimize for the index seek. The checksum is already highly selective. Adding the FKs would increase the index size and corresponding I/O, and wouldn't help unless it included enough other fields to avoid the bookmark lookup altogether.

Since the non-clustered index will contain the clustering keys or heap pointer, you want either a) a small clustering key (eg, an int identity column--4 byte pointer) or b) no clustered index at all (8 byte pointer).

If your queries are not selective, or if the line table clustered index is huge (the entire table minus a few columns) then I don't know if the checksum would help (faster index navigation, perhaps?). In any case you would want to make it a clustered or covering index, and if the header table isn't clustered on the checksum first, there will be much sorting.

If you can afford the storage and indexing costs, a few covering indexes--header and detail--may be the way to go.
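
A rough sketch of the detail-side covering index (this assumes the checksum column has also been added to the 37m-row table, whose name is taken from the comments; any measure columns the queries actually return would need to go into the INCLUDE list as well to avoid the bookmark lookup):

-- Checksum as the seek key, with the natural-key columns included so the
-- residual join predicates can be applied without a bookmark lookup.
CREATE NONCLUSTERED INDEX IX_CUST_ACCT_STAT_HASH_KEY
    ON CUST_ACCT_STAT (HASH_KEY)
    INCLUDE (DATA_DT_ID, BANK_NUM, COST_CTR_NUM, GL_ACCT_NUM, ACCT_NUM, APPN_CD)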

Leboeuf answered 24/6, 2009 at 3:24 Comment(0)

If your PRIMARY KEY is clustered, then each index you create will contain this PRIMARY KEY.

Joining on a hashed value will use the following steps:

  1. Locate the hashed value in the index key
  2. Locate the PRIMARY KEY value in the index data
  3. Use a Clustered Index Seek to locate the PRIMARY KEY row in the table

Joining on the PRIMARY KEY will use only step 3.

SQL Server, however, is smart enough to take this into account, and if you join like this:

SELECT  *
FROM    main_table mt
JOIN    CUST_ACCT_PRFTBLT cap
ON      cap.HASH_KEY = mt.HASH_KEY
        AND cap.DATA_DT_ID = mt.DATA_DT_ID
        AND …
WHERE   mt.some_col = @filter_value

then it will simply not use the index on HASH_KEY; instead, it will use a single Clustered Index Seek and a Filter to make sure the hash values match (and they always will).

Summary: just join on the PRIMARY KEY.

Using a secondary index, you'll first need to do a useless HASH_KEY search, and then still need to join on the PRIMARY KEY.

Nightrider answered 24/6, 2009 at 16:0 Comment(1)
Yes, I have avoided too massive a restructuring of this process during this re-engineering, but because the PK is so wide (and clustered), I think I might extract it and use a surrogate instead, in which case the hash is irrelevant. My main problem is that there end up being rows in CUST_ACCT_STAT which have no matching PK in CUST_ACCT_PRFTBLT due to bad RI in the original system, so I will need to infer rows for those as well.Clea
