Binary_Checksum Vs HashBytes function
I have a complex query that uses the BINARY_CHECKSUM function heavily. While testing it with some sample data, I found that it returned the same checksum value for two distinct records. The test data I used is below:

SELECT BINARY_CHECKSUM(16   ,'EP30461105',1) AS BinaryCheckSumEx UNION ALL
SELECT BINARY_CHECKSUM(21   ,'EP30461155',1) AS BinaryCheckSumEx

Now I am trying to use the HASHBYTES function with the 'MD5' algorithm, which gives me far more confidence of getting unique values per record. What concerns me is that the current query uses the checksum value as the join key in my MERGE statements when looking for new records. Since HASHBYTES returns a VARBINARY data type, how much of a performance overhead should I expect when I replace the join condition with the HASHBYTES field?

SELECT HASHBYTES('MD5', CONCAT(Col1,Col2,Col3,Col4,..))

Moreover, I need to hash multiple columns, which requires an additional CONCAT call; will that add further overhead to the query?
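
To make the intent concrete, here is a rough sketch of the kind of MERGE I mean; the table and column names are placeholders, and the '|' delimiter is just one way to keep the concatenated values unambiguous:

-- Placeholder tables/columns, shown only to illustrate joining a MERGE
-- on a VARBINARY hash instead of an INT checksum.
MERGE dbo.TargetTable AS tgt
USING (
    SELECT Col1, Col2, Col3,
           HASHBYTES('MD5', CONCAT(Col1, '|', Col2, '|', Col3)) AS RowHash
    FROM dbo.SourceTable
) AS src
    ON tgt.RowHash = src.RowHash   -- varbinary join key instead of the old int checksum
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Col1, Col2, Col3, RowHash)
    VALUES (src.Col1, src.Col2, src.Col3, src.RowHash);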

Bennir answered 3/4, 2017 at 11:20 Comment(6)
Tag the DBMS you're using; that functionality is product-specific. – Urdar
A checksum is not a cryptographic hash and is not useful for uniquely identifying data. Also consider that the hash of concat('abc', 'd') and concat('a', 'bcd') will be the same. – Hegemony
Are you positive that the query without hashing - i.e. one that can be optimised and leverage indexes - is actually less efficient than hashing and comparing? – Hegemony
@AlexK. I am quite certain I will not run into a scenario where concatenating multiple columns produces the same value for distinct records. – Bennir
@AlexK. The main reason we are trying to use a hash or checksum value is to use it in a MERGE statement to find whether any new record (based on a group of fields in the target table) has been created. If I switch to HASHBYTES I would have to change the join-condition field from int to varbinary; how much will this impact the performance of my query? – Bennir
Instead of using concat(Column1, Column2), you should always use concat(Column1, '|', Column2). That way you don't have the problem of concat('abc', 'd') and concat('a', 'bcd') colliding. – Trek
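
A quick illustration of that delimiter suggestion (the values are made up); the separator keeps the concatenations, and therefore their hashes, distinct:

SELECT HASHBYTES('MD5', CONCAT('abc', 'd'))      AS NoDelimiter1,   -- 'abcd'
       HASHBYTES('MD5', CONCAT('a', 'bcd'))      AS NoDelimiter2,   -- 'abcd' -> same hash
       HASHBYTES('MD5', CONCAT('abc', '|', 'd')) AS WithDelimiter1, -- 'abc|d'
       HASHBYTES('MD5', CONCAT('a', '|', 'bcd')) AS WithDelimiter2  -- 'a|bcd' -> different hash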
Here are the options:

  1. Using index on hash as VARBINARY

  2. Using BINARY_CHECKSUM and CHECKSUM

    • It’s good, but the problem is that there is a high chance of checksum collisions; search for binary_checksum duplicates and you will see that many people have run into them.

However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

Source: https://msdn.microsoft.com/en-us/library/ms189788(v=SQL.100).aspx
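
For example, CHECKSUM honours the column collation, so under a case-insensitive collation two strings that differ only in case produce the same value (and the BINARY_CHECKSUM example in the question shows that genuinely different values can collide as well):

SELECT CHECKSUM('McCavity') AS Checksum1, -- same value as the next column
       CHECKSUM('Mccavity') AS Checksum2  -- under a case-insensitive collation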

  3. Casting HASHBYTES to BIGINT and having an index on that
    • It’s not a good idea

I would also be careful about converting the hashed value to BIGINT, given that BIGINT is only 8 bytes yet all hash algorithms -- even MD5 -- produce more than 8 bytes (MD5 = 16 bytes, SHA1 = 20, SHA2_256 = 32, and SHA2_512 = 64). Converting binary values larger than 8 bytes to BIGINT silently truncates them, so you lose accuracy and increase the rate of false positives. The following query shows this behavior:

SELECT CONVERT(BIGINT, 0xFFFFFFFFFFFFFF),      --  7 bytes = 72057594037927935
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFF),    --  8 bytes = -1
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFF),  --  9 bytes = -1
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFFFF) -- 10 bytes = -1

Source: https://dba.stackexchange.com/questions/154945/index-maintenance-for-varbinary
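
Applied to an actual hash (the input string here is arbitrary), that same silent truncation means only 8 of MD5's 16 bytes take part in any comparison:

SELECT HASHBYTES('MD5', 'some value')                  AS FullMd5Hash,       -- 16 bytes
       CONVERT(BIGINT, HASHBYTES('MD5', 'some value')) AS TruncatedToBigint  -- only 8 bytes survive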

  4. Casting HASHBYTES to VARCHAR and having an index on that
    • This is the best choice
    • You have two options:

a) If you're using SQL 2008 or above

SELECT CONVERT(NVARCHAR(32),HashBytes('MD5', CONTENT),2)

b) If you're using SQL 2005

SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', CONTENT)), 3, 32)

PS: If you wonder which Hash Algorithm you should use:

MD5 = 16 bytes
SHA1 = 20 bytes
SHA2_256 = 32 bytes
SHA2_512 = 64 bytes


Source: https://blogs.msdn.microsoft.com/sqlsecurity/2011/08/26/data-hashing-in-sql-server/

For your second question: make the hash a PERSISTED computed column, so it is not recomputed every time a query runs.
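
A minimal sketch of that approach, again with hypothetical table and column names (the CONVERT to VARBINARY(16) assumes MD5, whose output is always 16 bytes):

-- Hypothetical table/columns; HASHBYTES and CONCAT are deterministic,
-- so the computed column can be PERSISTED and indexed.
ALTER TABLE dbo.TargetTable
    ADD RowHash AS CONVERT(VARBINARY(16),
        HASHBYTES('MD5', CONCAT(Col1, '|', Col2, '|', Col3))) PERSISTED;

CREATE INDEX IX_TargetTable_RowHash ON dbo.TargetTable (RowHash);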

Applicant answered 20/6, 2017 at 14:41 Comment(6)
Regarding #1 (using an index on the hash as VARBINARY): what if you use BINARY instead? HASHBYTES returns a fixed-size result, so BINARY(16) is enough for an MD5 hash. Wouldn't that perform better than any of the options listed above? – Argument
The source for the 2nd point is for the CHECKSUM function. Shouldn't it link to BINARY_CHECKSUM instead? The caveat you mention about the "high chance of duplication" may also need to be updated accordingly. – Theatricals
@user2286046 Hi, thanks for your comment. The same logic applies to BINARY_CHECKSUM; just search for binary_checksum duplicates. I have updated the 2nd option to clarify. – Applicant
Correct me if I'm wrong, but if you are referring to this article, I think they are wrong. They used BINARY_CHECKSUM on an nvarchar type, but the official documentation clearly states: BINARY_CHECKSUM supports any length of type varbinary(max) and up to 255 characters of type nvarchar(max). Thus, BINARY_CHECKSUM is a valid option for varbinary. – Theatricals
@user2286046 Based on the link that you sent: However, this change isn't guaranteed, and so to detect whether values have changed, we recommend use of BINARY_CHECKSUM only if your application can tolerate an occasional missed change. – Applicant
You are right. I'm sorry for giving misinformation. – Theatricals
