Here are the options:
Using index on hash as VARBINARY
Using BINARY_CHECKSUM and CHECKSUM
- It’s good but the problem is there is a high chance of duplication in checksum and as you google you see that many people have a problem with it.
However, there is a small chance that the checksum will not change.
For this reason, we do not recommend using CHECKSUM to detect whether
values have changed unless your application can tolerate occasionally
missing a change. Consider using HashBytes instead. When an MD5 hash
algorithm is specified, the probability of HashBytes returning the
same result for two different inputs is much lower than that of
CHECKSUM.
Source: https://msdn.microsoft.com/en-us/library/ms189788(v=SQL.100).aspx
- Casting HASBYTES to BIGINT and having index on that
I would also be careful about converting the hashed value to BIGINT
given that BIGINT is only 8 bytes yet all hash algorithms -- even MD5
-- are greater than 8 bytes (MD5 = 16 bytes, SHA1 = 20, SHA2_256 = 32, and SHA2_512 = 64). And converting binary values larger than 8 bytes
to BIGINTsilently truncates the values. Hence you lose accuracy and
increasing occurrences of false positives. The following query shows
this behavior:
SELECT CONVERT(BIGINT, 0xFFFFFFFFFFFFFF), -- 7 bytes = 72057594037927935
CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFF), -- 8 bytes = -1
CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFF), -- 9 bytes = -1
CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFFFF) -- 10 bytes = -1
Source: https://dba.stackexchange.com/questions/154945/index-maintenance-for-varbinary
- Casting HASHBYTES to VARCHAR and having index on that
- This is the good choice
- You have two options:
a) If you're using SQL 2008 or above
SELECT CONVERT(NVARCHAR(32),HashBytes('MD5', CONTENT),2)
b) If you're using SQL 2005
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', CONTENT)), 3, 32)
PS: If you wonder which Hash Algorithm you should use:
MD5 = 16 bytes
SHA1 = 20 bytes
SHA2_256 = 32 bytes
SHA2_512 = 64 bytes
Source: https://blogs.msdn.microsoft.com/sqlsecurity/2011/08/26/data-hashing-in-sql-server/
For your second question, you should make Hash columns PERSISTED, to avoid the impact on running each query.
concat('abc', 'd')
&concat('a', 'bcd')
will be the same. – Hegemony