Hashing more than 8000 bytes in SQL Server

SQL Server's hashing function HASHBYTES has an input limit of 8000 bytes (the limit applies to SQL Server 2014 and earlier; SQL Server 2016 removed it).

How do you hash larger strings?

Houseman answered 14/10, 2011 at 15:54 Comment(0)

You could hash 8k (or 4k or 2k) chunks of the input and then either concatenate those hashes or hash them into a new hash value. This might get difficult though if you have to create a similar algorithm (in an external .NET app for example) to compare hashes created outside of SQL Server.
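To make the chunking idea concrete, here is a hypothetical client-side sketch of the "hash of hashes" scheme in Python; the 8000-byte chunk size mirrors the HASHBYTES limit, and the algorithm name is an assumption. Any T-SQL counterpart would have to split the input at exactly the same boundaries for the results to match. Note that the output is not the plain SHA-1 of the whole input, which matters if you need to compare against hashes computed elsewhere.

```python
# Hypothetical sketch of the chunk-then-hash-the-hashes scheme; chunk
# size and algorithm are assumptions, not SQL Server requirements.
import hashlib

CHUNK_SIZE = 8000  # mirrors the HASHBYTES 8000-byte input limit


def big_hash(data: bytes, algorithm: str = "sha1") -> bytes:
    """Hash each 8000-byte chunk, then hash the concatenated digests."""
    outer = hashlib.new(algorithm)
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        # Feed the chunk's digest (not the raw chunk) into the outer hash
        outer.update(hashlib.new(algorithm, chunk).digest())
    return outer.digest()
```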

Another option: Lean on SQL Server's CLR integration and perform the hashing in a .NET assembly.

Pursuance answered 14/10, 2011 at 15:59 Comment(3)
Thanks Paul, it's a really good solution for internal use, but in this case it has to be fully compliant with SHA-1. – Houseman
@SDReyes: Another idea added to my answer: lean on SQL Server's CLR integration and perform the hashing in a .NET assembly. – Pursuance
The CLR integration link in this answer is broken. – Pillbox

You could write a SQL CLR function:

using System.Data.SqlTypes;
using System.Security.Cryptography;
using System.Text;

[Microsoft.SqlServer.Server.SqlFunction]
public static SqlBinary BigHashBytes(SqlString algorithm, SqlString data)
{
    // Algorithm name is e.g. "MD5", "SHA1", or "SHA256"
    var algo = HashAlgorithm.Create(algorithm.Value);

    // Encodes the input as UTF-8; hashes of NVARCHAR data computed
    // elsewhere as UTF-16 will not match (see comments below)
    var bytes = Encoding.UTF8.GetBytes(data.Value);

    return new SqlBinary(algo.ComputeHash(bytes));
}

And then it can be called in SQL like this:

--these return the same value
select HASHBYTES('md5', 'test stuff')
select dbo.BigHashBytes('md5', 'test stuff')

BigHashBytes is only necessary when the input is longer than 8000 bytes; below that, HASHBYTES works directly.

Marelda answered 17/2, 2012 at 11:49 Comment(3)
Careful: CLR SP parameters can be silently truncated to 8000 bytes. I had to tag the parameter with [SqlFacet(MaxSize = -1)], otherwise bytes after the 8000th would simply be ignored! Took me a while to find that one. – Flagitious
-1 for misleading readers (unintentionally, of course) into using UTF-8. SQL Server (and Windows in general) uses UTF-16. Hence this will appear to work for many people who are currently only using code points 0 to 127, or possibly up to 255. But using UTF-8 leads to problems such as this question: SQL CLR function based on .NET ComputeHash is not working with Cyrillic. – Oldworld
@Flagitious Your experience was due to how older versions of SSDT generated the T-SQL wrapper objects. The default used to be NVARCHAR(4000) for SqlString and NVARCHAR(MAX) for SqlChars, but since around the time of Visual Studio 2013, the default was changed to NVARCHAR(MAX) for both. Still, it is always better to be explicit and use [SqlFacet()] as you mention, but folks using the newer SSDT versions wouldn't run into that. Also, SQLCLR only supports NVARCHAR, so it would actually truncate at 4000 characters :-). – Oldworld

Along the lines of Paul's idea: if you chunk the input, you could store the result in an XML column, with each chunk's hash as a separate XML element.

Nonfulfillment answered 14/10, 2011 at 16:2 Comment(1)
+1 Good idea, though for data integrity's sake I would suggest storing a single hash value (a hash of hashes). With an XML structure you would need an extra attribute mapping each hash value to the index of the Nth chunk of data that was hashed. – Pursuance
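The per-chunk layout this comment describes can be sketched client-side. This is a hypothetical illustration (function names, chunk size, and algorithm are assumptions): each chunk's hash is keyed by its index, mirroring the proposed XML attribute, and a single hash of the concatenated digests serves as the integrity check.

```python
# Hypothetical sketch of the comment's layout: per-chunk hashes keyed by
# chunk index, plus one "hash of hashes" for an integrity check.
# Chunk size of 8000 mirrors the HASHBYTES input limit.
import hashlib


def chunk_hashes(data: bytes, chunk_size: int = 8000) -> dict[int, bytes]:
    """Map each chunk's index to the SHA-1 digest of that chunk."""
    return {
        i: hashlib.sha1(data[off:off + chunk_size]).digest()
        for i, off in enumerate(range(0, len(data), chunk_size))
    }


def hash_of_hashes(hashes: dict[int, bytes]) -> bytes:
    """Single integrity value: hash the digests in chunk-index order."""
    combined = b"".join(hashes[i] for i in sorted(hashes))
    return hashlib.sha1(combined).digest()
```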
