TSQL md5 hash different to C# .NET md5
Asked Answered
C

4

22

I've generated an md5 hash as below:

DECLARE @varchar varchar(400) 

SET @varchar = 'è'

SELECT CONVERT(VARCHAR(2000), HASHBYTES( 'MD5', @varchar ), 2)

Which outputs:

785D512BE4316D578E6650613B45E934

However generating an MD5 hash using:

System.Text.Encoding.UTF8.GetBytes("è")

generates:

0a35e149dbbb2d10d744bf675c7744b1

The encoding in the C# .NET method is set to UTF8 and I had assumed that varchar was also UTF8, any ideas on what I'm doing wrong?

Coaction answered 12/1, 2015 at 18:29 Comment(5)
Why did you make that assumption? What is the difference between varchar and nvarchar? You can use .NET code in SQL Server by using a CLR SQL Server User-Defined Function.Triggerhappy
@AndrewMorton Many thanks for the CLR SQL suggestion, just brilliant! I think I remember reading it on a forum somewhere, my bad! :-)Coaction
@AndrewMorton and Imran, the problem isn't the SQL Server handling of any of this. It is an issue of not using the correct Encoding on the .NET side. I explain in detail in my answer. Using SQLCLR is not going to help here as all string data in the SQLCLR API is NVARCHAR, no VARCHAR. However, if you want to play with it without coding anything, you can get the free SQL# library (which I wrote) and check out the Util_Hash and Util_HashBinary functions. These will equate: SELECT SQL#.Util_HashBinary('md5', CONVERT(VARBINARY(50), 'è')), HASHBYTES('MD5', 'è');Thermosphere
@srutzky Sorry for the late response, thank you so much for such an amazing answer, I really appreciate it. It's people like you who make the internet an amazing place!Coaction
see also #35200952Murphree
T
47

If you are dealing with NVARCHAR / NCHAR data (which is stored as UTF-16 Little Endian), then you would use the Unicode encoding, not BigEndianUnicode. In .NET, UTF-16 is called Unicode while other Unicode encodings are referred to by their actual names: UTF7, UTF8, and UTF32. Hence, Unicode by itself is Little Endian as opposed to BigEndianUnicode. UPDATE: Please see the section at the end regarding UCS-2 and Supplementary Characters.

On the database side:

SELECT HASHBYTES('MD5', N'è') AS [HashBytesNVARCHAR]
-- FAC02CD988801F0495D35611223782CF

On the .NET side:

System.Text.Encoding.ASCII.GetBytes("è")
// D1457B72C3FB323A2671125AEF3EAB5D

System.Text.Encoding.UTF7.GetBytes("è")
// F63A0999FE759C5054613DDE20346193

System.Text.Encoding.UTF8.GetBytes("è")
// 0A35E149DBBB2D10D744BF675C7744B1

System.Text.Encoding.UTF32.GetBytes("è")
// 86D29922AC56CF022B639187828137F8

System.Text.Encoding.BigEndianUnicode.GetBytes("è")
// 407256AC97E4C5AEBCA825DEB3D2E89C

System.Text.Encoding.Unicode.GetBytes("è")  // this one matches HASHBYTES('MD5', N'è')
// FAC02CD988801F0495D35611223782CF

However, this question pertains to VARCHAR / CHAR data, which is ASCII, and so things are a bit more complicated.

On the database side:

SELECT HASHBYTES('MD5', 'è') AS [HashBytesVARCHAR]
-- 785D512BE4316D578E6650613B45E934

We already see the .NET side above. From those hashed values there should be two questions:

  • Why don't any of them match the HASHBYTES value?
  • Why does the "sqlteam.com" article linked in @Eric J.'s answer show that three of them (ASCII, UTF7, and UTF8) all match the HASHBYTES value?

There is one answer that covers both questions: Code Pages. The test done in the "sqlteam" article used "safe" ASCII characters that are in the 0 - 127 range (in terms of the int / decimal value) that do not vary between Code Pages. But the 128 - 255 range -- where we find the "è" character -- is the Extended set that does vary by Code Page (which makes sense as this is the reason for having Code Pages).

Now try:

SELECT HASHBYTES('MD5', 'è' COLLATE SQL_Latin1_General_CP1255_CI_AS) AS [HashBytes]
-- D1457B72C3FB323A2671125AEF3EAB5D

That matches the ASCII hashed value (and again, because the "sqlteam" article / test used values in the 0 - 127 range, they did not see any changes when using COLLATE). Great, now we finally found a way to match VARCHAR / CHAR data. All good?

Well, not really. Let's take a look-see at what we were actually hashing:

SELECT 'è' AS [TheChar],
       ASCII('è') AS [TheASCIIvalue],
       'è' COLLATE SQL_Latin1_General_CP1255_CI_AS AS [CharCP1255],
       ASCII('è' COLLATE SQL_Latin1_General_CP1255_CI_AS) AS [TheASCIIvalueCP1255];

Returns:

TheChar TheASCIIvalue   CharCP1255  TheASCIIvalueCP1255
è       232             ?           63

A ? ? Just to verify, run:

SELECT CHAR(63) AS [WhatIs63?];
-- ?

Ah, so Code Page 1255 doesn't have the è character, so it gets translated as everyone's favorite ?. But then why did that match the MD5 hashed value in .NET when using the ASCII encoding? Could it be that we weren't actually matching the hashed value of è, but instead were matching the hashed value of ?:

SELECT HASHBYTES('MD5', '?') AS [HashBytesVARCHAR]
-- 0xD1457B72C3FB323A2671125AEF3EAB5D

Yup. The true ASCII character set is just the first 128 characters (values 0 - 127). And as we just saw, the è is 232. So, using the ASCII encoding in .NET is not that helpful. Nor was using COLLATE on the T-SQL side.

Is it possible to get a better encoding on the .NET side? Yes, by using Encoding.GetEncoding(Int32), which allows for specifying the Code Page. The Code Page to use can be discovered using the following query (use sys.columns when working with a column instead of a literal or variable):

SELECT sd.[collation_name],
       COLLATIONPROPERTY(sd.[collation_name], 'CodePage') AS [CodePage]
FROM   sys.databases sd
WHERE  sd.[name] = DB_NAME(); -- replace function with N'{db_name}' if not running in the DB

The query above returns (for me):

Latin1_General_100_CI_AS_SC    1252

So, let's try Code Page 1252:

System.Text.Encoding.GetEncoding(1252).GetBytes("è") // Matches HASHBYTES('MD5', 'è')
// 785D512BE4316D578E6650613B45E934

Woo hoo! We have a match for VARCHAR data that uses our default SQL Server collation :). Of course, if the data is coming from a database or field set to a different collation, then GetEncoding(1252) might not work and you will have to find the actual matching Code Page using the query shown above (a Code Page is used across many Collations, so a different Collation does not necessarily imply a different Code Page).

To see what the possible Code Page values are, and what culture / locale they pertain to, please see the list of Code Pages here (list is in the "Remarks" section).


Additional info related to what is actually stored in NVARCHAR / NCHAR fields:

Any UTF-16 character (2 or 4 bytes) can be stored, though the default behavior of the built-in functions assumes that all characters are UCS-2 (2 bytes each), which is a subset of UTF-16. Starting in SQL Server 2012, it is possible to access a set of Windows collations that support the 4 byte characters known as Supplementary Characters. Using one of these Windows collations ending in _SC, either specified for a column or directly in a query, will allow the built-in functions to properly handle the 4 byte characters.

-- The database's collation is set to: SQL_Latin1_General_CP1_CI_AS
SELECT  N'𨝫' AS [SupplementaryCharacter],
        LEN(N'𨝫') AS [LEN],
        DATALENGTH(N'𨝫') AS [DATALENGTH],
        UNICODE(N'𨝫') AS [UNICODE],
        LEFT(N'𨝫', 1) AS [LEFT],
        HASHBYTES('MD5', N'𨝫') AS [HASHBYTES];

SELECT  N'𨝫' AS [SupplementaryCharacter],
        LEN(N'𨝫' COLLATE Latin1_General_100_CI_AS_SC) AS [LEN],
        DATALENGTH(N'𨝫' COLLATE Latin1_General_100_CI_AS_SC) AS [DATALENGTH],
        UNICODE(N'𨝫' COLLATE Latin1_General_100_CI_AS_SC) AS [UNICODE],
        LEFT(N'𨝫' COLLATE Latin1_General_100_CI_AS_SC, 1) AS [LEFT],
        HASHBYTES('MD5', N'𨝫' COLLATE Latin1_General_100_CI_AS_SC) AS [HASHBYTES];

Returns:

SupplementaryChar   LEN   DATALENGTH   UNICODE   LEFT   HASHBYTES
𨝫                  2     4             55393    �     0x7A04F43DA81E3150F539C6B99F4B8FA9
𨝫                  1     4            165739    𨝫     0x7A04F43DA81E3150F539C6B99F4B8FA9

As you can see, neither DATALENGTH nor HASHBYTES are affected. For more information, please see the MSDN page for Collation and Unicode Support (specifically the "Supplementary Characters" section).

Thermosphere answered 12/1, 2015 at 22:53 Comment(4)
I was going to close this question as a duplicate of one I answered months ago but this answer is so much better than what I answered in the other question I closed that old question as a duplicate of this :)Uraninite
I had about 4 questions and this answer tackled every single one of them! Fantabulous job!Jocosity
WOW, absolutly amazing response.Drugge
Amazing response! But, what the hell?? Can we just own up to some poor decisions back in the 1960's, destroy the tower of Babel, and rebuild from scratch? IMHO, every string should have an encoding specified, and to make that not become a pain in the ass, there should be a clear system table providing code page info, etc. It's amazing that @SolomonRutzky knows all this and explains it so clearly, but the rest of us should be able to get some work done without his encyclopedic knowledge.Commemorative
X
3

SQL Server uses UCS-2 rather than UTF-8 to encode character data.

If you were using an NVarChar field, the following would work:

System.Text.Encoding.Unicode.GetBytes("è"); // Updated per @srutzky's comments

For more information on SQL and C# hashing, see

http://weblogs.sqlteam.com/mladenp/archive/2009/04/28/Comparing-SQL-Server-HASHBYTES-function-and-.Net-hashing.aspx

Xiaoximena answered 12/1, 2015 at 18:35 Comment(5)
many thanks Eric, is it possible to get an equivalent of the the C# hash in SQL instead?Coaction
I don't see a way, but have a look at the linked article.Xiaoximena
@ImranAzad and Eric, this information is not correct. Please see my answer for details.Thermosphere
@ImranAzad: I updated my answer to amend incorrect information and upvoted yours. Any idea why BigEndianUnicode seemed to resolve https://mcmap.net/q/169693/-c-big-endian-ucs-2 ?Xiaoximena
BigEndianUnicode worked there because there are two types of UTF-16 and UTF-32: BigEndian and LittleEndian. SQL Server and .NET default to LittleEndian (which they call "Unicode" instead of "UTF-16") and then have the other with the specific name of "BigEndianUnicode" (see: msdn.microsoft.com/en-us/library/system.text.encoding.aspx). But the data in the other question was coming from JAVA, which defaults to BigEndian (which they call "UTF-16") and then has the other with the specific name of "UTF-16LE" (see: #20967302).Thermosphere
B
0

I was having the same issue, and as @srutzky comments, what might be happening is that I didn't preceed the query with a capital-N, and I was getting an 8-bit Extended ASCII ( VARCHAR / string not prefixed with capital-N ) instead of a 16-bit UTF-16 Little Endian ( NVARCHAR / string prefixed with capital-N )

{Id, UserName, PasswordString, PasswordHashed}

If you do:

SELECT TOP 1 CONVERT(char(32),HashBytes('MD5', 'abc123'),2) FROM [Users]

It will output: E99A18C428CB38D5F260853678922E03

But if you do this, having the same password ('abc123'):

SELECT CONVERT(char(32),HashBytes('MD5', [PasswordString]),2) FROM [Users]

It will output: 6E9B3A7620AAF77F362775150977EEB8

What I should have done is:

SELECT CONVERT(char(32),HashBytes('MD5', N'abc123'),2) FROM [Users]

That outputs the same result: 6E9B3A7620AAF77F362775150977EEB8

Bookmobile answered 23/9, 2016 at 6:32 Comment(2)
Regarding "by default, a query in management studio runs in the default collation, that might differ from your table collation": This is absolutely false. SSMS has nothing to do with collations. Besides, very few collations, if any, differ for the characters in abc123. What you are seeing is the difference between 8-bit Extended ASCII ( VARCHAR / string not prefixed with capital-N ) and 16-bit UTF-16 Little Endian ( NVARCHAR / string prefixed with capital-N ). Run the following to see: SELECT HASHBYTES('MD5', 'abc123'), HASHBYTES('MD5', N'abc123');. Your column is NVARCHAR.Thermosphere
You're right, I've updated the answer with your input. Thanks.Bookmobile
S
0

sql server hashbytes allways works like System.Text.Encoding.Unicode on unicode characters like arabic persian ,... if you use Utf8.Unicode Or Ascii.Unicode You will see the diffrence and if you use Utf8.Unicode the return result of sql server and c# will be same

Sapanwood answered 19/6, 2017 at 8:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.