I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet
contains mostly strings with the same.
Does String.GetHashCode consider the full string or only part of it?
Asked Answered
Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs
public override int GetHashCode() {
#if FEATURE_RANDOMIZED_STRING_HASHING
if(HashHelpers.s_UseRandomizedStringHashing)
{
return InternalMarvin32HashString(this, this.Length, 0);
}
#endif // FEATURE_RANDOMIZED_STRING_HASHING
unsafe {
fixed (char *src = this) {
Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");
#if WIN32
int hash1 = (5381<<16) + 5381;
#else
int hash1 = 5381;
#endif
int hash2 = hash1;
#if WIN32
// 32 bit machines.
int* pint = (int *)src;
int len = this.Length;
while (len > 2)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
pint += 2;
len -= 4;
}
if (len > 0)
{
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
}
#else
int c;
char *s = src;
while ((c = s[0]) != 0) {
hash1 = ((hash1 << 5) + hash1) ^ c;
c = s[1];
if (c == 0)
break;
hash2 = ((hash2 << 5) + hash2) ^ c;
s += 2;
}
#endif
#if DEBUG
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
#endif
return hash1 + (hash2 * 1566083941);
}
}
}
This is possibly more than you bargained for, I'll annotate the code a bit:
- The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
- The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
- The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
- The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
- The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
- The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
- The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
- The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
- The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.
"it terminates early on an embedded zero" - That is bizarre. I tested it and sure enough the 64bit version ignores chars after a \0 (32bit doesn't). And since NULL is a valid unicode character, this is technically a bug IMO. –
Idaidae
Logged on microsoft connect ~ connect.microsoft.com/VisualStudio/feedback/details/817902/… –
Idaidae
@locster the bug was closed with "as By Design". Has anyone found an explanation why the 64-bit version doesn't hash characters after \0? –
Potvaliant
If MS claim it was by design then they need to give an explanation, otherwise the reasonable assumption is that it is an error but perhaps one they can no longer fix without risks. In live systems there is a .Net config setting to specify use of an alternative hashing scheme, hence there is a simple workaround. Still seems dumb to not address this given that's it's such a key class in the system, and 64 bit is becoming the norm. –
Idaidae
"designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption" It's not going to help you a lot there, a bit but not a lot. It would help if you were hashing strings that came direct from user-input leaving you open to hash-dos attacks. –
Scarecrow
It seems to me that they used \0 to terminate because they don't have exact length stored anywhere while outside of WIN32 block. So they went for the null termination. String class' fields may differ for platforms. Just a guess. –
Competition
How come is not computed only once since string is immutable? –
Mandamandaean
The 64-bit version of the loop is not unrolled, in the usual sense. The two stages operate on
hash1
and hash2
respectively, which are then combined when the loop exits. The zero check is to test for the end of the null-terminated string (in odd positions), as is done in the while
condition (for even positions). –
Kershner @HansPassant You could store the hash code basically anywhere in the string object. Before the length field, between the length field and the string data or even after the string data. –
Elin
hash code could be stored in the string class itself? or maybe its impossible to change this implementation because of backwards compatibility? –
Etheleneethelin
@JonHanna: To my knowledge, that's the actual reason it exists; passwords and encryption "benefits" are accidental, the real goal is ensuring that web-facing applications can't be attacked with inputs (that are stored in a dictionary) that will reliably collide on the hash code (changing
O(1)
optimal and normal case operations into O(n)
worst case performance for every lookup, insertion, etc.). –
Thi Just out of the topic....the while loop confirms that GetHashCode() is a linear operation. –
Contribution
Examining the source code (courtesy of ILSpy), we can see that it does iterate over the length of the string.
// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
IntPtr arg_0F_0;
IntPtr expr_06 = arg_0F_0 = this;
if (expr_06 != 0)
{
arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
}
char* ptr = arg_0F_0;
int num = 352654597;
int num2 = num;
int* ptr2 = (int*)ptr;
for (int i = this.Length; i > 0; i -= 4)
{
num = ((num << 5) + num + (num >> 27) ^ *ptr2);
if (i <= 2)
{
break;
}
num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
ptr2 += (IntPtr)8 / 4;
}
return num + num2 * 1566083941;
}
yeah, I saw that, but I have no clue what it does :o –
Gerhard
wait. On second reading, it seems that it is different from the code in my ILSpy. Mine doesn't have the for loop over Length. Maybe it is implemented differently in different platform –
Gerhard
Um, it hashes the string. You did say you wanted to know what it does, so there it is. The string hash algorithm is different for different versions of the CLR. –
Groove
@LouisRhys - that was the one from .NET 2.0 (since I already had that loaded in ILSpy). I've replaced it with the one from .NET 4.0 - looks very similar. –
Nguyen
© 2022 - 2024 — McMap. All rights reserved.
"".GetHashCode() == 371857150
. Is that the same for everyone? – Rencontre"".GetHashCode()
– Bandstand