Does String.GetHashCode consider the full string or only part of it?

Asked 2/3, 2013 at 12:29 Answered 2/3, 2013 at 16:10

Solved c#.net string performance gethashcode

I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.

Gerhard answered 2/3, 2013 at 12:29 Comment(4)

dotnetperls.com/gethashcode – Meaty 2/3, 2013 at 12:31

"it will have bad performance" - bad compared with what alternative? Obviously storing very, very long strings in a HashSet will be slower than storing short ones, but how often is that done? – Fled 2/3, 2013 at 12:40

On computer "".GetHashCode() == 371857150. Is that the same for everyone? – Rencontre 18/3, 2016 at 10:1

@ColonelPanic That's what it looks like from the code posted below, assuming everyone is using a release version of the runtime. csharppad.com produces the same value as you for "".GetHashCode() – Bandstand 14/10, 2016 at 19:17

102

Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

        public override int GetHashCode() { 

#if FEATURE_RANDOMIZED_STRING_HASHING
            if(HashHelpers.s_UseRandomizedStringHashing)
            { 
                return InternalMarvin32HashString(this, this.Length, 0);
            } 
#endif // FEATURE_RANDOMIZED_STRING_HASHING 

            unsafe { 
                fixed (char *src = this) {
                    Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
                    Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
                    int hash1 = (5381<<16) + 5381; 
#else 
                    int hash1 = 5381;
#endif 
                    int hash2 = hash1;

#if WIN32
                    // 32 bit machines. 
                    int* pint = (int *)src;
                    int len = this.Length; 
                    while (len > 2) 
                    {
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                        pint += 2;
                        len  -= 4;
                    } 

                    if (len > 0) 
                    { 
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                    } 
#else
                    int     c;
                    char *s = src;
                    while ((c = s[0]) != 0) { 
                        hash1 = ((hash1 << 5) + hash1) ^ c;
                        c = s[1]; 
                        if (c == 0) 
                            break;
                        hash2 = ((hash2 << 5) + hash2) ^ c; 
                        s += 2;
                    }
#endif
#if DEBUG 
                    // We want to ensure we can change our hash function daily.
                    // This is perfectly fine as long as you don't persist the 
                    // value from GetHashCode to disk or count on String A 
                    // hashing before string B.  Those are bugs in your code.
                    hash1 ^= ThisAssembly.DailyBuildNumber; 
#endif
                    return hash1 + (hash2 * 1566083941);
                }
            } 
        }

This is possibly more than you bargained for, I'll annotate the code a bit:

The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.

Someplace answered 2/3, 2013 at 16:10 Comment(12)

"it terminates early on an embedded zero" - That is bizarre. I tested it and sure enough the 64bit version ignores chars after a \0 (32bit doesn't). And since NULL is a valid unicode character, this is technically a bug IMO. – Idaidae 18/2, 2014 at 17:48

Logged on microsoft connect ~ connect.microsoft.com/VisualStudio/feedback/details/817902/… – Idaidae 18/2, 2014 at 18:3

@locster the bug was closed with "as By Design". Has anyone found an explanation why the 64-bit version doesn't hash characters after \0? – Potvaliant 2/7, 2014 at 10:31

If MS claim it was by design then they need to give an explanation, otherwise the reasonable assumption is that it is an error but perhaps one they can no longer fix without risks. In live systems there is a .Net config setting to specify use of an alternative hashing scheme, hence there is a simple workaround. Still seems dumb to not address this given that's it's such a key class in the system, and 64 bit is becoming the norm. – Idaidae 3/7, 2014 at 10:48

"designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption" It's not going to help you a lot there, a bit but not a lot. It would help if you were hashing strings that came direct from user-input leaving you open to hash-dos attacks. – Scarecrow 7/7, 2015 at 9:28

It seems to me that they used \0 to terminate because they don't have exact length stored anywhere while outside of WIN32 block. So they went for the null termination. String class' fields may differ for platforms. Just a guess. – Competition 14/9, 2015 at 14:49

How come is not computed only once since string is immutable? – Mandamandaean 10/2, 2016 at 15:41

The 64-bit version of the loop is not unrolled, in the usual sense. The two stages operate on hash1 and hash2 respectively, which are then combined when the loop exits. The zero check is to test for the end of the null-terminated string (in odd positions), as is done in the while condition (for even positions). – Kershner 10/4, 2018 at 0:12

@HansPassant You could store the hash code basically anywhere in the string object. Before the length field, between the length field and the string data or even after the string data. – Elin 16/10, 2018 at 15:30

hash code could be stored in the string class itself? or maybe its impossible to change this implementation because of backwards compatibility? – Etheleneethelin 31/12, 2018 at 21:32

@JonHanna: To my knowledge, that's the actual reason it exists; passwords and encryption "benefits" are accidental, the real goal is ensuring that web-facing applications can't be attacked with inputs (that are stored in a dictionary) that will reliably collide on the hash code (changing O(1) optimal and normal case operations into O(n) worst case performance for every lookup, insertion, etc.). – Thi 23/1, 2020 at 20:5

Just out of the topic....the while loop confirms that GetHashCode() is a linear operation. – Contribution 14/8, 2020 at 12:11

Examining the source code (courtesy of ILSpy), we can see that it does iterate over the length of the string.

// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
    IntPtr arg_0F_0;
    IntPtr expr_06 = arg_0F_0 = this;
    if (expr_06 != 0)
    {
        arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
    }
    char* ptr = arg_0F_0;
    int num = 352654597;
    int num2 = num;
    int* ptr2 = (int*)ptr;
    for (int i = this.Length; i > 0; i -= 4)
    {
        num = ((num << 5) + num + (num >> 27) ^ *ptr2);
        if (i <= 2)
        {
            break;
        }
        num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
        ptr2 += (IntPtr)8 / 4;
    }
    return num + num2 * 1566083941;
}

Nguyen answered 2/3, 2013 at 12:34 Comment(4)

yeah, I saw that, but I have no clue what it does :o – Gerhard 2/3, 2013 at 12:35

wait. On second reading, it seems that it is different from the code in my ILSpy. Mine doesn't have the for loop over Length. Maybe it is implemented differently in different platform – Gerhard 2/3, 2013 at 12:41

Um, it hashes the string. You did say you wanted to know what it does, so there it is. The string hash algorithm is different for different versions of the CLR. – Groove 2/3, 2013 at 12:41

@LouisRhys - that was the one from .NET 2.0 (since I already had that loaded in ILSpy). I've replaced it with the one from .NET 4.0 - looks very similar. – Nguyen 2/3, 2013 at 12:47

Recommended topics

Hot tags