Persistent hashcode for strings [duplicate]
Asked Answered
L

1

21

I want to generate an integer hashcode for strings, that will stay constant forever; i.e. the same string should always result in the same hashcode.

The hash does not have to be cryptographically secure, it will not be used for passwords or sensitive data.

My first attempt was to use the .net framework string.GetHashCode() function. However upon reading the sources I found the following commment:

// We want to ensure we can change our hash function daily. 
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A 
// hashing before string B.  Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;

This seems to indicate that the hashcode will not remain constant.

If so, does the framework have another method to generate repeatable hashcodes? Or would the code from GetHashCode be a reasonable starting point to implement my own?

I am looking for something as lightweight and fast as possible.
I found System.Security.Cryptography.MD5, but that seems overkill for a simple int32 hashcode, and I am worried about the overhead. At the very least it would require conversion from string to byte array, and from byte array to int, and either creation of a new MD5() object for each hash, or management of some static shared MD5 object().

Logicize answered 25/4, 2016 at 15:53 Comment(3)
If you Google for "string hash code algorithm" you'll get good results. There is no need for someone to copy the code and post it here.Publisher
I do not need a code sample of unknown quality for generating a hashcode, I am looking for a .net framework method or combination of methods, since this seems like a fundamental need for any framework. I was higly surprised that string.GetHashCode is apparently not useful for this purpose, and I was unable to find a suitable alternative. I would furthermore expect that an answer would help others as well as me, that shows where this functionality is hidden in the framework, or alternatively that it does not exist.Logicize
The search term that finally got me a good result was ".net stable string hash code".Publisher
F
64

There is no built in, cross version stable, way to get a hash code of a string.

You could just copy the existing GetHashCode() code but exclude the portion that adds the build number as the seed and don't use unsafe calls to keep yourself safe from implementation detail changes.

Here is a fully managed version of the 64bit GetHashCode() that does not use any randomization and will return the same value for all future versions of .NET (as long as the behavior of int ^ char never changes).

public static class StringExtensionMethods
{
    public static int GetStableHashCode(this string str)
    {
        unchecked
        {
            int hash1 = 5381;
            int hash2 = hash1;

            for(int i = 0; i < str.Length && str[i] != '\0'; i += 2)
            {
                hash1 = ((hash1 << 5) + hash1) ^ str[i];
                if (i == str.Length - 1 || str[i+1] == '\0')
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ str[i+1];
            }

            return hash1 + (hash2*1566083941);
        }
    }
}
Fractostratus answered 25/4, 2016 at 16:16 Comment(4)
btw this is close to the current (4.5/4.6) actual GetHashCode in String.cs.Swaine
@JonathanNappee I said it was, I even linked to String.cs in the answer. However the "real" one uses pointers and relies on the implementation detail of how strings map to char*. Doing pointers would be slightly faster but it is not as future proof because a change to how strings are stored in memory will change the hash code.Fractostratus
My bad, didn't see where the link was pointing to.Swaine
Why are there null checks in there? Wouldn't that terminate the function too early if the string had \0 somewhere in the middle?Diaper

© 2022 - 2024 — McMap. All rights reserved.