Reverse Engineering String.GetHashCode
Asked Answered
S

4

4

String.GetHashCode's behavior is depend on the program architecture. So it will return one value in x86 and one value on x64. I have a test application which must run in x86 and it must predict the hash code output from an application which must run on x64.

Below is the disassembly of the String.GetHashCode implementation from mscorwks.

public override unsafe int GetHashCode()
{
      fixed (char* text1 = ((char*) this))
      {
            char* chPtr1 = text1;
            int num1 = 0x15051505;
            int num2 = num1;
            int* numPtr1 = (int*) chPtr1;
            for (int num3 = this.Length; num3 > 0; num3 -= 4)
            {
                  num1 = (((num1 << 5) + num1) + (num1 >≫ 0x1b)) ^ numPtr1[0];
                  if (num3 <= 2)
                  {
                        break;
                  }
                  num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr1[1];
                  numPtr1 += 2;
            }
            return (num1 + (num2 * 0x5d588b65));
      }
}

Can anybody port this function to a safe implementation??

Stere answered 1/12, 2011 at 23:1 Comment(6)
You can access characters in a string using an indexer.Morgue
Why do you need to have the hash codes match? Do not store hash codes for any purpose, as the implementation can be changed at any time! Eric Lippert's blog. Scroll down to "Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains". Also the section "Security issue: do not use GetHashCode "off label" "Ticking
Also see this questionTicking
i didn't realized gethashcode was so pliable. so yes - fools errand. i agree. great advice. i shall pursue the use of a repeatable and architecture agnostic hashing algorithm.Stere
Kenn: For future reference, you might find the reference source of interest. Why disassemble the .Net Framework code when you can just look at actual, fully commented source code instead?Merralee
For those who are stuck with their (or someone else's) regrettable past decision to depend on this hash algorithm, or if you'd just like to have a simple and repeatable string hash, there is a safe implementation below.Horvath
C
20

Hash codes are not intended to be repeatable across platforms, or even multiple runs of the same program on the same system. You are going the wrong way. If you don't change course, your path will be difficult and one day it may end in tears.

What is the real problem you want to solve? Would it be possible to write your own hash function, either as an extension method or as the GetHashCode implementation of a wrapper class and use that one instead?

Chancelor answered 1/12, 2011 at 23:9 Comment(3)
Only may end in tears?!Rapscallion
@bacar: Everything could be working just fine in spite of you violating the rules. Then one day something happens (.NET update? server migration? anything really) and your app breaks without warning.Chancelor
I just feel more certain than you that it will end in tears :-)Rapscallion
P
16

First off, Jon is correct; this is a fool's errand. The internal debug builds of the framework that we use to "eat our own dogfood" change the hash algorithm every day precisely to prevent people from building systems -- even test systems -- that rely on unreliable implementation details that are documented as subject to change at any time.

Rather than enshrining an emulation of a system that is documented as being not suitable for emulation, my recommendation would be to take a step back and ask yourself why you're trying to do something this dangerous. Is it really a requirement?

Second, StackOverflow is a technical question and answer site, not a "do my job for me for free" site. If you are hell bent on doing this dangerous thing and you need someone who can rewrite unsafe code into equivalent safe code then I recommend that you hire someone who can do that for you.

Philoctetes answered 1/12, 2011 at 23:49 Comment(10)
Oh god! Eric, you are really angry :)Oliy
i didn't realized gethashcode was so pliable. so yes - fools errand. i agree. great advice. i shall pursue the use of a repeatable and architecture agnostic hashing algorithm.Stere
@Konstantin: I'm not angry at all; as he my customer, I want Kenn to be successful. I'm therefore firm in my statements that (1) this is a bad idea that is likely to result in high costs in the future, and (2) StackOverflow is a good place to ask for facts and a bad place to ask for free work. People who know that are more likely to use StackOverflow successfully.Philoctetes
How do you avoid problems with code running around midnight? You should change the algorithm every time you run the code rather than every day...Illinois
@configurator: I was unclear. The algorithm changes every day because the internal version of the algorithm takes into account the current build number, which changes every day.Philoctetes
A pity you don't use a similar mechanism(For example different for each process) in release builds to prevent people to accidentally relying on undefined behavior.Merchandising
Reading the source code of String.cs makes me grin. //... Those are bugs in your code. \n hash1 ^= ThisAssembly.DailyBuildNumber;Merralee
@CodeInChaos: Ah, but the framework's build number doesn't change with every run, so there's nothing to plug into the calculation.Illinois
@Illinois that's why I suggested a similar mechanism that doesn't change with the build, but rather with each instance of the process.Merchandising
@CodeInChaos: Sorry, I completely missed the parenthetical.Illinois
H
5

While all of the warnings given here are valid, they don't answer the question. I had a situation in which GetHashCode() was unfortunately already being used for a persisted value in production, and I had no choice but to re-implement using the default .NET 2.0 32-bit x86 (little-endian) algorithm. I re-coded without unsafe as shown below, and this appears to be working. Hope this helps someone.

// The GetStringHashCode() extension method is equivalent to the Microsoft .NET Framework 2.0
// String.GetHashCode() method executed on 32 bit systems.
public static int GetStringHashCode(this string value)
{
    int hash1 = (5381 << 16) + 5381;
    int hash2 = hash1;

    int len = value.Length;
    int intval;
    int c0, c1;
    int i = 0;
    while (len > 0)
    {
        c0 = (int)value[i];
        c1 = len > 1 ? (int)value[i + 1] : 0;
        intval = c0 | (c1 << 16);
        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ intval;
        if (len <= 2)
        {
            break;
        }
        i += 2;
        c0 = (int)value[i];
        c1 = len > 3 ? (int)value[i + 1] : 0;
        intval = c0 | (c1 << 16);
        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ intval;
        len -= 4;
        i += 2;
    }

    return hash1 + (hash2 * 1566083941);
}
Horvath answered 18/5, 2016 at 19:29 Comment(1)
Thank you for this. It goes without saying that hashcodes should not be relied upon but it still helps to have the question answered. In my case, I'm using a hashcode to resolve to an index of a pre-defined colour palette to render logs. This code helped port to Javascript. The specific feature I needed here is that similar input strings generate vastly different integers.Lovable
A
0

The following exactly reproduces the default String hash codes on .NET 4.7 (and probably earlier). This is the hash code given by:

  • Default on a String instance: "abc".GetHashCode()
  • StringComparer.Ordinal.GetHashCode("abc")
  • Various String methods that take StringComparison.Ordinal enumeration.
  • System.Globalization.CompareInfo.GetStringComparer(CompareOptions.Ordinal)

Testing on release builds with full JIT optimization, these versions modestly outperform the built-in .NET code, and have also been heavily unit-tested for exact equivalence with .NET behavior. Notice there are separate versions for x86 versus x64. Your program should generally include both; below the respective code listings is a calling harness which selects the appropriate version at runtime.

x86   -   (.NET running in 32-bit mode)

static unsafe int GetHashCode_x86_NET(int* p, int c)
{
    int h1, h2 = h1 = 0x15051505;

    while (c > 2)
    {
        h1 = ((h1 << 5) + h1 + (h1 >> 27)) ^ *p++;
        h2 = ((h2 << 5) + h2 + (h2 >> 27)) ^ *p++;
        c -= 4;
    }

    if (c > 0)
        h1 = ((h1 << 5) + h1 + (h1 >> 27)) ^ *p++;

    return h1 + (h2 * 0x5d588b65);
}

x64   -   (.NET running in 64-bit mode)

static unsafe int GetHashCode_x64_NET(Char* p)
{
    int h1, h2 = h1 = 5381;

    while (*p != 0)
    {
        h1 = ((h1 << 5) + h1) ^ *p++;

        if (*p == 0)
            break;

        h2 = ((h2 << 5) + h2) ^ *p++;
    }
    return h1 + (h2 * 0x5d588b65);
}

Calling harness / extension method for either platform (x86/x64):

readonly static int _hash_sz = IntPtr.Size == 4 ? 0x2d2816fe : 0x162a16fe;

public static unsafe int GetStringHashCode(this String s)
{
    /// Note: x64 string hash ignores remainder after embedded '\0'char (unlike x86)
    if (s.Length == 0 || (IntPtr.Size == 8 && s[0] == '\0'))
        return _hash_sz;

    fixed (char* p = s)
        return IntPtr.Size == 4 ?
            GetHashCode_x86_NET((int*)p, s.Length) :
            GetHashCode_x64_NET(p);
}
Apiece answered 13/2, 2018 at 21:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.