Generate integer based on any given string (without GetHashCode)
Asked Answered
L

4

26

I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.

I tried using .GetHasCode() however this is very unreliable once I move the project to another machine, as GetHasCode() returns different values for the same string

It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.

The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.

The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully, however it currently relies on .GetHasCode() and doesn't work when multiple machines get involved.

Any insight at all would be greatly appreciated.

Lavender answered 11/11, 2014 at 17:0 Comment(4)
Have you implemented a wellknown algorithm as suggested here?Deandreadeane
Are there any limitations to the string's structure (size, encoding)?Iatrogenic
There are no limitations per say, but any given string won't exceed more than a hundred characters or so.Lavender
You want a hash code, just not .NET's implementation (because it can vary). So it seems to me you should be researching hash code implementations, to find one that does suit your needs. If .NET's GetHashCode() is otherwise suitable for your needs, you could even decompile it (some version of it) and encapsulate it in a private implementation so you know it won't change. If that doesn't work, you should do your research and then come to SO with help with any specific issues that might come up.Rockfish
S
37

MD5 hashing returns a byte array which could be converted to an integer:

var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);

Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.

Spool answered 11/11, 2014 at 17:26 Comment(3)
Thanks for the response. This is more efficient version of the same code that I figured out a few minutes ago. My method now produces the same result on my machine and the server. Even though MD5 is working in my case, would MD5 ever produce different results on another machine?Lavender
@user1691808 md5 is platform agnostic. For the same input, it will produce the same value regardless of machine/language.Platinize
Another note of something I did learn, a mysql BIGINT is always 8 bytes and can store -9223372036854775808 to 9223372036854775807, which appears to be the same range of integers the Int64 version of your code would produce. Just a tidbit of knowledge I wasn't aware of before.Lavender
B
16

If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.

If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 70,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.

As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32 bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is slightly larger--closer to 70,000--but the rule of thumb gets you in the ballpark.

Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.

That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.

Brotherly answered 11/11, 2014 at 18:30 Comment(0)
N
4

This code map any string to int between 0-100

int x= "ali".ToCharArray().Sum(x => x)%100;
Nephro answered 4/10, 2017 at 11:18 Comment(1)
Taking the sum will give a normal distribution instead of something uniform, this will increase the likelyhood of collisions.Sharasharai
W
-1
using (MD5 md5 = MD5.Create())
{
    bigInteger = new BigInteger(md5.ComputeHash(Encoding.Default.GetBytes(myString)));
}

BigInteger requires Org.BouncyCastle.Math

Wheresoever answered 22/5, 2018 at 23:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.