Is it possible to combine hash codes for private members to generate a new hash code?
I have an object for which I want to generate a unique hash (by overriding GetHashCode()), but I want to avoid overflows or anything else unpredictable.

The hash code should be the result of combining the hash codes of a small collection of strings.

The hash codes will be part of generating a cache key, so ideally they should be unique; however, the number of possible values being hashed is small, so I THINK probability is in my favour here.

Would something like this be sufficient AND is there a better way of doing this?

int hash = 0;
foreach(string item in collection){
    hash += (item.GetHashCode() / collection.Count);
}
return hash;

EDIT: Thanks for answers so far. @Jon Skeet: No, order is not important

I guess this is almost another question, but since I am using the result to generate a cache key (string), would it make sense to use a cryptographic hash function like MD5, or just to use the string representation of this int?

Bellyful answered 3/7, 2009 at 12:43 Comment(2)
It sounds from your update that you are expecting the output from this process to have a sufficiently low probability of collision so as to treat it as a unique key... You need a very good hash and quite a few more bits than 32 to make this workCravat
If you do want a key then using a crypto hash will normally be enough (so long as you don't care about its crypto properties, MD5 is fine), but it will be considerably more expensive to calculate than equally effective non-crypto hashes.Cravat

The fundamentals pointed out by Marc and Jon are not bad, but they are far from optimal in terms of the evenness of the distribution of the results. Sadly, the 'multiply by primes' approach copied by so many people from Knuth is not the best choice; in many cases better distribution can be achieved by cheaper-to-calculate functions (though the difference is very slight on modern hardware). In fact, throwing primes into many aspects of hashing is no panacea.

If this data is used for significantly sized hash tables, I recommend reading Bret Mulvey's excellent study and explanation of various modern (and not so modern) hashing techniques, handily done in C#.

Note that the behaviour of various hash functions on strings is heavily biased towards whether the strings are short (roughly speaking, how many characters are hashed before the bits begin to overflow) or long.

One of the simplest and easiest to implement is also one of the best: the Jenkins one-at-a-time hash.

// Jenkins one-at-a-time: mix each input byte into the running hash
private static unsafe void Hash(byte* d, int len, ref uint h)
{
    for (int i = 0; i < len; i++)
    {
        h += d[i];
        h += (h << 10);
        h ^= (h >> 6);
    }
}

// Hash a string by treating its UTF-16 chars as raw bytes
public unsafe static void Hash(ref uint h, string s)
{
    fixed (char* c = s)
    {
        byte* b = (byte*)(void*)c;
        Hash(b, s.Length * 2, ref h);  // two bytes per char
    }
}

// Final mixing step, applied once after all input has been hashed
public unsafe static int Avalanche(uint h)
{
    h += (h << 3);
    h ^= (h >> 11);
    h += (h << 15);
    return *((int*)(void*)&h);
}

You can then use it like so:

uint h = 0;
foreach(string item in collection) 
{
    Hash(ref h, item);
}
return Avalanche(h);

You can merge in multiple different types like so:

public unsafe static void Hash(ref uint h, int data)
{
    byte* d = (byte*)(void*)&data;
    Hash(d, sizeof(int), ref h);
}

public unsafe static void Hash(ref uint h, long data)
{
    byte* d = (byte*)(void*)&data;
    Hash(d, sizeof(long), ref h);
}

If you only have access to the fields as objects, with no knowledge of their internals, you can simply call GetHashCode() on each one and combine the values like so:

uint h = 0;
foreach(var item in collection) 
{
    Hash(ref h, item.GetHashCode());
}
return Avalanche(h);

Sadly you can't do sizeof(T), so you must handle each struct individually.

If you wish to use reflection, you can construct, on a per-type basis, a function which performs structural identity and hashing on all fields.
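As a hedged sketch of that reflection approach (the type and method names here are mine, not part of the original answer; it runs each field's GetHashCode() through a one-at-a-time style mixer, so it does not need unsafe code):

```csharp
using System.Reflection;

static class ReflectionHash
{
    // Mix a 32-bit value into the hash one byte at a time
    static uint Mix(uint h, int data)
    {
        unchecked
        {
            for (int i = 0; i < 4; i++)
            {
                h += (byte)(data >> (8 * i));
                h += h << 10;
                h ^= h >> 6;
            }
            return h;
        }
    }

    // Combine the hash codes of every instance field, public or private
    public static int StructuralHash(object obj)
    {
        uint h = 0;
        foreach (FieldInfo f in obj.GetType().GetFields(
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic))
        {
            object value = f.GetValue(obj);
            h = Mix(h, value == null ? 0 : value.GetHashCode());
        }
        unchecked
        {
            // final avalanche, as in the code above
            h += h << 3;
            h ^= h >> 11;
            h += h << 15;
            return (int)h;
        }
    }
}
```

Note that field enumeration order is only guaranteed to be stable within a single runtime, which is fine for in-process hashing but not for persisted keys.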

If you wish to avoid unsafe code, you can use bit-masking techniques to pull out individual bits from ints (and chars, if dealing with strings) without too much extra hassle.
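For example, a pointer-free variant of the string hash might look like this (my sketch, not the original answer's code; it extracts each byte of the UTF-16 chars with shifts and masks, feeding the low byte first to match the order the unsafe version reads):

```csharp
// Safe one-at-a-time hash over a string: pull each byte of every
// UTF-16 char out with shifts and masks instead of pointers
static uint Hash(uint h, string s)
{
    unchecked
    {
        foreach (char c in s)
        {
            h += (byte)(c & 0xFF);  // low byte first
            h += h << 10;
            h ^= h >> 6;
            h += (byte)(c >> 8);    // then the high byte
            h += h << 10;
            h ^= h >> 6;
        }
        return h;
    }
}
```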

Cravat answered 3/7, 2009 at 13:43 Comment(5)
It seems to me that the link you posted talks about using a hash value modulo a prime, not generating the hash value itself. In other words, it's not the hash generation, it's hash -> bucket transformation.Swope
If you look at the next link (Bret's SimpleHash analysis) it shows how poor it is at uniformity of distribution; home.comcast.net/~bretm/hash/5.html takes the SimpleHash described as the first testCravat
You're right in that the link where it stands confuses the issue. will reworkCravat
The link (my.xfinity.com/~bretm/hash) to the hash techniques article is broken it seems. Anyone know where it went?Marsupial
The reference link is broken, can you please share the new link?Asberry

Hashes aren't meant to be unique - they're just meant to be well distributed in most situations, and above all consistent. Note that overflows shouldn't be a problem.

Just adding isn't generally a good idea, and dividing certainly isn't. Here's the approach I usually use:

int result = 17;
foreach (string item in collection)
{
    result = result * 31 + item.GetHashCode();
}
return result;

If you're otherwise in a checked context, you might want to deliberately make it unchecked.
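For instance, wrapped in a helper (the method name and IEnumerable&lt;string&gt; parameter are my additions for the sketch):

```csharp
using System.Collections.Generic;

// Same combining loop, wrapped in unchecked so the multiply wraps
// around instead of throwing in a checked build
static int CombineOrdered(IEnumerable<string> collection)
{
    unchecked
    {
        int result = 17;
        foreach (string item in collection)
        {
            result = result * 31 + item.GetHashCode();
        }
        return result;
    }
}
```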

Note that this assumes that order is important, i.e. that { "a", "b" } should be different from { "b", "a" }. Please let us know if that's not the case.

Swope answered 3/7, 2009 at 12:46 Comment(10)
lol - we chose different primes (and I threw in a Count.GetHashCode), but yet again we mimic each other ;-pGlasshouse
Indeed - I'm not quite sure why you're multiplying the hash code each time though, given that it'll be multiplied again in a minute anyway...Swope
True, true - but you have more detail anyway so I'm deleting.Glasshouse
in a minute? sounds like you need a faster computer Jon ;)Arnettaarnette
It's not that I think your answer's bad. I just hate to see the 'multiply by prime' stuff perpetuateCravat
As it turns out this is causing a "System.OverflowException : Arithmetic operation resulted in an overflow." which is what I wanted to avoid. Am I missing something obvious?Bellyful
this operation should be allowed to overflow; that is the desired hashing behaviour (some data has to be discarded as you push more and more into a finite space). Simply placing an unchecked context block around it will be sufficient.Cravat
The other method I've seen is to use bit rotation to preserve more information. This is a good article on different hash algorithms, eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspxTitular
Any reason why you're not using an XOR (which is the standard operator I use for hashcode combination)?Fan
How would you represent this if order was not important? Is this possibly where the ^ operator comes in?Canoe

There is nothing wrong with this approach as long as the members whose hash codes you are combining follow the rules of hash codes. In short ...

  1. The hash code of the private members should not change for the lifetime of the object
  2. The container must not change the object the private members point to lest it in turn change the hash code of the container
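To illustrate what breaking these rules costs, here is a hypothetical sketch (the CacheKey type and Demo helper are made up for the example): a dictionary stores each entry in a bucket chosen by the key's hash code at insertion time, so mutating a member that feeds the hash makes the entry unreachable until the member is restored.

```csharp
using System.Collections.Generic;

class CacheKey
{
    public string Part;  // mutable member feeding the hash
    public override int GetHashCode() { return Part.GetHashCode(); }
    public override bool Equals(object o)
    {
        var k = o as CacheKey;
        return k != null && k.Part == Part;
    }
}

static class Demo
{
    public static bool LostAndFound()
    {
        var key = new CacheKey { Part = "a" };
        var cache = new Dictionary<CacheKey, string>();
        cache[key] = "value";

        key.Part = "b";   // rules broken: the entry now sits in a bucket
                          // that no longer matches the key's hash code

        key.Part = "a";   // restore the member and the entry is found again
        return cache.ContainsKey(key);  // true
    }
}
```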
Ashantiashbaugh answered 3/7, 2009 at 12:45 Comment(0)

If the order of the items is not important (i.e. {"a","b"} is the same as {"b","a"}) then you can use exclusive or to combine the hash codes:

hash ^= item.GetHashCode();

[Edit: As Mark pointed out in a comment to a different answer, this has the drawback of also giving collections like {"a"} and {"a","b","b"} the same hash code.]
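One way to soften that drawback (my own suggestion, not part of the original answer; the mixing constant is arbitrary) is to mix each element's bits and then combine with addition: order still doesn't matter, but duplicate pairs no longer cancel out to zero the way they do under plain XOR.

```csharp
using System.Collections.Generic;

static int CombineUnordered(IEnumerable<string> items)
{
    unchecked
    {
        uint hash = 0;
        foreach (string item in items)
        {
            uint h = (uint)item.GetHashCode();
            h *= 0x9E3779B9u;      // arbitrary odd mixing constant
            hash += h ^ (h >> 16); // addition keeps the combine order-independent
        }
        return (int)hash;
    }
}
```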

If the order is important, you can instead multiply by a prime number and add:

hash *= 11;
hash += item.GetHashCode();

(When you multiply you will sometimes get an overflow that is ignored, but by multiplying by a prime number you lose a minimum of information. If you instead multiplied by a number like 16, you would lose four bits of information each time, so after eight items the hash code from the first item would be completely gone.)
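The arithmetic is easy to check with a sketch (the helper name is mine): multiplying by 16 is a four-bit left shift, so after eight rounds nothing of a 32-bit value remains, whereas an odd multiplier like 31 always preserves at least the low bit.

```csharp
// Repeatedly multiply a seed value with 32-bit wraparound, to see
// how much of the original information survives
static int Multiply(int seed, int factor, int times)
{
    unchecked
    {
        int h = seed;
        for (int i = 0; i < times; i++)
        {
            h *= factor;
        }
        return h;
    }
}

// Multiply(12345, 16, 8) == 0  (eight 4-bit shifts: all 32 bits gone)
// Multiply(12345, 31, 8) != 0  (odd factor: the low bit always survives)
```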

Shrunk answered 3/7, 2009 at 12:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.