new KeyValuePair<UInt32, UInt32>(i, j).GetHashCode(); High Rate of Duplicates
In search of a fast composite key for Dictionary, I came upon an anomaly I can neither understand nor justify.

In limited testing

Dictionary<KeyValuePair<UInt32, UInt32>, string>

is significantly slower (200:1) than

Dictionary<KeyValuePair<UInt16, UInt16>, string>

Test: two nested loops from 0 to 1000, Populate and then ContainsKey (times in ms):

          Populate     ContainsKey  
UInt32    92085         86578  
UInt16     2201           431

The problem is that

new KeyValuePair<UInt32, UInt32>(i, j).GetHashCode();

yields MANY duplicates.
Looping i and j each over 1024 values (1024 x 1024 pairs), only 1024 unique hash values are created.

Based on the avalanche comment from CasperOne, I tried i*31 and j*97 (two prime numbers), which produced 105280 unique values on 1024 x 1024. Still a lot of duplicates. CasperOne, I know that is not the same as random. But it is not my job to randomize the input; GetHashCode() is supposed to randomize the output.

Why the high number of duplicates?

Same loop on

new KeyValuePair<UInt16, UInt16>(i, j).GetHashCode();

yields 1024 X 1024 unique hash codes (perfect).

Int32 has the same problem.

These duplicate hash values kill

Dictionary<KeyValuePair<UInt32, UInt32>, string> 

Tuple also generates a lot of duplicates, but it does not degrade at Int32 compared to Int16 the way KeyValuePair does.

Time for generating the raw KVP and the raw KVP.GetHashCode() is similar.

Same anomaly with HashSet.

Dictionary<KeyValuePair<UInt32, UInt32>, string> dKVPu32 = new Dictionary<KeyValuePair<UInt32, UInt32>, string>();
Dictionary<KeyValuePair<UInt16, UInt16>, string> dKVPu16 = new Dictionary<KeyValuePair<UInt16, UInt16>, string>();
KeyValuePair<UInt32, UInt32> kvpUint32;
KeyValuePair<UInt16, UInt16> kvpUint16;
int range = 1000;
Int32 hashCode;
HashSet<Int32> kvpUint32Hash = new HashSet<Int32>();
HashSet<Int32> kvpUint16Hash = new HashSet<Int32>();

Stopwatch sw = new Stopwatch();
sw.Start();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        kvpUint32 = new KeyValuePair<UInt32, UInt32>(i, j);
    }
}
Console.WriteLine("UInt32  raw " + sw.ElapsedMilliseconds.ToString());
//  7
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        kvpUint16 = new KeyValuePair<UInt16, UInt16>(i, j);
    }
}
Console.WriteLine("UInt16  raw " + sw.ElapsedMilliseconds.ToString());
//  6
sw.Restart();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        hashCode = new KeyValuePair<UInt32, UInt32>(i, j).GetHashCode();
        kvpUint32Hash.Add(hashCode);
    }
}
Console.WriteLine("UInt32  GetHashCode " + sw.ElapsedMilliseconds.ToString() + "  unique count " + kvpUint32Hash.Count.ToString());
//  285   1024
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        hashCode = new KeyValuePair<UInt16, UInt16>(i, j).GetHashCode();
        kvpUint16Hash.Add(hashCode);
    }
}
Console.WriteLine("UInt16  GetHashCode " + sw.ElapsedMilliseconds.ToString() + "  unique count " + kvpUint16Hash.Count.ToString());
//  398 1000000
sw.Restart();
Console.ReadLine();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        dKVPu32.Add(new KeyValuePair<UInt32, UInt32>(i, j), String.Format("{0} {1}", i.ToString(), j.ToString()));
    }
}
Console.WriteLine("hsKVPu32 pop " + sw.ElapsedMilliseconds.ToString());
//  92085
sw.Restart();
for (UInt32 i = 0; i < range; i++)
{
    for (UInt32 j = 0; j < range; j++)
    {
        if (!dKVPu32.ContainsKey(new KeyValuePair<UInt32, UInt32>(i, j))) Debug.WriteLine("Oops");
    }
}
Console.WriteLine("hsKVPu32 find " + sw.ElapsedMilliseconds.ToString());
//  86578
dKVPu32.Clear();
dKVPu32 = null;
GC.Collect();
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        dKVPu16.Add(new KeyValuePair<UInt16, UInt16>(i, j), String.Format("{0} {1}", i.ToString(), j.ToString()));
    }
}
Console.WriteLine("hsKVPu16 pop " + sw.ElapsedMilliseconds.ToString());
//   2201
sw.Restart();
for (UInt16 i = 0; i < range; i++)
{
    for (UInt16 j = 0; j < range; j++)
    {
        if (!dKVPu16.ContainsKey(new KeyValuePair<UInt16, UInt16>(i, j))) Debug.WriteLine("Oops");
    }
}
sw.Stop();
Console.WriteLine("hsKVPu16 find " + sw.ElapsedMilliseconds.ToString());
//  431

P.S. The fastest option is to pack the two values into a single integer key, e.g. ((UInt32)int1 << 16) | int2;
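A minimal sketch of that packing approach (class and method names here are illustrative). When both values fit in 16 bits, the packed key is unique per pair and acts as its own collision-free hash, so KeyValuePair hashing is sidestepped entirely:

```csharp
using System;
using System.Collections.Generic;

class PackedKeyDemo
{
    // Pack two 16-bit values into one 32-bit key; distinct pairs map to
    // distinct keys, so the key doubles as a perfect hash.
    public static uint Pack(ushort hi, ushort lo) => ((uint)hi << 16) | lo;

    static void Main()
    {
        var dict = new Dictionary<uint, string>();
        dict[Pack(8, 992)] = "8 992";
        Console.WriteLine(dict[Pack(8, 992)]); // prints "8 992"
    }
}
```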

The hash of the first UInt32 column below equals the hash of the KVP built from the next two columns.

2281371105 8 992
2281371104 8 993
2281371107 8 994

2281371145 0 0
2281371147 0 2
2281371149 0 4
2281371151 0 6
2281371137 0 8

2281371144 0 1
2281371146 0 3
2281371148 0 5
2281371150 0 7
2281371136 0 9

2281371144 1 0
2281371145 1 1
2281371146 1 2
2281371147 1 3
2281371148 1 4
2281371149 1 5
2281371150 1 6
2281371151 1 7
2281371136 1 8
2281371137 1 9

2281371147 2 0
2281371146 2 1
2281371144 2 3
2281371151 2 4
2281371150 2 5
2281371149 2 6
2281371148 2 7
2281371139 2 8

The only pattern I have found is that either the sum or the difference of the KVP components matches.
But I could not find a pattern for when to sum and when to subtract.
It is a bad hash, so knowing exactly what it is is of little value.

Spatula answered 29/9, 2012 at 23:16 Comment(10)
have you tried profiling it? line by line?Myrtlemyrvyn
@KrzysztofKoźmic Don't follow. Please explain.Spatula
I'm guessing that an avalanche effect is taking place here on the 32-bit KVP; you're using the same range of values with two 32-bit fields, so values have to collide somewhere when generating the hash code for those KVPs. With the 16-bit values, you theoretically never have to have collisions. A better test would be to generate random values in the 16-bit range and try the 32-bit and 16-bit KVPs; I imagine the results would be much more consistent then.Fondea
@blam I mean have you tried running it with a profilerMyrtlemyrvyn
@Fondea OK you have a point and I will do more testing. But 1,000 is small compared to both UInt16 and UInt32.Spatula
@KrzysztofKoźmic No I have not run a profiler and now that this is a question of unique hash values I am going to focus on that.Spatula
It's not the number of samples, it's the fact they are all consecutive; try it with random numbers and I believe you'll see more consistent results.Fondea
@Fondea Understood. But sequential input is a pattern that a GetHashCode should deal with. A Tuple of UInt32, UInt32 subjected to the same sequential input produces 33792 unique GetHashCode values (versus 1024 for KVP). The KVP degradation from UInt16 to UInt32 is not justified by statistics.Spatula
There is really no such thing as a perfect hash code for every situation. Given the constraints of your project, you may be better off implementing your own. I've found this page has some helpful guidelines; it is for Java but the same principles apply. javamex.com/tutorials/collections/…Boanerges
I'm pretty sure this has to do with the default implementation of GetHashCode in ValueType... which only uses the first field in some cases, IIRC. Looking into it.Arms
A
8

Firstly, we can dispense with the timing aspect of this - it feels to me like this is really just about hash collisions, as obviously those will kill the performance.

So, the question is really why there are more hash collisions for KeyValuePair<uint, uint> than KeyValuePair<ushort, ushort>. To help find out a bit more about that, I've written the following short program:

using System;
using System.Collections.Generic;

class Program
{
    const int Sample1 = 100;
    const int Sample2 = 213;

    public static void Main()
    {
        Display<uint, ushort>();
        Display<ushort, ushort>();
        Display<uint, uint>();
        Display<ushort, uint>();
    }

    static void Display<TKey, TValue>()
    {
        TKey key1 = (TKey) Convert.ChangeType(Sample1, typeof(TKey));
        TValue value1 = (TValue) Convert.ChangeType(Sample1, typeof(TValue));
        TKey key2 = (TKey) Convert.ChangeType(Sample2, typeof(TKey));
        TValue value2 = (TValue) Convert.ChangeType(Sample2, typeof(TValue));

        Console.WriteLine("Testing {0}, {1}", typeof(TKey).Name, typeof(TValue).Name);
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key1, value1).GetHashCode());
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key1, value2).GetHashCode());
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key2, value1).GetHashCode());
        Console.WriteLine(new KeyValuePair<TKey, TValue>(key2, value2).GetHashCode());
        Console.WriteLine();
    }
}

The output on my machine is:

Testing UInt32, UInt16
-1888265981
-1888265981
-1888265806
-1888265806

Testing UInt16, UInt16
-466800447
-459525951
-466800528
-459526032

Testing UInt32, UInt32
958334947
958334802
958334802
958334947

Testing UInt16, UInt32
-1913331935
-1913331935
-1913331935
-1913331935

You can obviously try varying the sample values to see where there are collisions.

The results of KeyValuePair<ushort, uint> are particularly worrying, and the results of KeyValuePair<ushort, ushort> are surprisingly good.

In fact, KeyValuePair<ushort, uint> isn't just bad - it's ludicrously bad as far as I can see - I have yet to find any value which doesn't have the same hash code of -1913331935 when running the 64-bit CLR. Running the 32-bit CLR I get a different hash code, but still the same hash code for all values.

It appears that in .NET 4.5 (which is what I'm running) the default implementation of GetHashCode doesn't just take the first instance field of the struct, as previously documented. I suspect that for at least some types, it just uses the first 4 bytes of memory beyond the header in the boxed value (and there will be boxing for every call here), and that ends up sometimes being just the first field (if that field is a uint), sometimes being more than one field (e.g. for ushort, ushort where both fields can fit "inside" 4 bytes) and sometimes being no fields at all (ushort, uint).

(Actually, this doesn't explain why you get 1024 different hash codes in the uint, uint case instead of just 1000. I'm still unsure on that.)

Ultimately, using a value type which doesn't override GetHashCode as a dictionary key seems like it's just a bad idea, unless you've tested to ensure that it's suitable for your specific requirements. There's just too much which is black magic to be confident about it, IMO.
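One way to avoid the black magic, sketched here with an illustrative comparer (the multiply-add mix and the constant 31 are arbitrary choices, not something prescribed in this thread): pass a custom IEqualityComparer<TKey> to the Dictionary constructor, so the default ValueType.GetHashCode is never consulted.

```csharp
using System;
using System.Collections.Generic;

// Illustrative comparer: mixes both fields so distinct pairs rarely collide.
sealed class UIntPairComparer : IEqualityComparer<KeyValuePair<uint, uint>>
{
    public bool Equals(KeyValuePair<uint, uint> x, KeyValuePair<uint, uint> y)
        => x.Key == y.Key && x.Value == y.Value;

    public int GetHashCode(KeyValuePair<uint, uint> p)
    {
        unchecked
        {
            int h = (int)p.Key;
            h = h * 31 + (int)p.Value; // simple multiply-add mix
            return h;
        }
    }
}

class ComparerDemo
{
    static void Main()
    {
        var d = new Dictionary<KeyValuePair<uint, uint>, string>(new UIntPairComparer());
        d.Add(new KeyValuePair<uint, uint>(1, 2), "1 2");
        Console.WriteLine(d.ContainsKey(new KeyValuePair<uint, uint>(1, 2))); // prints "True"
    }
}
```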

Arms answered 30/9, 2012 at 7:23 Comment(4)
If loop on 1024 X 1024 UInt32 get 1024 unique and each is repeated exactly 1024.Spatula
@Blam: Yes, that makes sense - but when you're only looping 1000x1000, I'd expect there to be only 1000 keys...Arms
Hans deleted his answer but I think he had it. 0,0 - 0,999 produce unique GetHashCode. 8,992 - 8,999 produce unique. 16,992 - 16,999 and 24,992 - 24,999.Spatula
As a test took i=0 and j=0-1024 to see if that matched any of fresh i>0 j=0-1000 and the answer is no. hash UInt32 2281371105 = hash KVP 8,992. 2281371104 = 8,993 2281371107 = 8,994Spatula
J
8

Since GetHashCode returns an Int32, every pair of Int16s (or UInt16s) can easily return a unique value. With a pair of Int32s, you would need to combine the values in some way to be compatible with your design.

KeyValuePair doesn't override GetHashCode(), so you are just using the default implementation of ValueType.GetHashCode(), and the documentation for it says the following:

(from: http://msdn.microsoft.com/en-us/library/system.valuetype.gethashcode.aspx)

If you call the derived type's GetHashCode method, the return value is not likely to be suitable for use as a key in a hash table. Additionally, if the value of one or more of those fields changes, the return value might become unsuitable for use as a key in a hash table. In either case, consider writing your own implementation of the GetHashCode method that more closely represents the concept of a hash code for the type.

Since KeyValuePair doesn't override GetHashCode(), I assume it's not intended to be used as a Dictionary key.

Further, according to this question and this C# code, the default implementation of ValueType.GetHashCode() simply selects the first non-static field, and returns the result of its GetHashCode() method. This explains the high number of duplicates for KeyValuePair<UInt32, UInt32>, although it doesn't explain the lack of duplicates for KeyValuePair<UInt16, UInt16>.

My guess would be that for KeyValuePair<UInt32, UInt32>, GetHashCode() does simply return GetHashCode() of the first value, and that for KeyValuePair<UInt16, UInt16>, GetHashCode() is combining the values resulting in a unique hash for each pair of values, since it is possible and straightforward to do so.
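A hedged sketch of the route the quoted documentation suggests: a small composite-key struct that overrides GetHashCode itself (the 397 multiplier is an arbitrary illustrative choice, and the names are made up for this example):

```csharp
using System;
using System.Collections.Generic;

// Illustrative composite key: overrides Equals and GetHashCode explicitly,
// so the default ValueType.GetHashCode is never used.
readonly struct PairKey : IEquatable<PairKey>
{
    public readonly uint A;
    public readonly uint B;
    public PairKey(uint a, uint b) { A = a; B = b; }

    public bool Equals(PairKey other) => A == other.A && B == other.B;
    public override bool Equals(object obj) => obj is PairKey p && Equals(p);

    public override int GetHashCode()
    {
        unchecked { return (int)(A * 397 ^ B); } // mix both fields
    }
}

class PairKeyDemo
{
    static void Main()
    {
        var d = new Dictionary<PairKey, string> { [new PairKey(3, 4)] = "3 4" };
        Console.WriteLine(d[new PairKey(3, 4)]); // prints "3 4"
    }
}
```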

Jointress answered 30/9, 2012 at 3:12 Comment(2)
I gave you a +1 because what you have written is correct. I suspect that the downvoter did so because you haven't directly answered the question.Uruguay
+1 Direct answer to the problem. Which is actually more valuable than a direct answer to the question.Spatula
S
1

As other answerers have mentioned, KeyValuePair does not override GetHashCode, and the default implementation of GetHashCode for structs isn't the best. You can instead use two-element tuples for this, e.g.

var dict = new Dictionary<Tuple<uint, uint>, string>();
dict.Add(Tuple.Create(1u, 2u),"xxx"); // Tuples override GetHashCode

Note however this will add additional overhead for the extra Tuple heap allocation. (It's partially made up for, though, since when you call GetHashCode on a struct that doesn't override it, you implicitly box it.)
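On later frameworks (C# 7's System.ValueTuple, which postdates this answer), a value tuple gives the same combined hash without the heap allocation; a minimal sketch:

```csharp
using System;
using System.Collections.Generic;

class ValueTupleDemo
{
    static void Main()
    {
        // (uint, uint) is System.ValueTuple<uint, uint>: a struct that
        // overrides GetHashCode using both elements, with no per-key allocation.
        var dict = new Dictionary<(uint, uint), string>();
        dict[(1u, 2u)] = "xxx";
        Console.WriteLine(dict.ContainsKey((1u, 2u))); // prints "True"
    }
}
```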

Smallminded answered 11/8, 2016 at 3:0 Comment(2)
As stated in my question tuple also generate a lot of duplicates. Did you test?Spatula
There's even a GitHub issue about improving the default hash code and equality. Can't find it right now.Comradery
T
0

The bottom line is to always override GetHashCode if you want to put a lot of your own structs into a hash-using structure like a dictionary. You can use this extension to see how well a dictionary is filled. It will report empty slots, duplicate keys and such. I am about to put it on SourceForge, but here it is:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// This unit is Freeware. It was developed by Jerremy Koot & Ivo Tops. July 2011
//
// Version  By    Changes
// =======  ===== ==============================================================
// v1.02    Ivo   Removed not-working Hashtable support and simplified code
// v1.01    Ivo   Lowered memory usage
// v1.00    I&J   First Version

namespace FastLibrary
{
/// <summary>
/// Static Extension Methods for Dictionary, ConcurrentDictionary and HashSet
/// </summary>
public static class ExtHashContainers
{
    /// <summary>
    /// Checks a dictionary for performance statistics
    /// </summary>
    public static string Statistics<TKey, TValue>(this Dictionary<TKey, TValue> source)
    {
        return ExamineData(source.Keys, source);
    }

    /// <summary>
    /// Checks a concurrent dictionary for performance statistics
    /// </summary>
    public static string Statistics<TKey, TValue>(this ConcurrentDictionary<TKey, TValue> source)
    {
        return ExamineData(source.Keys, source);
    }

    /// <summary>
    /// Checks a HashSet for performance statistics
    /// </summary>
    public static string Statistics<TKey>(this HashSet<TKey> source)
    {
        return ExamineData(source, source);
    }

    private static string ExamineData<TKey>(ICollection<TKey> source, Object hashContainer)
    {
        if (!source.Any()) return "No Data found.";

        // Find Buckets
        var b = GetBuckets(hashContainer);
        if (b < 0) return ("Unable to get Buckets Field for HashContainer");

        // Create our counting temp dictionaries
        var d = new int[b];
        var h = new Dictionary<int, int>(source.Count);

        // Find Hash Collisions and Bucket Stats
        foreach (var k in source)
        {
            var hash = k.GetHashCode() & 0x7FFFFFFF; // Hashes are stripped of sign bit in HashContainers
            int bucket = hash%b; // .NET hash containers do not use negative hashes, and use % for bucket selection
            // Bucket Stats
            d[bucket]++;

            // Hashing Stats
            int c;
            if (h.TryGetValue(hash, out c)) h.Remove(hash);
            else c = 0;
            c++;
            h.Add(hash, c);
        }

        // Do some math
        var maxInBucket = d.Max(q => q);
        var maxSameHash = h.Values.Max(q => q);
        var emptyBuckets = d.Count(q => q == 0);
        var emptyStr = b == 0 ? "0" : ((float) (emptyBuckets)/b*100).ToString("0.0");
        var worstHash = (from i in h where i.Value == maxSameHash select i.Key).FirstOrDefault();

        // Report our findings
        var r = Environment.NewLine + hashContainer.GetType().Name + " has " + b + " buckets with " + source.Count +
                " items. " +
                Environment.NewLine + "The Largest bucket contains " + maxInBucket + " items. " +
                Environment.NewLine + "It has " + (emptyBuckets) +
                " empty buckets (" + emptyStr + "%)" + Environment.NewLine + "Each non-empty bucket has on average " +
                ((source.Count/(float) (b - emptyBuckets))).ToString("0.0") + " items. " + "The " + source.Count +
                " items share " + h.Count +
                " unique hashes. ";
        if (maxSameHash > 1)
            r += Environment.NewLine + "The largest collision has " + maxSameHash +
                 " items sharing the same hash, which == " + worstHash;
        return r;
    }

    private static Int32 GetBuckets(object dictionary)
    {
        var type = dictionary.GetType();
        while (type != null && !type.IsGenericType) type = type.BaseType;
        if (type == null) return -1;

        string field = null;
        if (type.GetGenericTypeDefinition() == typeof (Dictionary<,>)) field = "buckets";
        if (type.GetGenericTypeDefinition() == typeof (ConcurrentDictionary<,>)) field = "m_buckets";
        if (type.GetGenericTypeDefinition() == typeof (HashSet<>)) field = "m_buckets";
        if (field == null) return -1;

        var bucketsField = type.GetField(field, BindingFlags.NonPublic | BindingFlags.Instance);
        if (bucketsField == null) return -1;

        var buckets = bucketsField.GetValue(dictionary);
        if (buckets == null) return -1;

        var length = buckets.GetType().GetProperty("Length");
        return (int) length.GetGetMethod().Invoke(buckets, null);
    }
}
}
Twig answered 30/9, 2012 at 10:17 Comment(0)
