How to specialize std::hash<T> for user defined types?
The Question

What is a good specialization of std::hash for use in the third template parameter of std::unordered_map or std::unordered_set for a user defined type for which all member data types already have a good specialization of std::hash?

For this question, I define "good" as simple to implement and understand, reasonably efficient, and unlikely to produce hash table collisions. The definition of good does not include any statements about security.

The State of What is Google'able

At the moment, two StackOverflow questions are the first hits for a Google search of "std hash specialization".

The first, How to specialize std::hash::operator() for user-defined type in unordered containers?, addresses whether it is legal to open the std namespace and add template specializations.

The second, How to specialize std::hash for type from other library, essentially addresses that same question.

This leaves the current question. Given that implementations of the C++ Standard Library define hash functions for primitive types and for types in the Standard Library, what is a simple and effective way of specializing std::hash for user defined types? Is there a good way to combine hash functions provided by a Standard Library implementation?

(Edit thanks to dyp.) Another question on StackOverflow addresses how to combine a pair of hash functions.

The other Google results are of no more help.

This Dr. Dobbs article states that XOR of two satisfactory hashes will produce a new satisfactory hash.

This article seems to speak from knowledge and implies many things, but is light on details. It contradicts the Dr. Dobbs article with a brief remark in its first example, saying that using XOR to combine hash functions makes for a weak resulting hash function.

Because XOR applied to any two equal values results in 0, I can see why XOR by itself is weak.
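That weakness is easy to demonstrate; a minimal sketch (the specific values are arbitrary):

```cpp
#include <cstddef>
#include <functional>

// XOR of two equal hashes is always zero, and XOR is symmetric:
// (a, b) and (b, a) always collide. Both properties shrink the hash space.
inline std::size_t xor_hash(int a, int b) {
    std::hash<int> h;
    return h(a) ^ h(b);
}
```

Any pair of equal members collapses to zero, and swapping members never changes the result, regardless of how good std::hash<int> is.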

The Meta Question

A well reasoned answer explaining why this question is invalid and cannot be answered generally would also be welcome.

Hypsometer answered 23/6, 2014 at 8:53 Comment(8)
Maybe you should add to the list questions about combining hashes? – Manyplies
Ah, good catch, I didn't find that one. That might actually be an answer. – Hypsometer
Hm, I'm not sure. That answer is for two values; I don't know if the quality of the algorithm holds up when recursively applied to N values. It seems even tuple isn't hashable with the standard facilities; see https://mcmap.net/q/301030/-generic-hash-for-tuples-in-unordered_map-unordered_set – Manyplies
We're working on it in the standard, but for now it's tricky. open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html has a good approach, but it makes it slightly harder for the compiler to optimize. Hopefully we'll be able to work that out in the next 6 months (sorry, standards are slow) and put something in the next experimental release. – Mendelevium
@Manyplies I didn't realize that about tuple! I just tried it, and indeed, my compiler barfed. Interesting. – Hypsometer
There is a public domain partial implementation of open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html here: github.com/HowardHinnant/hash_append/blob/master/hash_append.h and lots of example code using it here: github.com/HowardHinnant/hash_append It isn't completely implementable by you, which is why it must be standardized. However, I am using it well enough in a real-world project right now. It eliminates the combining step and allows you to choose, and easily switch out, which hash algorithm is used, even for primitive types. – Equivalence
Bloomberg has just open-sourced their production-quality N3980 implementation: github.com/bloomberg/bde/blob/master/groups/bsl/bslh/doc/… – Gamin
Possible duplicate of Good Hash Function for Strings – Hannie

One easy way is to use the Boost.Hash library and extend it for your type. It provides the extension function hash_combine (which std::hash lacks), allowing easy composition of the hashes of the individual data members of your structures.

In other words:

  1. Overload boost::hash_value for your own type.
  2. Specialize std::hash for your own type and implement it using boost::hash_value.

This way you get the best of std and boost worlds, both std::hash<> and boost::hash<> work for your type.
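A minimal sketch of the two steps, assuming a hypothetical Point type. So that the example compiles without Boost, hash_combine is inlined here using the formula Boost's documentation publishes; with Boost available you would call boost::hash_combine directly and overload boost::hash_value (or define hash_value in Point's namespace, where boost::hash finds it via ADL):

```cpp
#include <cstddef>
#include <functional>

struct Point { int x, y; };  // illustrative user-defined type

// Step 1: a hash_value overload for Point. Boost's published
// hash_combine formula is inlined so this sketch is self-contained.
inline void hash_combine(std::size_t& seed, std::size_t h) {
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

inline std::size_t hash_value(Point const& p) {
    std::size_t seed = 0;
    hash_combine(seed, std::hash<int>{}(p.x));
    hash_combine(seed, std::hash<int>{}(p.y));
    return seed;
}

// Step 2: specialize std::hash in terms of hash_value.
namespace std {
    template <>
    struct hash<Point> {
        size_t operator()(Point const& p) const { return hash_value(p); }
    };
}
```

With this in place, Point works as a key in std::unordered_map and std::unordered_set (given a matching operator==), and boost::hash<Point> would pick up the same hash_value overload.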


A better way is to use the proposed new hashing infrastructure in N3980 Types Don't Know #. This infrastructure makes hash_combine unnecessary.

Benevolence answered 23/6, 2014 at 9:33 Comment(7)
The problem is that the documented algorithm used by hash_combine is a very poor one. – Sausage
@JamesKanze You could contribute and make it better. – Benevolence
@JamesKanze I wonder what metrics you use to tell a poor hash combiner from a good one? – Benevolence
Mathematical analysis. I don't know of any way of guaranteeing that one is good, but shifting (rather than multiplying by an odd number) means that the initial elements will eventually be shifted out and have no impact on the hash. – Sausage
Boost's hash_combine shifts both right and left, and XORs the result with the initial value, so it won't lose initial values. You generally test hash functions with a program like code.google.com/p/smhasher/wiki/SMHasher. I haven't personally tested Boost's function. – Mendelevium
@JamesKanze Is not (seed << 6) + (seed >> 2) essentially seed * 64.25? – Benevolence
64.25, sort of. (Integer arithmetic doesn't obey quite the same rules as floating point or real arithmetic.) What is actually happening is that you are shifting information from earlier elements out. To avoid this, the multiplier must be odd. (I generally use 127, but this was chosen empirically; other values are probably just as good, although smaller values can lead to clustering near the beginning if the table is very large and the number of elements very small.) – Sausage

First, the Dr. Dobbs article which says that XOR of two satisfactory hashes will produce a satisfactory hash is simply wrong. This is a good recipe for poor hashes. In general, to create a good hash, you start by decomposing your object into subobjects, for each of which there exists a good hash, and combining the hashes. One simple way of doing this is something like:

#include <cstddef>
#include <functional>

class HashAccumulator
{
    std::size_t myValue;
public:
    HashAccumulator() : myValue( 2166136261U ) {}
    template <typename T>
    HashAccumulator& operator+=( T const& nextValue )
    {
        myValue = 127U * myValue + std::hash<T>()( nextValue );
        return *this;
    }
    template <typename T>
    HashAccumulator operator+( T const& nextValue ) const
    {
        HashAccumulator results( *this );
        results += nextValue;
        return results;
    }
    std::size_t value() const { return myValue; }
};

(This has been designed so that you can use std::accumulate if you have a sequence of values.)

Of course, this supposes that all of the subtypes have good implementations of std::hash. For the basic types and strings, this is a given; for your own types, just apply the above rule recursively, specializing std::hash to use the HashAccumulator on its subtypes. For a standard container of a basic type, it's a bit trickier, because you're not (formally, at least) allowed to specialize a standard template on a type from the standard library; you'll probably have to create a class which uses HashAccumulator directly, and explicitly specify it if you need a hash of such a container.
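A self-contained sketch of the std::accumulate usage, using a compact variant of the accumulator (here called HashAcc, with an accessor so the result can be read out; the names are illustrative):

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

struct HashAcc {
    std::size_t value = 2166136261U;  // FNV-style offset basis as the seed
    template <typename T>
    HashAcc operator+(T const& next) const {
        HashAcc r(*this);
        // Multiply by an odd constant, then add the element's hash.
        r.value = 127U * r.value + std::hash<T>{}(next);
        return r;
    }
};

// Fold the accumulator over a sequence: std::accumulate uses operator+.
template <typename Seq>
std::size_t hash_sequence(Seq const& seq) {
    return std::accumulate(seq.begin(), seq.end(), HashAcc{}).value;
}
```

Because the multiplier is odd, earlier elements are never shifted out entirely, so the result depends on the order of the elements, which is usually what you want for sequences.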

Sausage answered 23/6, 2014 at 11:26 Comment(1)
See this answer to a related question for why XOR is not a good way of combining hashes: https://mcmap.net/q/127646/-why-is-xor-the-default-way-to-combine-hashes – Preexist

Your operation is required to

  • Return a value of type size_t
  • Be consistent with the == operator.
  • Have a low probability of a hash collision for unequal values.

There is no explicit requirement that the hash values are evenly distributed over the range of size_t integers. cppreference.com notes that

some implementations [of the standard library] use trivial (identity) hash functions which map an integer to itself

Avoiding hash collisions, coupled with that weakness, means that a specialization of std::hash for your types should never simply use a (fast) bitwise XOR (^) to combine the sub-hashes of your data members. Consider this example:

 struct Point {
    uint8_t x;
    uint8_t y;
 };

 namespace std {
    template<>
    struct hash< Point > {
       size_t operator()(const Point &p) const {
          return hash< uint8_t >{}(p.x) ^ hash< uint8_t >{}(p.y);
       }
    };
 }

The hashes of p.x will be in the range [0,255], and so will the hashes of p.y. Therefore the hashes of a Point will also be in the range [0,255], with 256 (=2^8) possible values. There are 256*256 (=2^16) unique Point objects (a std::size_t will typically support 2^32 or 2^64 values). So the probability of a hash collision for a good hashing function should be approximately 2^(-16). Our function gives the probability of a hash collision of just under 2^(-8). This is terrible: our hash provides only 8 bits of information, but a good hash should provide 16 bits of information.
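The 8-bit claim can be checked exhaustively. A minimal sketch, applying XOR directly to the raw byte values (which is what the weak hash above collapses to when hash<uint8_t> behaves like the identity):

```cpp
#include <cstddef>
#include <unordered_set>

// Count the distinct values XOR can produce over all 2^16 byte pairs.
std::size_t distinct_xor_values() {
    std::unordered_set<int> seen;
    for (int x = 0; x < 256; ++x)
        for (int y = 0; y < 256; ++y)
            seen.insert(x ^ y);
    return seen.size();
}
```

The count comes out to exactly 256: two bytes XORed together can never produce more than 256 distinct values, so 65536 distinct points are crammed into 256 hash buckets.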

If the hashing functions of your data members only produce hash values in the low part of the std::size_t range, you must "shift" the bits of the component hashes before combining them, so that each contributes independent bits of information. A left shift looks simple

       return (hash< uint8_t >{}(p.x) << 8) ^ hash< uint8_t >{}(p.y);

but that will discard information (due to overflow) if the implementation of hash< uint8_t > (in this case) attempts to spread the hash-code values over the std::size_t range.

Accumulating component hash-code values using a multiply-by-prime-and-add method, as typically done in Java, probably works better in general:

 namespace std {
    template<>
    struct hash< Point > {
       size_t operator()(const Point &p) const {
          const size_t prime = 257;
          size_t h {hash< uint8_t >{}(p.x)};
          h = h * prime + hash< uint8_t >{}(p.y);
          return h;
       }
    };
 }
Preexist answered 5/12, 2017 at 16:29 Comment(0)

Until we get a library in the standard to help with this:

  1. Download a modern hasher, like SpookyHash: http://burtleburtle.net/bob/hash/spooky.html.
  2. In the definition of std::hash<YourType>, create a SpookyHash instance and Init it. Note that picking a random number at process startup or at std::hash construction, and using it as the initialization, will make it slightly harder to DoS your program, but doesn't fix the problem.
  3. Take each field in your structure that contributes to operator== ("salient field"), and feed it into SpookyHash::Update.
    • Beware types like double: they have 2 representations as char[] that compare ==: -0.0 and 0.0. Also beware types that have padding. On most machines, int doesn't, but it's hard to tell if a struct will. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#is_contiguously_hashable discusses this.
    • If you have sub-structures, you'll get a faster, higher-quality hash value from recursively feeding their fields into the same SpookyHash instance. However, this requires adding a method to those structures or manually extracting the salient fields: if you can't do this, it's acceptable to just feed their std::hash<> value into the top-level SpookyHash instance.
  4. Return the output of SpookyHash::Final from std::hash<YourType>.
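SpookyHash itself isn't reproduced here; as a stand-in with the same Init/Update/Final shape, the steps above can be sketched with a toy FNV-1a streamer (Point and its fields are illustrative, and a real hasher such as SpookyHash should replace StreamHasher in practice):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Toy streaming hasher with SpookyHash's Init/Update/Final shape.
// FNV-1a is a stand-in only; it is NOT DoS-resistant.
class StreamHasher {
    std::uint64_t state;
public:
    void Init(std::uint64_t seed) { state = 14695981039346656037ULL ^ seed; }
    void Update(void const* data, std::size_t len) {
        auto p = static_cast<unsigned char const*>(data);
        for (std::size_t i = 0; i < len; ++i) {
            state ^= p[i];
            state *= 1099511628211ULL;
        }
    }
    std::uint64_t Final() const { return state; }
};

struct Point { double x, y; };  // illustrative type with salient fields x, y

namespace std {
    template <>
    struct hash<Point> {
        size_t operator()(Point const& p) const {
            StreamHasher h;
            h.Init(0);  // a per-process random seed would hinder hash flooding
            // Normalize -0.0 to 0.0 so values that compare == hash equally.
            double x = p.x == 0.0 ? 0.0 : p.x;
            double y = p.y == 0.0 ? 0.0 : p.y;
            h.Update(&x, sizeof x);  // feed each salient field (step 3)
            h.Update(&y, sizeof y);
            return static_cast<size_t>(h.Final());  // step 4
        }
    };
}
```

Note the -0.0 normalization, which handles the double representation pitfall mentioned in step 3.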
Mendelevium answered 23/6, 2014 at 9:24 Comment(1)
Bloomberg has open-sourced their production-quality N3980 implementation: github.com/bloomberg/bde/blob/master/groups/bsl/bslh/doc/… The SpookyHash implementation is found here: github.com/bloomberg/bde/blob/master/groups/bsl/bslh/… – Gamin
