Why std::hash<int> seems to be identity function

Asked 11/7, 2016 at 10:39 Answered 11/7, 2016 at 12:1

Solved c++debugging c++11 identity stdhash

#include <iostream>

int main() {
    std::hash<int> hash_f;
    std::cout << hash_f(0) << std::endl;
    std::cout << hash_f(1) << std::endl;
    std::cout << hash_f(2) << std::endl;
    std::cout << hash_f(3) << std::endl;
}

I compile with "g++ main.cpp -std=c++11" and then the result is :

Why is this happening? I don't use any library and I don't have a specialized hashing function.

Addendum : I wanted to define the hash for an unordered_set of unordered_set of int with the hash of a set being the sum of its components hashs, but if it's just identity it's not cool because the hash of {2,4} is the same than the hash of {1,5}. The simplest way to avoid that is may be to use the std::hash double function.

Floeter answered 11/7, 2016 at 10:39 Comment(8)

Why do you expect it to not be the identity function? – Gravitative 11/7, 2016 at 10:41

f(x) = x is perfectly unique for every input. :) – Retentivity 11/7, 2016 at 10:50

Identity is the obvious choice for a simple hash over integers. (Obviously, it would be a poor choice for a cryptographic hash - that needs to be a one-way function). – Imitable 11/7, 2016 at 10:58

I don't really understand your addendum. Either way please do not change the question after-the-fact. – Something 11/7, 2016 at 11:55

@dyp: Those are not even the same languages! – Something 11/7, 2016 at 11:55

@LightnessRacesinOrbit These are intentionally not the same languages. Other languages have to solve a similar problem, albeit with different trade-offs. – Brutal 11/7, 2016 at 13:2

This might help: github.com/HowardHinnant/hash_append – Rae 11/7, 2016 at 16:34

The other answers cover the rationale behind the identity function very well. To address your addendum:

I wanted to define the hash of an unordered_set as the sum of its components hashs, but if it's just identity it's not cool because the hash of {2,4} is the same than the hash of {1,5}. The simplest way to avoid that is may be to use the std::hash function.

As you can see, using the + operator to combine hashes is not the best idea. To be more robust, you could use the XOR (^) operator, or take inspiration from the approach taken, e.g., by boost::hash_combine (details in this SO post):

seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

As an example, for your two integer pairs (1,5 / 2,4), and a seed of 0, this would work out to

uint32_t seed = 0;
seed ^= 1 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
seed ^= 5 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
// 3449077526

uint32_t seed = 0;
seed ^= 2 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
seed ^= 4 + 0x9e3779b9 + (seed << 6) + (seed >> 2);
// 3449077584

Oliguria answered 11/7, 2016 at 12:1 Comment(5)

why you can't just use : seed ^= x ? why do you need + 0x9e3779b9 + (seed << 6) + (seed >> 2) ? – Weichsel 11/7, 2016 at 12:4

You can just go for a simple XOR (as I wrote); but have a look at the SO article I linked above for some rationale behind the Boost approach. – Oliguria 11/7, 2016 at 12:8

A simple practical example: combining the hashes of two 0s would in this case be 0 ^ 0, which is again 0. But you could argue that two 0s should produce a different hash from a single 0. Another rationale is that two consecutive hash values should be far apart from each other. – Oliguria 11/7, 2016 at 12:9

Thanks a lot and sorry, I can't select 2 answers, I should have ask an other question for the addendum and select your answer. – Weichsel 11/7, 2016 at 12:16

No worries, it's fine as it is. – Oliguria 11/7, 2016 at 12:30

It seems its identity, its allowed as its distinct.. From cpp reference

The actual hash functions are implementation-dependent and are not required to fulfill any other quality criteria except those specified above. Notably, some implementations use trivial (identity) hash functions which map an integer to itself. In other words, these hash functions are designed to work with unordered associative containers, but not as cryptographic hashes, for example. ....

Wayless answered 11/7, 2016 at 10:47 Comment(0)

It seems completely reasonable for the hash function int→int to be identity, and it's not clear why you are surprised by that. Performing any further computation would be pointless. This is, in fact, a perfect hash in every sense of the term.

Remember, std::hash is supposed to (almost uniquely) identify values, not encrypt them.

It's only when you want to hash types larger than that of the hash itself (say, uint9999999_t) that you need to do some work to "compress" the value into the size of the hash.

Something answered 11/7, 2016 at 11:0 Comment(7)

For an input which is bounded by values greater than int min and smaller than int max, e.g. (-1000, +1000), the distribution will be ... not optimal. – Brutal 11/7, 2016 at 11:3

@dyp: If you have such an input then why are you storing it in an int? You're always free to provide your own range-aware hash function customised for your application, but I see no reason for the default implementation to make any assumptions in that regard. – Something 11/7, 2016 at 11:4

Add a single entry of INT_MAX and you have the same issue, except that you have to use int. To me, it is debatable whether or not the default implementation should try to compensate common skews in the input data; therefore to me the question is, what are the arguments why not to try to do that? – Brutal 11/7, 2016 at 11:18

@dyp: I don't think you can come up with any "common skews". You're chasing optimisations without any data! If the default implementation optimises for case X then it's suddenly suboptimal for case Y. Let the default do the sensible, obvious thing then replace it with your own specific hash function if you need one. – Something 11/7, 2016 at 11:26

Actually, thinking about it again, the modulo applied after the hash will distribute things too, so the impact is not as high as I originally thought. Distributions that collide with a trivial hash functions look much more weird, e.g. (N, N+10) where N is the applied modulo. -- Yes, I don't have any data for "common skews", nor do I know if there are few enough of them to optimize for. But it's not me who decides on a default; a default should be there if it works well enough for many cases and then you need those cases to test. Otherwise, no default should be provided IMHO. – Brutal 11/7, 2016 at 11:53

according to cppreference.com, the hash function object seems to be int-->size_t, not int-->int. – Sharma 11/7, 2016 at 20:5

"Performing any further computation would be pointless. This is, in fact, a perfect hash in every sense of the term." - this completely misses the trade-offs here, namely between identity hashes post modulo table size (which is what matters) being extremely collision prone in the worst case (even with a prime number of buckets), vs saving CPU time on stronger hashing, folding near-incrementing values across buckets more uniformly than stronger hashing does, and (very minor benefit) having better cache locality if lookups happen to be done in order of increasing key. – Godchild 30/5, 2019 at 1:51