Does std::hash guarantee equal hashes for "equal" floating point numbers?

Is the floating point specialisation of std::hash (say, for doubles or floats) reliable regarding almost-equality? That is, if two values (such as (1./std::sqrt(5.)/std::sqrt(5.)) and .2) should compare equal but will not do so with the == operator, how will std::hash behave?
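
A minimal check illustrating the concern (the exact output is of course implementation dependent):

#include <cmath>
#include <functional>
#include <iostream>

int main()
{
    double a = 1. / std::sqrt(5.) / std::sqrt(5.);
    double b = .2;

    std::cout << std::boolalpha
              << "a == b : " << (a == b) << '\n'                  // often false, due to rounding
              << "hash(a): " << std::hash<double>()(a) << '\n'
              << "hash(b): " << std::hash<double>()(b) << '\n';
}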

So, can I rely on a double as an std::unordered_map key to work as expected?


I have seen "Hashing floating point values" but that asks about boost; I'm asking about the C++11 guarantees.

Toluene answered 18/2, 2013 at 19:25 Comment(9)
"should compare equal" wut?Arboreal
They had better have different hashes, otherwise the hashing is broken.Oestradiol
An answer to this question says it's implementation dependent: #7758893Remotion
@DavidSchwartz Um, unequal objects having the same hash value is pretty much unavoidable, and the source of 99% of what goes into a hash table.Hampshire
@delnan: Of course. But it's critical that it not be possible for a simple, common algorithm to trivially produce large numbers of values with the same hash. (Except, of course, one maliciously constructed specifically to collide.) It's trivial and typical to produce "almost equal" floats and want to store them in the same collection.Oestradiol
@Toluene The reason that your two expressions don't compare as equal is NOT that there are two binary expressions of 0.2, but that there is NO exact (finite) binary representation of 0.2, or sqrt(5) !Solidago
The requirement is that the same value produce the same hash value. A good hash function produces different hash values for different values to the extent possible. And, no, 1./sqrt(5.)/sqrt(5.) is not necessarily equal to 0.2, and if it's not, hashing it is not required to produce the same value as hashing 0.2.Cymene
@DavidSchwartz I wouldn't expect a large number of "almost equal" floats in most cases, at least for reasonably small epsilon (just cutting off after is likely awful, I agree, but cutting off after the sixth decimal digit or something may be perfectly okay).Hampshire
@delnan: It's not a matter of "most cases", it's a matter of every reasonable case. A large number of almost equal floats is a reasonable case. In fact, it's one of the most common test cases for hash functions.Oestradiol

std::hash has the same guarantee for all types over which it can be instantiated: if two objects compare equal, their hash codes will be equal. Otherwise, there's a very large probability that they won't. So you can rely on a double as a key in an unordered_map to work as expected: if two doubles are not equal (as defined by ==), they will probably have different hashes (and even if they don't, they're still different keys, because unordered_map also checks for equality).

Obviously, if your values are the results of inexact calculations, they aren't appropriate keys for unordered_map (nor perhaps for any map).
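
A small sketch of the consequence (assuming the two expressions from the question round to different doubles, which is typical but not guaranteed): unordered_map simply treats them as two distinct keys.

#include <cmath>
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<double, int> counts;
    ++counts[1. / std::sqrt(5.) / std::sqrt(5.)];
    ++counts[.2];

    // If the two expressions round to different doubles, they are two
    // distinct keys: they compare unequal and almost certainly hash
    // differently, so this prints 2.
    std::cout << counts.size() << '\n';
}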

Askew answered 18/2, 2013 at 19:42 Comment(4)
Ah, your point about unordered_map checking for strict equality is a very good one (I neglected that). So even if the std::hash would behave as I intended, I wouldn't have gained anything as == would "break" it again. Thanks.Toluene
Can you give an example number that would be a problematic key that could actually occur? Which combination of bits of a double is ambiguous? NaN's with unusual mantissa bit patterns? This answer reinforces fear of floating point, doesn't it?Thorner
@Thorner I gave one in the question.Toluene
@Toluene If you used integers between -(2^53) and +(2^53) for the key it will work perfectly; double can represent that entire range exactly. It can represent more than that exactly: anything with about 50 significant binary digits will be fine. I don't see a problem with two differently rounded calculations being not equal in your question, what did you expect? 1/3 can't be written down in base ten either (0.33333....); that doesn't mean everything is wrong. I can write 1/3 exactly in base 3, though: 0.1. Same issue between base 10 and base 2.Thorner

Multiple problems with this question:

  • The reason that your two expressions don't compare as equal is NOT that there are two binary representations of 0.2, but that there is NO exact (finite) binary representation of 0.2, or of sqrt(5)! So in fact, while (1./std::sqrt(5.)/std::sqrt(5.)) and .2 should be the same algebraically, they may well not be the same in computer-precision arithmetic. (They aren't even in pen-and-paper arithmetic with finite precision. Say you are working with 10 digits after the decimal point. Write out sqrt(5) with 10 digits and calculate your first expression. It will not be .2.)

  • Of course you have a sensible concept of two numbers being close. In fact you have at least two: one absolute (|a-b| < eps), one relative. But that doesn't translate into sensible hashes. If you want all numbers within eps of each other to have the same hash, then 1, 1+eps, 1+2*eps, ... would all have the same hash, and therefore ALL numbers would have the same hash. That is a valid but useless hash function. Yet it is the only one that satisfies your requirement of mapping nearby values to the same hash!
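
To make the chain argument concrete, here is a minimal sketch (assuming a hypothetical tolerance eps = 1e-6): every consecutive step differs by less than eps, so a hash that respected almost-equality would have to give the whole chain, and hence 1.0 and 2.0, the same value.

#include <iostream>

int main()
{
    const double eps = 1e-6;   // hypothetical tolerance for "almost equal"
    double x = 1.0;
    long steps = 0;

    // Every consecutive pair in this walk satisfies |a - b| < eps, i.e.
    // is "almost equal". A hash respecting almost-equality would have to
    // assign the same value to every element of the chain, and therefore
    // to 1.0 and 2.0 as well.
    while (x < 2.0) {
        x += eps / 2;
        ++steps;
    }
    std::cout << steps << " almost-equal steps link 1.0 to 2.0\n";
}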

Solidago answered 18/2, 2013 at 19:43 Comment(3)
You have convinced me: every double should hash to 7.Arboreal
I am under the impression that the author is aware of these things, and the answer is only around-, not on-topic.Fatsoluble
@Fatsoluble That's for the author to say. Reading his post and comments, I was under the impression that these are points that required elaborating on. Both you and the author are free to downvote my answer if you deem it irrelevant.Solidago

Behind the default hashing of an unordered_map there is an std::hash struct whose operator() computes the hash of a given value.

A set of default specializations of this template is available, including std::hash<float> and std::hash<double>.

On my machine (LLVM+clang) these are defined as

template <>
struct hash<float> : public __scalar_hash<float>
{
    size_t operator()(float __v) const _NOEXCEPT
    {
        // -0.0 and 0.0 should return same hash
        if (__v == 0)
            return 0;
        return __scalar_hash<float>::operator()(__v);
    }
};

where __scalar_hash is defined as:

template <class _Tp>
struct __scalar_hash<_Tp, 0> : public unary_function<_Tp, size_t>
{
    size_t operator()(_Tp __v) const _NOEXCEPT
    {
        union
        {
            _Tp    __t;
            size_t __a;
        } __u;
        __u.__a = 0;
        __u.__t = __v;
        return __u.__a;
    }
};

Essentially, the hash is built by writing the source value into a union and then reading back a piece that is as large as a size_t.

So the value either gets padded or truncated, but that doesn't really matter, because as you can see the raw bits of the number are used to compute the hash. This means it behaves exactly like the == operator: for two floating point numbers to have the same hash (excluding collisions caused by truncation), they must be the same value.
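
A minimal portable sketch of the same idea (assuming size_t is at least as wide as double, as on typical 64-bit platforms; memcpy is used instead of the union, since reading an inactive union member is technically undefined behaviour in portable C++, even though the library can rely on its own compiler):

#include <cstddef>
#include <cstring>
#include <iostream>

// Hash a double by its raw bit pattern, copied out with memcpy.
std::size_t bit_hash(double v)
{
    static_assert(sizeof(double) <= sizeof(std::size_t),
                  "sketch assumes size_t can hold all the bits of a double");
    if (v == 0.0)              // fold -0.0 and 0.0 together, as the library does
        return 0;
    std::size_t h = 0;
    std::memcpy(&h, &v, sizeof v);
    return h;
}

int main()
{
    // 1./5. rounds to the same double as the literal .2, so the two
    // printed hashes are identical: equal values, equal bits, equal hash.
    std::cout << bit_hash(.2) << ' ' << bit_hash(1. / 5.) << '\n';
}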

Tweak answered 12/12, 2014 at 4:52 Comment(0)

There is no rigorous concept of "almost equality". So behavior can't be guaranteed in principle. If you want to define your own concept of "almost equal" and construct a hash function such that two "almost equal" floats have the same hash, you can. But then it will only be true for your particular notion of "almost equal" floats.
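
One concrete (and purely hypothetical) way to do that, along the lines of the partitioning idea discussed in the comments below: quantise each value to a fixed grid and key the map on the bucket index rather than the double itself. Nearby values usually share a bucket, but values on opposite sides of a bucket boundary never do, however close they are.

#include <cmath>
#include <iostream>
#include <unordered_map>

// Hypothetical helper: partition the doubles into buckets of width 1e-6
// and use the bucket index as the key. Values inside the same bucket
// compare and hash equal; values that straddle a bucket boundary do not,
// no matter how close they are. That is the price of any such scheme.
long long bucket_of(double v)
{
    return static_cast<long long>(std::floor(v / 1e-6));
}

int main()
{
    std::unordered_map<long long, int> counts;
    ++counts[bucket_of(1. / std::sqrt(5.) / std::sqrt(5.))];
    ++counts[bucket_of(.2)];
    std::cout << counts.size() << '\n';   // 1 if both values fall in the same bucket
}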

Oestradiol answered 18/2, 2013 at 19:29 Comment(14)
I'd say that it's a mathematical impossibility to define a sensible notion of "almost equal" that will allow a well-behaved hash function respecting that notion.Solidago
It's hard to imagine a concept of "almost equal" for which this actually could work, unless we just use the same hash for all values. I mean, if 1.00000 is "almost equal" to 1.00001 is "almost equal" to 1.00002 and so on, then any two numbers will be linked by a chain of "almost equal" pairs.Ultra
Well, there are different representations of the exact same value in floating point numbers, 0 is one example, 0.2 is another (see OP). So I would expect these to compare equal, if a hash function is provided. Otherwise the implementation basically performs a bit by bit comparison, which is useless for floating points.Toluene
@Toluene - no, there is only one representation for 0.0, a different representation for -0.0, and only one representation for 0.2. What the OP is talking about is a computation that, because of finite precision, ends up with a value that is not 0.2, even though algebraically it would be.Cymene
@ruakh: Yeah, no. You can compare values with a granularity of epsilon to decide if they are equal or not. As far as I can tell, that would separate 1.00000 and 1.00001 even transitively.Toluene
@PeteBecker: For starters, I am the OP, so I should better know what I'm asking :) Second, -0 and 0 are equal values, but as you say they don't compare equal. As does 0.2 and that fraction.Toluene
@Toluene - -0 and 0 are "equal values" for some definitions of "equal" and not for others. 1.0/-0.0 is negative infinity; 1.0/0.0 is positive infinity. This is a pretty good indication that they are not "equal" (for some definitions of "equal"). And 0.2 and that fraction are not necessarily the same value, because the calculation of that fraction uses truncated values. It's just like int a = 1/3; int b = a * 3; not setting b to 1, but a little less obvious.Cymene
@bitmask: No, that would not separate 1.00000 and 1.00001 transitively. Think about it -- at each stage going from 1.00000 to 1.00001, you're allowed to add at least the full value of epsilon. You'll never hit, say, an impassible barrier that you can only approach asymptotically.Ultra
@Solidago It's actually not very difficult. Just partition the set of all values in some way, convert each value to a normalized (representative) value of the partition its in, and use this normalized value for hashing and comparison. But you have to partition; you cannot use +/-epsilon.Askew
@Toluene 0.0 and -0.0 do compare equal. And the results of your expression are not guaranteed to be 0.2---for that matter, on most machines, they are guaranteed not to be 0.2, because 0.2 isn't representable.Askew
@James I am aware of that. But once you partition your range, you get useless behaviour at the boundaries. Two values on different sides of a boundary will then have different hashes, no matter how close they are. That defeats the entire point of the construction.Solidago
@Solidago Whether you get useless behavior at the boundaries depends on what you're doing (or have done) with the values. If the operations are such that you should come out with an integral value, for example, partitioning can make a lot of sense.Askew
@PeteBecker: My point was that double is an approximation of the real numbers. And you can very easily show that the additive inverse of zero is zero, hence 0 == -0.Toluene
@Toluene - analyzing real numbers is a very bad way of understanding how double works precisely because double is an approximation of real numbers. Yes, 0.0 == -0.0, because that's convenient. But they have different representations, and can give different results when plugged into the same calculation.Cymene
