Hash function for floats
Asked Answered
U

8

26

I'm currently implementing a hash table in C++ and I'm trying to make a hash function for floats...

I was going to treat floats as integers by padding the decimal numbers, but then I realized that I would probably reach the overflow with big numbers...

Is there a good way to hash floats?

You don't have to give me the function directly, but I'd like to see/understand different concepts...

Notes:

  1. I don't need it to be really fast, just evenly distributed if possible.

  2. I've read that floats should not be hashed because of the speed of computation, can someone confirm/explain this and give me other reasons why floats should not be hashed? I don't really understand why (besides the speed)

Unsuspecting answered 21/11, 2010 at 13:40 Comment(0)
C
16

It depends on the application but most of time floats should not be hashed because hashing is used for fast lookup for exact matches and most floats are the result of calculations that produce a float which is only an approximation to the correct answer. The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer. This type of check does not lend itself to hashed lookup tables.

EDIT:

Normally, because of rounding errors and inherent limitations of floating point arithmetic, if you expect that floating point numbers a and b should be equal to each other because the math says so, you need to pick some relatively small delta > 0, and then you declare a and b to be equal if abs(a-b) < delta, where abs is the absolute value function. For more detail, see this article.

Here is a small example that demonstrates the problem:

float x = 1.0f;
x = x / 41;
x = x * 41;
if (x != 1.0f)
{
    std::cout << "ooops...\n";
}

Depending on your platform, compiler and optimization levels, this may print ooops... to your screen, meaning that the mathematical equation x / y * y = x does not necessarily hold on your computer.

There are cases where floating point arithmetic produces exact results, e.g. reasonably sized integers and rationals with power-of-2 denominators.

Cysticercoid answered 21/11, 2010 at 13:49 Comment(9)
Could you please explain a little more? "The usually way to check for floating equality is to check if it is within some delta (in absolute value) of the correct answer."Unsuspecting
+1 -- The answer is not to do it in the first place. Don't use floats as keys in maps or hash-tables; you'll run into problems sooner or later.Winded
@Leo Davidson I know I'll run in troubles, the goal of this exercise is to find when exactly ;-)Unsuspecting
Downvote because it doesn't answer the question. I am here because I need non-exact hashes. Advice on dangers is all nice and well, but answering the question is better.Beutner
It is often a disservice to the questioner to simply answer questions as asked. I speak from experience on this forum. Someone asking to hash floats is probably pursuing a wrong course, especially considering the question as asked. If you want to ask a question about fuzzy lookups, about equivalence classes of floating point numbers, and so on that is a different question.Cysticercoid
Just because the asked question is usually the wrong approach doesn't mean it always is. Sometimes programmers need to do things not because it is a good idea, but nonetheless, there are external factors compelling it (e.g. a stubborn boss, shortsighted requirements, etc.). Or there might even be unusual circumstances where it does make sense. Answering the question as-is would be a disservice, but part of the purpose of Stack Overflow is to provide a knowledge base for future Google searchers. The actual question shouldn't be entirely ignored.Nabob
@PresidentJamesK.Polk Just because you’re unaware of valid use cases doesn’t mean there’re no such use cases. For instance, STL files en.wikipedia.org/wiki/STL_(file_format) contain a soup of triangles, yet modern 3D GPUs are performing best on indexed meshes: vertices shared across triangles save gigaflops of computations in vertex shaders.Fuzzy
@Soonts: What is the evidence that the OP's case was one of those use cases? Where do I claim that there are no such use cases?Cysticercoid
I was just optimizing VertexArray/IndexArray builder, which uses std::map and takes 12 seconds to build the map for 20M triangles with 60M vertices, resulting in 8M unique vertices. std::unordered_map handles same workload in 3.6 seconds with really bad hash function, was just Googling around and found this, not very useful I must say. Yes, PJKP is correct but the answer is off the rails. Precise comparison for VB/IB building is the correct thing to do as the unique vertices are built from the same values collected from the triangle soup and combined in the VB.Alodie
P
13

You can use the std hash, it's not bad:

 std::size_t myHash = std::cout << std::hash<float>{}(myFloat);
Polak answered 30/9, 2016 at 11:4 Comment(0)
Z
12

If your hash function did the following you'd get some degree of fuzziness on the hash lookup

unsigned int Hash( float f )
{
    unsigned int ui;
    memcpy( &ui, &f, sizeof( float ) );
    return ui & 0xfffff000;
}

This way you'll mask off the 12 least significant bits allowing for a degree of uncertainty ... It really depends on yout application however.

Zomba answered 21/11, 2010 at 14:3 Comment(7)
No, 0xfffff000 masks off 3 nibbles, which is 12 bits. Probably a little too much. If you want to mask off 3 bits, use 0xfffffff8.Veriee
@FredOverflow: No .. you are right .. I didn't mean 3 ... mind failure there. changedZomba
@Goz: this depends on the internal representation of float on the target machine though, since you assume here than the mantissa is located in the least significant bits, and is stored in little-endian fashion. Though the idea of fuzziness is definitely the way to go.Coulee
You're still going to end up with pairs of numbers with a very small relative difference that end up in different bins.Wherefore
@Ben: Correct, but that only happens in 1 of 4096 cases (if "very small" means direct neighbors in the set of float values). Plus, this solution has the beautiful property of equality being transitive, that is, Hash(x) == Hash(y) and Hash(y) == Hash(z) implies Hash(x) == Hash(z), which does not hold if you use the good old delta trick of comparing floats directly. I really like this solution!Veriee
@Matthieu: You are, of course, quite right. However for it to become a problem a big endian float would have to get stored as a little endian (or vice-versa). Otherwise the mask is just in the correct endian format too and everything is good. Seeing as I haven't yet come across a platform that has different endian rules for float and integers I'm going to say your point is worth bearing in mind but largerly irrelevant.Zomba
@Ben: Of course you will. If you are bucketing, which a hash algorithm necessarily does, you will always have this issue. imagine buckets at every 0.1 that go to 0.05 either side. that means that 1.4999999 goes in 1 bucket and 1.5 goes in another. You just have to live with that or ditch any form of bucketing ...Zomba
T
7

You can of course represent a float as an int type of the same size to hash it, however this naive approach has some pitfalls you need to be careful of...

Simply converting to a binary representation is error prone since values which are equal wont necessarily have the same binary representation.

An obvious case: -0.0 wont match 0.0 for example. *

Further, simply converting to an int of the same size wont give very even distribution, which is often important (implementing a hash/set that uses buckets for example).

Suggested steps for implementation:

  • filter out non-finite cases (nan, inf) and (0.0, -0.0 whether you need to do this explicitly or not depends on the method used).
  • convert to an int of the same size
    (that is - use a union for example to represent the float as an int, not simply cast to an int).
  • re-distribute the bits, (intentionally vague here!), this is basically a speed vs quality tradeoff. But if you have many values in a small range you probably don't want them to in a similar range too.

*: You may wan't to check for (nan and -nan) too. How to handle those exactly depends on your use case (you may want to ignore sign for all nan's as CPython does).

Python's _Py_HashDouble is a good reference for how you might hash a float, in production code (ignore the -1 check at the end, since that's a special value for Python).

Tag answered 16/2, 2015 at 21:12 Comment(6)
The obvious case of “-0.0 wont match 0.0 for example” is the only example of a pair of floating-point values that are equal for == and have differing representations, so I am not sure why you make a generality out of it. Infinities certainly do not need to be filtered out. Some have (seriously) recommended to return a random integer for hash(NaN), but it seems more sound to simply treat the use of NaN as key in a hashtable as an error: research.swtch.com/randhashIsaiasisak
PS: the blog post I linked to was posted on April 1st. I didn't realize this because I read it from the archives. It may not be serious, but at the same time, a random result for hash(NaN) means the binding(s) with NaN as key are present in the hashtable and can be iterated on, so it is actually a good solution for some usecases.Isaiasisak
@Pascal Cuoq - Exactly how you deal with !finite values is up to your own implementation, I'm simply stating you should be aware of them when hashing floats, and simply converting a float to an int as is suggested in other answers is overlooking rather a lot. re: -0 vs 0 - there is -nan / nan but how to class these may depend on your own preference (you may want to ignore the sign of a nan as Python does). Updated the answer.Tag
Note, would strongly suggest NOT to return a random integer from nan. that was suggested as a joke for a reason. keep the hash function deterministic, or use some kind of error is more a implementation detail.Tag
It does not seem like a joke to me. “Deterministic” means that x == y implies hash(x) == hash(y), which remains true for NaN and NaN even if hash(NaN) is defined as random. The reasons why it is desirable not to return the same value very time are much stronger than some misconception about “deterministic” and are explained in the blog post.Isaiasisak
You could draw a distinction between "numeric equality" and "value identity" here, If you impliment a set which makes use of a hash, having nan show up multiple times isn't useful. Of course you could make some argument for it, but even then its possible nan will randomly give the same hash multiple times.... dont do it. At that point its a discussion between you and whoever uses the data structure. But as I said before, there is a reason this isnt common practice.Tag
V
6
unsigned hash(float x)
{
    union
    {
        float f;
        unsigned u;
    };
    f = x;
    return u;
}

Technically undefined behavior, but most compilers support this. Alternative solution:

unsigned hash(float x)
{
    return (unsigned&)x;
}

Both solutions depend on the endianness of your machine, so for example on x86 and SPARC, they will produce different results. If that doesn't bother you, just use one of these solutions.

Veriee answered 21/11, 2010 at 13:45 Comment(7)
Aren't there some standard functions that can be used to grab the mantissa and exponent? I'm not a float kind of guy, or much of C++ guy either, so I was just wondering ...Cysticercoid
@GregS: Not as far as I know. Why would you want to grab the mantissa and exponent, anyway? A float is 32 bit, why not simply interpret that as an unsigned? As long as you avoid NaNs, you should be fine...Veriee
@FredOverflow: I was just guessing that grabbing the mantissa and exponent separately would produce less machine- and compiler-dependent results. I would still depend on the sizes of the mantissa and the exponent which might turn out to be just as compiler and machine dependent.Cysticercoid
@FredOverflow making a use of an uninitialized variable wouldn't return 2 different results for the same value passed in parameter? And also, what does the second way do exactly..?Unsuspecting
@Pacane: Are you referring to the variable u? The union hack is based upon the assumption that f and u share the same memory, hence u is "initialized" by writing to f. Yes, this is highly implementation-specific, but it usually works. The second way is a reference-cast. It introduces another way to access the object x (of type float) as if it were an object of type unsigned.Veriee
@FredOverflow About the reference-cast, is there some lost data in the process, or it just pads the number until it's unsigned?Unsuspecting
@Pacane: What do you mean by "pad"? There is no value conversion going on whatsoever, if that's what you think. For example, hash(3.14f) does not yield 3, but 1078523331, because both values are represented by the machine word 0x4048f5c3. Of course this assumes that int and float both are 32 bit types, which is highly implementation-specific etc. (You can think of the reference cast as basically shorthand for *(unsigned*)&x.)Veriee
G
1

If you're interested, I just made a hash function that uses floating point and can hash floats. It also passes SMHasher ( which is the main bias-test for non-crypto hash functions ). It's a lot slower than normal non-cryptographic hash functions due to the float calculations.

I'm not sure if tifuhash will become useful for all applications, but it's interesting to see a simple floating point function pass both PractRand and SMHasher.

The main state update function is very simple, and looks like:

function q( state, val, numerator, denominator ) {
  // Continued Fraction mixed with Egyptian fraction "Continued Egyptian Fraction"
  // with denominator = val + pos / state[1]
  state[0] += numerator / denominator;
  state[0] = 1.0 / state[0];

  // Standard Continued Fraction with a_i = val, b_i = (a_i-1) + i + 1
  state[1] += val;
  state[1] = numerator / state[1];
}

Anyway, you can get it on npm Or you can check out the github

Using is simple:

const tifu = require('tifuhash');

const message = 'The medium is the message.';
const number = 333333333;
const float = Math.PI;

console.log( tifu.hash( message ), 
  tifu.hash( number ),
  tifu.hash( float ),
tifu.hash( ) );

There's a demo of some hashes on runkit here https://runkit.com/593a239c56ebfd0012d15fc9/593e4d7014d66100120ecdb9

Side note: I think that in future using floating point,possibly big arrays of floating point calculations, could be a useful way to make more computationally-demanding hash functions in future. A weird side effect I discovered of using floating point is that the hashes are target dependent, and I surmise maybe they could be use to fingerprint the platforms they were calculated on.

Gavotte answered 12/6, 2017 at 8:12 Comment(0)
M
1

I believe it is important to remember the following. When you implement a hash function to be used in a hash table, you must also provide the equality operator to go together, and they must be in sync.

Suppose you have implemented a hash function like this:

uint32_t hash(float f);

and provided an equality operator too:

bool equal(float f1, float f2);

You need to ensure that they behave like this:

if hash(v1) != hash(v2) then equal(v1, v2) == false and also

if equal(v1, v2) == true then hash(v1) == hash(v2).

As mentioned in other posts, floating point types have special values that mess up the hashing, such as -0, +0, quiet NAN, signaling NAN. These have different binary representation. If you where to choose a hash function that casts bits of float to an 32 bit int, then you cannot use only regular boolean operator == to implement equal(), because std::bit_cast<float, uint32_t>(-0) != std::bit_cast<float, uint32_t>(+0), but -0 == +0. At the same time std::bit_cast<float, uint32_t>(QNAN) == std::bit_cast<float, uint32_t>(QNAN), while QNAN == QNAN is false.

Usually we don't want to distinguish +0 and -0, so we need to filter them out before bit_cast, and also we want to account for accidental NAN value getting into out data, then we would have to check if both values are NAN and return true.

Assuming that you have a good hash for uint32_t, say called hash_uint32, the following combination of hash and equals should work well together:

uint32_t hash(float f) 
{ 
  float normalized_f = f != 0 ? f : 0.0;
  return hash_uint32(std::bit_cast<float, uint32_t>(normalized_f));
}

bool equal(float f1, float f1) 
{
  return f1 == f2 || 
    std::bit_cast<float, uint32_t>(f1) == std::bit_cast<float, uint32_t>(f2);
}

These ensure that -0 and +0 are treated as same number and also that strange NAN values would not break your hash table forever.

Maloney answered 10/10, 2023 at 23:45 Comment(0)
M
0

Because of the IEEE byte ordering the Java Float.hashCode() and Double.hashCode() do not give good results. This problem is wellknown and can be adressed by this scrambler:

class HashScrambler {

    /**
     * https://sites.google.com/site/murmurhash/
     */
    static int murmur(int x) {
        x ^= x >> 13;
        x *= 0x5bd1e995;
        return x ^ (x >> 15);
    }

}

You then get a good hash function, which also allows you to use Float and Double in hash tables. But you need to write your own hash table that allows a custom hash function.

Since in a hash table you need also test for equality, you need an exact equality to make it work. Maybe the later is what President James K. Polk intends to adress?

Michaelemichaelina answered 12/12, 2020 at 0:18 Comment(2)
OP asked for C++, not javaFarwell
@Farwell I can help in this difficult task of porting from Java to C++ =) unsigned hashf(float x) { union { float f; unsigned u; }; f = x; u ^= u >> 13; u *= 0x5bd1e995; return u ^ (u >> 15); }Smtih

© 2022 - 2024 — McMap. All rights reserved.