Good hash function for a 2d index
Asked Answered
L

3

16

I have a struct called Point. Point is pretty simple:

struct Point
{
    Row row;
    Column column;

    // some other code for addition and subtraction of points is there too
}

Row and Column are basically glorified ints, but I got sick of accidentally transposing the input arguments to functions and gave them each a wrapper class.

Right now I use a set of points, but repeated lookups are really slowing things down. I want to switch to an unordered_set.

So, I want to have an unordered_set of Points. Typically this set might contain, for example, every point on a 80x24 terminal = 1920 points. I need a good hash function. I just came up with the following:

struct PointHash : public std::unary_function<Point, std::size_t>
{
    result_type operator()(const argument_type& val) const
    {
        return val.row.value() * 1000 + val.col.value();
    }
};

However, I'm not sure that this is really a good hash function. I wanted something fast, since I need to do many lookups very quickly. Is there a better hash function I can use, or is this OK?

Lee answered 14/4, 2010 at 3:25 Comment(0)
A
20

Following the technique is given in Effective Java (2nd edition), and quoted from there in Programming in Scala. Have a prime constant (we'll say 53 but you may find something larger will give more even distribution here), and perform multiplication and addition as follows:

(53 + int_hash(row)) * 53 + int_hash(col)

For more values (say you add a z coordinate), just keep nesting, like

((53 + int_hash(row)) * 53 + int_hash(col)) * 53 + int_hash(z)

Where int_hash is a function for hashing a single integer. You can visit this page to find a bunch of good hash functions for single integers.

Aleuromancy answered 14/4, 2010 at 3:32 Comment(0)
L
2

I guess doing a bitshift by 10 instead would be more efficient than multiplying by 1000.

return (val.row.value()<<10) + val.col.value();
Logia answered 14/4, 2010 at 3:28 Comment(6)
Don't prematurely optimize. #1 This is a microoptimization, and it's unlikely to save you much time. #2 it just makes the code more obscure. #3 If your compiler's smart, you can pick a number like 1024 instead of 1000, and your compiler will do the optimization automatically if it actually makes sense on your machine's instruction set.Aleuromancy
@Ken: I do agree with not prematurely optimizing in general, but for a simple hashing function I don't agree. It is a hash function or a mathematical function.Logia
Also, it doesn't matter whether val.row.value() * 1000 is greater than val.column.value() because this is a hash code, so the only reason to compute it is to put the points in a somewhat random location in the hash table. Having overlaps and things like that helps matters.Aleuromancy
I agree that it's unlikely to make much (or any) performance difference, but IMO the bitshift form is clearer because it's exactly what he's trying to do - he doesn't actually want to know what the result is when multiplied by 1000, he wants to move some bits out of the way of others which is what a bitshift indicates. I'd find that a lot more intuitive if trying to debug a hash function than a multiply.Wales
@Brian Just to prove point #2, you got the operator precedence wrong. Your code is equivalent to val.row.value() << (10 + val.col.value()) (which would be a very bad hash function indeed, since most values would map to bucket 0 after a modulus is taken). This is why it's not a good idea to mix bitwise operations with arithmetic operations, and why it's not a good idea to prematurely optimize in general.Aleuromancy
@Ken: fixed, but I don't agree with any of your points. But that's why you downvoted me, basically I don't see this as an optimization and don't see why multiplication is more clear. If you want 4x the apples then you would use multiplication, if you are computing a mathematical function then who cares. Anyway, moving on...Logia
A
2

With a small enough domain, you might be able to come up with a perfect hash function. Or perhaps just use a 2 dimensional array. For larger data amounts, use a prime number based multiplication and mod to your table size (and if your table is a base 2 number in size). This eliminates the divide/mod that can be costly on smaller, embedded type systems.

Or find any number of integer based hash functions that already exist. Make sure you measure any hash function you create for collision. Enough collisions will eliminate any gains over O(n log n) methods such as maps/trees.

Attire answered 14/4, 2010 at 4:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.