What is a good 64-bit hash function in Java for textual strings?

I'm looking for a hash function that:

  1. Hashes textual strings well (e.g. few collisions)
  2. Is written in Java, and widely used
  3. Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
  4. Bonus: Has a 128-bit variant.
  5. Bonus: Not CPU intensive.
Anana answered 2/11, 2009 at 10:35 Comment(1)
The following link has several implementations of general purpose hash functions that are efficient and exhibit minimal collisions: partow.net/programming/hashfunctions/index.htmlZurkow

Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient, not to mention the thousands of developer eyes that have already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31*h + string.charAt(i);
  }
  return h;
}

If you're looking for even more bits, you could probably use a BigInteger.

Edit:

As I mentioned in a comment on @brianegge's answer, there are not many use cases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hash table distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bits (18,446,744,073 billion different keys) you're certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.

To combine the hashes of several fields, multiply one with a small prime and add them:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash codes for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have different hash codes. XOR is bad for this as it returns 0 if both values are equal, so {'foo','foo'} and {'bar','bar'} would get the same hash code.
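That combination rule can be wrapped in a small helper. A minimal sketch reusing the long string hash from above (the varargs overload and the class name are my own):

```java
public class MyHash {

    // long variant of String.hashCode(), as above
    public static long hash(String s) {
        long h = 1125899906842597L; // prime
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // combine several fields: multiply the running hash by a small prime, then add
    public static long hash(String... fields) {
        long h = 0;
        for (String field : fields) {
            h = 31 * h + hash(field);
        }
        return h;
    }
}
```

Because of the multiplication, the combined hash is order-sensitive: hash("foo", "bar") and hash("bar", "foo") differ, unlike a plain XOR of the two field hashes.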

Caw answered 2/11, 2009 at 11:0 Comment(11)
Note that for Strings <= 5 chars, the upper 32 bits will be 0. So this will only make a difference for longer strings.Vaginectomy
Good catch. A higher prime and non-null start value should help.Caw
I've chosen a quite high 64-bit prime as the start value. As a result, hashes for Strings <= 5 chars shouldn't be 0 in the first 32 bits. However, on second thought I doubt that having 32 zeros at the beginning hurts the properties of the hash function. I kept 31 as the second prime as that's what's used in String.hashCode() too (which is still very close to what I suggest here)Caw
String.hashCode is only good for fast calculation (h * 31 is optimised to (h << 5) - h), but it is poor at avoiding textual hash collisions (the strings "FB" and "Ea" collide). Since this 64-bit hash is the same, it has essentially the same weakness.Rattlebrain
Regarding use cases for 64-bit and 128-bit keys: maybe you want a hash as a unique signature for strings, to avoid storing the string content entirely. Here you'd better (a) use a cryptographic-strength hash, and (b) not forget the "birthday problem" (ask Wikipedia) when selecting the hash size. Even a "good" 64-bit hash has to be assumed to have a lower number of bits of "security" (say 32), so together with the BP it starts to look dodgy as a unique signature even for sets of over 100,000 strings. There's a reason git went with 160-bit SHA-1 :-)Rattlebrain
I am trying to generate a hash code from a password string I get from an input, and this method gives a long number. It is not the same as the MD5 hash I received for the string when using an online string-to-MD5 generator. Can someone show me how I can get the MD5 hash of a string in Java? For example, if I use an online generator for the string "hello", I get a mixture of letters and numbers: 5d41402abc4b2a76b9719d911017c592. The method above gives me something like 6487232239950166829.Chobot
@LukeUsherwood You are misinformed. Multiplication is faster in interpreted languages because they get optimized down. If you use shifts like that in an interpreted language, it is slower because it isn't going to be as optimized. This Java hash function is very bad, over 40 years old, and should be obsoleted by newer, better functions.Danilodanio
sorry -1 but bad formula. I tried (at random), and get easily collisions: 0AS and 0B4 -3351804022671247779Rancher
@guillaumegirod-vitouchkina fair enough, I haven't been too high on this answer since I wrote it in 2009. I've actually thought about deleting it a couple of times already, but since no other answer has gained significant support I thought I'd leave it as is. To be fair though, String.hashCode() has a collision for the same values too. The same goes for all strings with a common prefix followed by either "AS" or "B4", actually.Caw
@guillaumegirod-vitouchkina The reason is that the difference between '4' and 'S' is 31, the multiplier used in String.hashCode(). It would probably make sense to use a multiplier that's a prime > Character.MAX_VALUE, but that deviates from the original implementation quite a bit, especially in performance. See https://mcmap.net/q/30351/-why-does-java-39-s-hashcode-in-string-use-31-as-a-multiplierCaw
@guillaumegirod-vitouchkina so the collision will be there in String.hashCode() for any 2 strings with any common prefix and ending in "str" + c1 + c2 and "str" + (c1 + 1) + (c2 - 31).Caw

An answer for today (2018): SipHash.

It will be much faster than most of the answers here, and significantly higher quality than all of them.

The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--
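For reference, the core algorithm is compact enough to sketch directly from the published SipHash-2-4 specification. This is a from-scratch sketch, not Guava's implementation; the class and method names are mine:

```java
import java.nio.charset.StandardCharsets;

public final class SipHash24 {

    private static long rotl(long x, int b) {
        return (x << b) | (x >>> (64 - b));
    }

    // one SipRound over the four 64-bit state words
    private static void sipRound(long[] v) {
        v[0] += v[1]; v[1] = rotl(v[1], 13); v[1] ^= v[0]; v[0] = rotl(v[0], 32);
        v[2] += v[3]; v[3] = rotl(v[3], 16); v[3] ^= v[2];
        v[0] += v[3]; v[3] = rotl(v[3], 21); v[3] ^= v[0];
        v[2] += v[1]; v[1] = rotl(v[1], 17); v[1] ^= v[2]; v[2] = rotl(v[2], 32);
    }

    /** SipHash-2-4 of data under the 128-bit key (k0, k1). */
    public static long hash(long k0, long k1, byte[] data) {
        long[] v = {
            k0 ^ 0x736f6d6570736575L, k1 ^ 0x646f72616e646f6dL,
            k0 ^ 0x6c7967656e657261L, k1 ^ 0x7465646279746573L
        };
        int len = data.length;
        int i = 0;
        for (; i + 8 <= len; i += 8) {          // full 8-byte little-endian blocks
            long m = 0;
            for (int j = 7; j >= 0; j--) {
                m = (m << 8) | (data[i + j] & 0xffL);
            }
            v[3] ^= m;
            sipRound(v); sipRound(v);           // c = 2 compression rounds
            v[0] ^= m;
        }
        long last = ((long) (len & 0xff)) << 56; // final block carries len mod 256 in the top byte
        for (int j = i; j < len; j++) {
            last |= (data[j] & 0xffL) << (8 * (j - i));
        }
        v[3] ^= last;
        sipRound(v); sipRound(v);
        v[0] ^= last;
        v[2] ^= 0xff;                            // finalization, d = 4 rounds
        sipRound(v); sipRound(v); sipRound(v); sipRound(v);
        return v[0] ^ v[1] ^ v[2] ^ v[3];
    }

    /** Convenience overload for strings. */
    public static long hash(long k0, long k1, String s) {
        return hash(k0, k1, s.getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that SipHash is keyed: unlike String.hashCode(), an attacker who doesn't know the key can't construct colliding inputs, which is the main reason it's used for hash-table DoS protection.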

Bis answered 16/1, 2018 at 6:43 Comment(1)
The accepted answer (Java's string hash) is insultingly simple, to the point where it is a poor hash. SipHash is in many ways over-engineered, overcomplicated, and not very fast in interpreted languages due to the sheer number of required operations and rounds. MurmurHash sits in the middle: it is fast, but not extremely weak.Danilodanio

Create an SHA-1 hash and then keep only the lowest 64 bits.
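A sketch of that idea (the class name is mine; taking the first 8 of SHA-1's 20 digest bytes, big-endian, is one arbitrary way of keeping 64 bits):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Hash64 {

    public static long hash(String s) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(s.getBytes(StandardCharsets.UTF_8));
            // keep only 64 of the 160 digest bits
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // every compliant JRE is required to provide SHA-1
            throw new IllegalStateException(e);
        }
    }
}
```

Since SHA-1 output is uniformly distributed, any 64-bit slice of the digest is as good as any other.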

Vaginectomy answered 2/11, 2009 at 10:49 Comment(5)
You could even do an XOR of the first and last 64 bits. But isn't a SHA-1 hash a little over the top? If a cryptographically secure hash isn't necessary, you definitely lost some points on requirement 5 ;)Caw
@sfussenegger: Don't try to add random randomness. XOR doesn't always help. Even clipping the hash can have unpredictable results. Give it a try with a few million test cases or understand the mathematics behind it. Otherwise, you just make things worse with such a "blind improvement".Vaginectomy
It's not about adding random randomness. The idea was simply to keep all bits of the SHA-1 hash, which is designed to be uniformly distributed, so there shouldn't be any unexpected side effects. On second thought, though, it's useless overhead in the end. Clipping doesn't have unpredictable results either; that's exactly what e.g. HashMap.indexFor(int,int) does to map a hash to an index of the hash table. So it really doesn't matter if you clip a 128-bit hash to 64 bits, as it will be clipped further to fit the hash table anyway.Caw
It really depends on the properties of the bits. Folding the hash this way can create unexpected clumps which shouldn't happen with clipping. But clipping can also fail if the various strings produce very similar lower 32 bits. That's why it's usually better not to try to "improve" an existing algorithm without knowing its exact properties. I'm only saying this because I once created a hash which didn't hash very well :)Vaginectomy
I think we all did it at least once ;) I'd be very interested to see if either approach causes problems with SHA-1. Maybe somebody finds some time to run a simple test?Caw
long hash = string.hashCode();

Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.

Update: I think the above satisfies the simplest thing which could possibly work; however, I agree with @sfussenegger's idea of extending the existing String hashCode.

In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distribute your keys. For example, Java's HashMap is based on power-of-two-length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.

    static int rehash(int h) {
        // supplemental hash used by java.util.HashMap (JDK 6) to spread high bits downwards
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
Romaine answered 2/11, 2009 at 11:42 Comment(4)
-1 Hardware resources are not the issue here. I didn't specify how the hash value is to be used, but I promise you it's not "storing n values in a hashmap". I will get collisions if I process enough items, long before I run into hardware problems.Anana
I could imagine a huge hash table distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @Romaine still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bits (18,446,744,073 billion different keys) you're almost certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion) is quite impossible though.Caw
The main point here: a couple of people at Sun and all over the world have improved this algorithm over the past ten years. It's unlikely that you can come up with something better without investing at least a week or so doing extensive research on the distribution properties of your strings.Vaginectomy
A hash table with a 32-bit hash will start running into birthday-problem collisions after about 65,536 entries, so a 32-bit hash won't suffice for a hash table with millions of entries. Even concatenated 32-bit hashes don't work; I tried it once and experienced a massive slowdown.Semidome

Why not use a CRC64 polynomial? These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.

There are plenty of implementations available on the net if you google "CRC64 Java".
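For illustration, a bit-at-a-time sketch of one common variant, CRC-64/XZ (the reflected form of the ECMA-182 polynomial; production code would use a precomputed lookup table instead of the inner bit loop):

```java
public class Crc64 {

    // CRC-64/XZ: reflected polynomial, init and final XOR all ones
    private static final long POLY = 0xC96C5795D7870F42L;

    public static long crc64(byte[] data) {
        long crc = ~0L; // initial value: all bits set
        for (byte b : data) {
            crc ^= (b & 0xFFL);
            for (int bit = 0; bit < 8; bit++) {
                // shift right; apply the polynomial whenever a 1 falls off the end
                crc = ((crc & 1L) != 0) ? (crc >>> 1) ^ POLY : (crc >>> 1);
            }
        }
        return ~crc; // final XOR
    }
}
```

Note that a CRC detects transmission errors extremely well but, like the other non-keyed functions here, is trivial to collide deliberately.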

Plot answered 3/6, 2010 at 10:15 Comment(0)

Do something like this:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test {

    public static void main(String[] args) throws NoSuchAlgorithmException,
            IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);

        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();

            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();

            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);

            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123l;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}

DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping it around a ByteArrayOutputStream lets you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.

Finally, BigInteger will let you turn the output bytes into an easier-to-use number. MD5 produces a 128-bit hash and SHA-1 a 160-bit one, so if you need 64 bits you can just truncate.

SHA-1 should hash almost anything well, and with infrequent collisions (it's 160-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1) to show which fields go into a hash and in what order). MD5 gives you a 128-bit variant, and I think you'll find it doesn't use as much CPU as you think it will.

I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that you may want to call MessageDigest.getInstance() only once and then clone() from then on: IIRC cloning is a lot faster.
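The getInstance()-once-then-clone() idea from the last paragraph might look like this (a sketch; the class name is mine, and clone() support depends on the provider, though the built-in MD5 implementation supports it):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestPool {

    private static final MessageDigest PROTOTYPE;

    static {
        try {
            PROTOTYPE = MessageDigest.getInstance("MD5"); // provider lookup happens once
        } catch (NoSuchAlgorithmException e) {
            throw new ExceptionInInitializerError(e); // MD5 is guaranteed on every JRE
        }
    }

    /** Returns a fresh digest by cloning the prototype, skipping the provider lookup. */
    public static MessageDigest fresh() {
        try {
            return (MessageDigest) PROTOTYPE.clone();
        } catch (CloneNotSupportedException e) {
            // some third-party providers are not cloneable; fall back to getInstance() there
            throw new IllegalStateException(e);
        }
    }
}
```

Each clone carries its own internal state, so clones can be used concurrently without sharing the prototype.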

Virulence answered 3/6, 2010 at 10:55 Comment(0)

Reverse the string to get a second 32-bit hash code and then combine the two:

String s = "astring";
long upper = ((long) s.hashCode()) << 32;
long lower = ((long) new StringBuilder(s).reverse().toString().hashCode()) - (long) Integer.MIN_VALUE;
long hash64 = upper + lower;

String itself has no reverse() method, so StringBuilder.reverse() is used here.

Vannoy answered 6/8, 2014 at 7:17 Comment(0)

Have you looked at Apache Commons Lang?

But for 64 bits (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch make it easy to create a 64-bit hash (just use long instead of int). For 128 bits you need additional hacks...

Plasia answered 2/11, 2009 at 11:12 Comment(1)
Commons-lang won't help at all for hashes larger than the standard 32-bit ones. It does those very well, but beyond that not so much.Virulence

I suggest using mzHash64, a very simple, fast function with a number of collisions very close to that of an ideal universal hash function:

public static long mzHash64(byte[] data, int start, int length, long seed) {
    long hash = 0xD45E69F901E72147L ^ seed;

    for (int i = 0; i < length; i++)
        hash = 0x3631754B22FF2D5CL * (i + data[start + i]) ^ (hash << 2) ^ (hash >>> 2);

    return hash;
}

public static long mzHash64(String str) {
    // a fixed charset keeps the hash stable across platforms
    byte[] data = str.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    return mzHash64(data, 0, data.length, 0);
}

Source: https://github.com/matteo65/mzHash64

32 bit version: https://github.com/matteo65/mzHash32

Kentkenta answered 17/3 at 5:59 Comment(1)
Can you edit your answer and provide a sample call to the method mzHash64?Linotype

DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural-language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.

I'm not aware of a function but here's an idea that might help:

  • Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit 0, for 'b' bit 1, and for 'A' bit 26. That way, only text containing exactly the same set of letters would have the same "signature".

You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12-bit hash code using a traditional hashing function.

Assuming your input is text-only, I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike the other solutions so far, this approach takes the problem domain into account to reduce collisions; it is based on the Anagram Detector described in Programming Pearls (see here).
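A sketch of the scheme described above, combining the 52 letter bits with a 12-bit length field (the exact bit layout and the class name are my own choices):

```java
public class WordSignature {

    public static long signature(String word) {
        long letters = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= 'a' && c <= 'z') {
                letters |= 1L << (c - 'a');        // lowercase letters: bits 0-25
            } else if (c >= 'A' && c <= 'Z') {
                letters |= 1L << (26 + c - 'A');   // uppercase letters: bits 26-51
            }
        }
        // 52 letter-set bits on top, 12 length bits below
        return (letters << 12) | (word.length() & 0xFFF);
    }
}
```

As the comments point out, equal-length anagrams such as "rate" and "tear" still collide, since they share both the letter set and the length.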

Teevens answered 2/11, 2009 at 10:47 Comment(10)
-1 The longer the strings get, the more collisions you'll get. Additionally, the hash is weak as (assuming natural language) most strings will contain vowels and frequent consonants (it's even possible to guess the language of a string quite reliably from its frequent consonants and vowels, by the way).Caw
@sfussenegger: The OP mentions the need to hash textual strings well, which implies some upper limit on string length (e.g. the longest word in the English language is only 45 characters). Additionally, the OP makes no mention that this needs to be a secure hash.Teevens
Why does this requirement imply an upper limit on string length? This hash function is extremely weak - regardless of being secure or not.Caw
I read "textual strings" to imply natural language. If this isn't the case then I agree my approach is invalid. Why do you think it's weak?Teevens
I suspect that this algo gives lots of clashes with short words: rate == tear, rose == sore etc etc.Termless
@oxbow_lakes: You're correct. In this case perhaps the remaining 12 bits should be used to store the result of String.hashCode() (modulo 4096).Teevens
a) it doesn't work for special characters, numbers and the like b) see @oxbow_lakes' comment, plus cases with duplicated characters (lose == loose) c) it won't work for long texts as they will have close to all character flags set ... I might even find more reasons, but that should be enough for nowCaw
@Adamski: if String.hashCode() is the strongest part of your hash function, why not replace the whole function with String.hashCode()? If you change it from 32 to 64 bits - voila - you get exactly the approach I've suggested :) ... and I can't believe somebody voted for this suggestion at all.Caw
@sfussenegger: I agree with all your points. My solution assumes that the OP wants to hash individual natural language words.Teevens
@sfussenegger: The approach you suggested is entirely valid. However, I was merely trying to suggest something that could help the OP if their problem domain was sufficiently narrow. If not then sure, they can call String.hashCode(), but given ripper's rep I imagine he already knows how to do this.Teevens
