Good Hash Function for Strings

Asked 12/4, 2010 at 17:57 Answered 20/5, 2018 at 18:48

192

I'm trying to think up a good hash function for strings. And I was thinking it might be a good idea to sum up the unicode values for the first five characters in the string (assuming it has five, otherwise stop where it ends). Would that be a good idea, or is it a bad one?

I am doing this in Java, but I wouldn't imagine that would make much of a difference.

Adriene answered 12/4, 2010 at 17:57 Comment(8)

Good hash functions depend heavily on the input to the hash, and the requirements of the algorithm. Such a hash will not be very good if all your strings start with the same five characters, for example. It will also tend to result in a normal distribution. – Recluse 12/4, 2010 at 17:58

Possible duplicate of 98153 – Occupation 12/4, 2010 at 17:59

Why can't you use String's own hashCode()? – Minefield 12/4, 2010 at 18:1

@WhirlWind, true, I'm not sure what the strings will have, other than that it will probably english text. – Adriene 12/4, 2010 at 18:3

@Barl, mainly because my professor told us to implement our own hash functor...and the reason I didn't want to use Java's, was because it was generic, and I would imagine a more specific hash functor would be better. – Adriene 12/4, 2010 at 18:4

Does this answer your question? What's the best hashing algorithm to use on a stl string when using hash_map? – Carlita 11/1, 2020 at 13:33

I have just been fighting against a mysterious bug for several hours. Turned out it's a hash collision with String.hashCode(). Hashing to int seems a bad idea. – Weatherman 6/4, 2020 at 0:16

See also this Q&A with lots of good hash algorithms mentioned here: hash function for string – Edana 23/2, 2022 at 6:40

183

Usually hashes wouldn't do sums, otherwise stop and pots will have the same hash.

and you wouldn't limit it to the first n characters because otherwise house and houses would have the same hash.

Generally hashs take values and multiply it by a prime number (makes it more likely to generate unique hashes) So you could do something like:

int hash = 7;
for (int i = 0; i < strlen; i++) {
    hash = hash*31 + charAt(i);
}

Piteous answered 12/4, 2010 at 18:1 Comment(5)

@Piteous How can you say that it always gives you a unique hash key. Is there any mathematical proof ? I think we have to take mod of hash with another bigger prime number, otherwise overflow problem occurs. – Tellurite 4/9, 2013 at 5:56

@Tellurite He didn't say always unique, he said more likely to be unique. As for why, a quick search on google reveals this article: computinglife.wordpress.com/2008/11/20/… explaining why 31 was used for Java string hashing. There is no mathematical proof given, but it does explain the general concept as to why primes work better. – Medrek 24/9, 2013 at 0:39

Thanks a lot for clarifying the idea of doing better hashing. Just to double check - The hashCode() return value will be used by Java to map to some table index before storing the object. So, if the hashCode() returns m, it does something like (m mod k) to get an index of the table of size k. Is that right? – Odonto 17/10, 2015 at 23:6

This was amazing, you have no idea how much it helped. I understand that using 31 gives the best unique results, but is the 7 also the best possible? Or did you just pick a random prime number? – Courtund 26/4, 2021 at 18:48

I decreased my collisions by taking the final result mod the length of the string. (Im working in python so I had to change it up a little bit) – Artillery 6/10, 2021 at 21:0

162

If it's a security thing, you could use Java crypto:

import java.security.MessageDigest;

MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(stringToHash.getBytes());
String stringHash = new String(messageDigest.digest());

Clausius answered 12/4, 2010 at 18:28 Comment(8)

Nice. I have a machine-learning application, doing statistical NLP over a large corpus. After a few initial passes of morphological normalization on the original words in the text, I throw away the string values and use hash codes instead. Throughout my entire corpus, there are about 600,000 unique words, and using the default java hashcode function, I was getting about 3.5% collisions. But if I SHA-256 the string value and then generate a hashcode from the digested string, the collision ratio is less than 0.0001%. Thanks! – Sensorium 18/11, 2010 at 18:54

@Sensorium One in a million is far too large... is "less than 0.0001%" an oblique way of saying "exactly 0"? I really doubt that you saw a SHA-256 collision because that has never been observed, anywhere, ever; not even for 160-bit SHA-1. If you have two strings that produce the same SHA-256, the security community would love to see them; you'll be world-famous... in a very obscure way. See Comparison of SHA Functions – Bicarb 26/2, 2014 at 1:3

@TimSylvester, you misunderstood. I didn't find SHA-256 collisions. I computed the SHA-256 and then fed the resultant byte sequences into a typical Java "hashCode" function, because I needed a 32-bit hash. That's where I found the collisions. Nothing remarkable :) – Sensorium 27/2, 2014 at 5:9

Isn't there a difference between 'hashing' and 'encrypting'? I understand MessageDigest is a one way hashing function, right? Also, when I used the function, I got the hashed string as a lot of junk UTF characters when I opened the file in LibreOffice. Is it possible to get the hashed string as a random bunch of alphanumeric characters instead of junk UTF characters? – Phonetist 13/11, 2016 at 17:54

String encryptedString and stringToEncrypt.getBytes() refer to encryption, when this really is a hashing algorithm. – Monovalent 10/5, 2017 at 17:43

My generated values look like this: "��'x�-E#��G��]%�c�3��u�" , is this normal thing and can I rely on this? – Faustina 21/11, 2017 at 4:44

if you need reasonable characters, try for instance base64 (I use apache commons) – Hager 24/2, 2018 at 7:7

you can convert the byte array into hex string to get an understandable string. here is how you do it. new BigInteger(1,byteArr).toString(16) – Pleuron 6/4, 2021 at 14:38

You should probably use String.hashCode().

If you really want to implement hashCode yourself:

Do not be tempted to exclude significant parts of an object from the hash code computation to improve performance -- Joshua Bloch, Effective Java

Using only the first five characters is a bad idea. Think about hierarchical names, such as URLs: they will all have the same hash code (because they all start with "http://", which means that they are stored under the same bucket in a hash map, exhibiting terrible performance.

Here's a war story paraphrased on the String hashCode from "Effective Java":

The String hash function implemented in all releases prior to 1.2 examined at most sixteen characters, evenly spaced throughout the string, starting with the first character. For large collections of hierarchical names, such as URLs, this hash function displayed terrible behavior.

Jailer answered 12/4, 2010 at 18:9 Comment(1)

If one is using a double-hashed collection, it may be worthwhile to have the first hash be really quick and dirty. If one has a thousand long strings, half of which a are mapped by a crummy function to one particular value, and half of which are mapped to distinct values, performance in a single-hashed table would be bad, but performance in a double-hashed table, where the second hash examined the entire string, could be almost twice that of a singly-hashed table (since half the strings wouldn't have to be fully hashed). None of the standard Java collections do double hashing, though. – Ostentation 7/3, 2014 at 23:37

If you are doing this in Java then why are you doing it? Just call .hashCode() on the string

Pelt answered 12/4, 2010 at 18:1 Comment(3)

I'm doing it as part of the class, and part of the assignment is to write up several different hash functions. The professor told us to get outside help for the 'better' ones. – Adriene 12/4, 2010 at 18:6

If you need your has to be consistent across JVM versions and implementations, you should not rely on .hashCode(). Rather, use some known algorithm. – Berwickupontweed 26/3, 2013 at 19:9

The algorithm for String::hashCode is specified in the JDK, so it's as portable as the very existence of the class java.lang.String. – Chronaxie 9/9, 2015 at 12:31

Guava's HashFunction (javadoc) provides decent non-crypto-strong hashing.

Griqua answered 17/1, 2013 at 20:57 Comment(0)

This function provided by Nick is good but if you use new String(byte[] bytes) to make the transformation to String, it failed. You can use this function to do that.

private static final char[] hex = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f' };

public static String byteArray2Hex(byte[] bytes) {
    StringBuffer sb = new StringBuffer(bytes.length * 2);
    for(final byte b : bytes) {
        sb.append(hex[(b & 0xF0) >> 4]);
        sb.append(hex[b & 0x0F]);
    }
    return sb.toString();
}

public static String getStringFromSHA256(String stringToEncrypt) throws NoSuchAlgorithmException {
    MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
    messageDigest.update(stringToEncrypt.getBytes());
    return byteArray2Hex(messageDigest.digest());
}

May be this can help somebody

Leitman answered 10/10, 2012 at 9:28 Comment(1)

You could just pass the byte array to messageDigest.update(). – Ideologist 5/3, 2015 at 13:59

FNV-1 is rumoured to be a good hash function for strings.

For long strings (longer than, say, about 200 characters), you can get good performance out of the MD4 hash function. As a cryptographic function, it was broken about 15 years ago, but for non cryptographic purposes, it is still very good, and surprisingly fast. In the context of Java, you would have to convert the 16-bit char values into 32-bit words, e.g. by grouping such values into pairs. A fast implementation of MD4 in Java can be found in sphlib. Probably overkill in the context of a classroom assignment, but otherwise worth a try.

Earthward answered 12/4, 2010 at 21:37 Comment(1)

This hash function is so much better then the one that comes with java. – Trappings 17/11, 2019 at 11:29

// djb2 hash function
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

source Logic behind djb2 hash function - SO

Pipestone answered 12/4, 2010 at 18:2 Comment(1)

I think it's just a prime number to start at, so that we have fewer collisions. – Reticule 6/12, 2013 at 3:39

If you want to see the industry standard implementations, I'd look at java.security.MessageDigest.

"Message digests are secure one-way hash functions that take arbitrary-sized data and output a fixed-length hash value."

Bedeck answered 12/4, 2010 at 18:35 Comment(0)

sdbm:this algorithm was created for sdbm (a public-domain reimplementation of ndbm) database library

static unsigned long sdbm(unsigned char *str)
{   
    unsigned long hash = 0;
    int c;
    while (c = *str++)
            hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

Shantay answered 19/3, 2014 at 18:31 Comment(0)

         public String hashString(String s) throws NoSuchAlgorithmException {
    byte[] hash = null;
    try {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        hash = md.digest(s.getBytes());

    } catch (NoSuchAlgorithmException e) { e.printStackTrace(); }
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < hash.length; ++i) {
        String hex = Integer.toHexString(hash[i]);
        if (hex.length() == 1) {
            sb.append(0);
            sb.append(hex.charAt(hex.length() - 1));
        } else {
            sb.append(hex.substring(hex.length() - 2));
        }
    }
    return sb.toString();
}

Mcphee answered 29/1, 2013 at 2:31 Comment(0)

This will avoid any collision and it will be fast until we use the shifting in calculations.

 int k = key.length();
    int sum = 0;
    for(int i = 0 ; i < k-1 ; i++){
        sum += key.charAt(i)<<(5*i);
    }

Bazil answered 20/5, 2018 at 18:48 Comment(0)

-1

Its a good idea to work with odd number when trying to develop a good hast function for string. this function takes a string and return a index value, so far its work pretty good. and has less collision. the index ranges from 0 - 300 maybe even more than that, but i haven't gotten any higher so far even with long words like "electromechanical engineering"

int keyHash(string key)
{
    unsigned int k = (int)key.length();
    unsigned int u = 0,n = 0;

    for (Uint i=0; i<k; i++)
    {
        n = (int)key[i];
        u += 7*n%31;
    }
    return u%139;
}

another thing you can do is multiplying each character int parse by the index as it increase like the word "bear" (0*b) + (1*e) + (2*a) + (3*r) which will give you an int value to play with. the first hash function above collide at "here" and "hear" but still great at give some good unique values. the one below doesn't collide with "here" and "hear" because i multiply each character with the index as it increases.

int keyHash(string key)
{
    unsigned int k = (int)key.length();
    unsigned int u = 0,n = 0;

    for (Uint i=0; i<k; i++)
    {
        n = (int)key[i];
        u += i*n%31;
    }
    return u%139;
}

Senaidasenalda answered 6/2, 2014 at 21:42 Comment(0)

-1

Here's a simple hash function that I use for a hash table I built. Its basically for taking a text file and stores every word in an index which represents the alphabetical order.

int generatehashkey(const char *name)
{
        int x = tolower(name[0])- 97;
        if (x < 0 || x > 25)
           x = 26;
        return x;
}

What this basically does is words are hashed according to their first letter. So, word starting with 'a' would get a hash key of 0, 'b' would get 1 and so on and 'z' would be 25. Numbers and symbols would have a hash key of 26. THere is an advantage this provides; You can calculate easily and quickly where a given word would be indexed in the hash table since its all in an alphabetical order, something like this: Code can be found here: https://github.com/abhijitcpatil/general

Giving the following text as input: Atticus said to Jem one day, “I’d rather you shot at tin cans in the backyard, but I know you’ll go after birds. Shoot all the blue jays you want, if you can hit ‘em, but remember it’s a sin to kill a mockingbird.” That was the only time I ever heard Atticus say it was a sin to do something, and I asked Miss Maudie about it. “Your father’s right,” she said. “Mockingbirds don’t do one thing except make music for us to enjoy. They don’t eat up people’s gardens, don’t nest in corn cribs, they don’t do one thing but sing their hearts out for us. That’s why it’s a sin to kill a mockingbird.

This would be the output:

0 --> a a about asked and a Atticus a a all after at Atticus
1 --> but but blue birds. but backyard
2 --> cribs corn can cans
3 --> do don’t don’t don’t do don’t do day
4 --> eat enjoy. except ever
5 --> for for father’s
6 --> gardens go
7 --> hearts heard hit
8 --> it’s in it. I it I it’s if I in
9 --> jays Jem
10 --> kill kill know
11 --> 
12 --> mockingbird. music make Maudie Miss mockingbird.”
13 --> nest
14 --> out one one only one
15 --> people’s
16 --> 17 --> right remember rather
18 --> sin sing said. she something sin say sin Shoot shot said
19 --> to That’s their thing they They to thing to time the That to the the tin to
20 --> us. up us
21 --> 
22 --> why was was want
23 --> 
24 --> you you you’ll you
25 --> 
26 --> “Mockingbirds ” “Your ‘em “I’d

Tongue answered 1/6, 2017 at 9:25 Comment(1)

A good hash function distributes the values equally across the buckets. – Janaye 22/12, 2017 at 21:34

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags