Can Java's hashCode produce same value for different strings?
Asked Answered
V

12

47

Is it possible to have same hashcode for different strings using java's hashcode function?or if it is possible then what is the % of its possibility?

Vernita answered 11/4, 2012 at 8:22 Comment(0)
C
70

A Java hash code is 32bits. The number of possible strings it hashes is infinite.

So yes, there will be collisions. The percentage is meaningless - there is an infinite number of items (strings) and a finite number of possible hashes.

Conte answered 11/4, 2012 at 8:26 Comment(6)
So, can i say that it can produce 2^32 different hashes and after that it will repeat the hashcodes??Vernita
If you manage to identify 2^32 strings that all have a different hashcode, then yes, any other string not in that list will have the same hashcode as one in that list.Conte
On a side note, this is called the pigeonhole principle en.wikipedia.org/wiki/Pigeonhole_principleGeldens
You'd probably go through much fewer than 2^32 strings (around 2^16 strings) before hitting a collision. The reason why is related to the birthday paradox: betterexplained.com/articles/understanding-the-birthday-paradoxOverflight
"The number of possible strings it hashes is infinite." Strings in Java have a maximum size because they use a char array and arrays in Java (using the standard JVM) have a maximum size. Therefore the number of possible strings is not infinite.Doering
You CAN associate a chance to a collision. There are 2^32 possible hashcodes, so the chance of two different Strings having the same hashcode is 1/2^32, that is one in about 4 billion. According the birthday paradox principle, you need to make about 77,162 Strings to have a collision, see: en.wikipedia.org/wiki/Birthday_attackLiquate
S
33

YES. A lot.

Look at following pair

  • "FB" and "Ea"

can return same hash code even though the characters in it are not same.

Basically it is the sum of characters in a string multiplied by an integer.

Shafer answered 11/4, 2012 at 8:26 Comment(4)
That is incorrect. Each character is multiplied by a different number so anagrams would not necessarily return the same value.Payday
Sorry my bad! Corrected with a common example.Shafer
Why same hashcode for them? They are two different strings... :SVernita
@Vernita The String hashCode Javadocs define the algorithm used to generate the hash code for a string: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]. If you use this algorithm for both "FB" and "Ea", the result is 2236 for each.Ible
I
9

if it is possible then what is the % of its possibility?

That is not a particularly meaningful question.

However, unless there is some systemic bias in the String::hashcode function or the way that you are generating the String objects, the probability that any two randomly generated String objects will have the same hash code will be 1 in 232.

This assumes that the Strings are chosen randomly from the set of all possible String values. If you restrict the set in various ways, the probability will vary from the above number.

(For instance, the existence of the "FB" / "Ea" collision means that the probability of a collision in the set of all 2 letter strings is higher than the norm. And we could chose a set such that the probability for a hash collision for non-equal strings is zero. Trivially the set "A" and "B".)


Another thing to note is that the chance of 232 different strings chosen at random (from a much larger unbiased set of strings) having no hash collisions is vanishingly small. To understand why, read the Wikipedia page on the Birthday Paradox.

In reality, the only way you are going to get no hash collisions in a set of 232 different strings is if you select or generate the strings in a way that is designed to avoid this. Even forming the set by selecting randomly generated strings and rejecting colliders is going to be computationally expensive. To produce such a set efficiently, you would need to exploit the properties of the String::hashCode algorithm, which (fortunately) is specified.

Ia answered 11/4, 2012 at 8:41 Comment(3)
So, can i say that for 2^32 different strings, the hashcode function will always produce different hashcode?Vernita
@Zara Actually it even says quite the opposite! Having 2^32 different strings you will most likely have a collision (or even several..).Elinorelinore
@jory - yes you are right. It's an example of the birthday paradox. (It is not entirely impossible that 2^32 different randomly generated strings will all have different hashcodes. Just vanishingly improbable.)Ia
T
8

Yes, it is entirely possible. The probability of a string (or some other object type -- just assuming you'll be using strings in this example) having the same hashcode as some other string in a collection, depends on the size of that collection (assuming that all strings in that collection are unique). The probabilities are distributed as follows:

  • With a set of size ~9,000, you'll have a 1% chance of two strings colliding with a hash in the set
  • With a set of size ~30,000, you'll have a 10% chance of two strings colliding with a hash in the set
  • With a set of size ~77,000, you'll have a 50% chance of two strings colliding with a hash in the set

The assumptions made, are:

  • The hashCode function has no bias
  • Each string in the aforementioned set is unique

This site explains it clearly: http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/ (Look at "the second thing you should know")

Taxable answered 19/6, 2014 at 11:47 Comment(1)
What's the set of characters for the strings that they've tested there?Consonantal
T
7

This wouldn't directly answer your question, but I hope it helps.

The below is from the source code of java.lang.String.

/**
 * Returns a hash code for this string. The hash code for a
 * <code>String</code> object is computed as
 * <blockquote><pre>
 * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
 * </pre></blockquote>
 * using <code>int</code> arithmetic, where <code>s[i]</code> is the
 * <i>i</i>th character of the string, <code>n</code> is the length of
 * the string, and <code>^</code> indicates exponentiation.
 * (The hash value of the empty string is zero.)
 *
 * @return  a hash code value for this object.
 */
public int hashCode() {
    int h = hash;
    int len = count;
    if (h == 0 && len > 0) {
    int off = offset;
    char val[] = value;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
Townswoman answered 11/4, 2012 at 8:25 Comment(0)
M
6

Yes, it is possible for two Strings to have the same hashcode. If you take a look at the now-deleted Wikipedia article, you will see that both "FB" and "Ea" have the same hashcode. There is nothing in the method contract saying a hashCode() should be used to compare for equality, you want to use equals() for that.

Since Java 1.2, String implements hashCode() by using a product sum algorithm over the entire text of the string.

Monger answered 11/4, 2012 at 8:30 Comment(0)
S
4

Yes, by definition of the pigeon-hole concept, two different strings can produce the same hashcode and code should always be written to cater for such conditions (typically, by not breaking.)

Stopper answered 11/4, 2012 at 8:25 Comment(0)
A
3

The percentage of collisions for random strings should be minimal. However, if you hash strings from external sources, an attacker could easily create hundreds of thousands of strings having the same hashcode. In a java HashMap these would all map to the same bucket and effectively turn the map into a linked list. Access times to the map would then be proportional to the map size instead of constant, leading to a denial of service attack.

See this page on Effective DoS attacks against Web Application Plattforms for further information links to the presentation.

Airboat answered 11/4, 2012 at 11:28 Comment(1)
The risk of serious damage by those attacks can be mitigated by Java 8's improvement over HashMap when the number of collisions is high in hashcodes that map to the same bucket.Tonsillectomy
R
2

Yes this is possible, because one of the contract between equals() & hashCode() method of Object class is.......... If two object are not equal according to equals() method then there is no guaranty that their hashCode will be same, the hashCode may/may not be equal. i.e, if obj1.equals(obj2) return false then obj1.hashCode()==obj2.hashCode() may/may not return true. Example:

    String str1 = "FB";
    String str2 = "Ea";
    System.out.println(str1.equals(str2));// false
    System.out.println(str1.hashCode() == str2.hashCode()); // true
Reproof answered 1/6, 2018 at 9:19 Comment(2)
Can u explain why is it soColorant
Because this is one of the contract between equals() and hashCode() method. If two object are not equal according to equals() method then there is no guaranty that their hashCode will be same. Please have look the java doc docs.oracle.com/javase/7/docs/api/java/lang/…Reproof
F
1

//You can run the below code with -Xmx2100m and can get multiple results, enough to fill your console

`

import java.util.HashMap;

public class TestHashCollision {
        public static void main(String[] args) {
        final String TEXT = "was stored earlier had the same hash as";
        HashMap<Integer,String> hs=new HashMap<>();
        long t1=System.currentTimeMillis();
        long t2=System.currentTimeMillis();
        for(long l=0;l<Long.MAX_VALUE;l++) {
            String key="d"+l;
            if(hs.containsKey(key.hashCode())) {
                System.out.println("'"+hs.get(key.hashCode())+"' "+TEXT+" '"+key+"'");//System.exit(0);
            } else {
                hs.put(key.hashCode(),key);
            }
            t2=System.currentTimeMillis();
            if(t2-t1>10000) {
                t1=System.currentTimeMillis();
                System.out.println("10 seconds gone! size is:"+hs.size());
            }
        }
        System.out.println("Done"); 
    }
}

`

Food answered 20/12, 2018 at 23:9 Comment(0)
C
1

Yes(not just in Java, it applies to any language), it can produce the same hash-code for different strings. I am recalling a rule taught by my professor, it might be useful here -

Two same strings/value must have the same hashcode, but the converse is not true.

example in python

>>> hash('same-string')
-5833666992484370527
>>> hash('same-string')
-5833666992484370527

There might be another string which can match the same hash-code, so we can't derive the key using hash-code.

The reason for two different string to have the same hash-code is due to the collision. enter image description here

Ciceronian answered 14/1, 2019 at 19:58 Comment(0)
B
1
"tensada".hashCode()
"friabili".hashCode());

The java hash function return equal values here.

Brooklet answered 12/10, 2020 at 17:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.