Can hashcodes of short string be same?
Asked Answered
B

2

5

I have short Strings (less than 10 characters). I will convert it into int and use it as primary key. (I can't use String primary key because of small problems.) I know the hashcodes of Strings of infinite length can collide, but can short Strings collide too?

Britnibrito answered 12/9, 2014 at 1:5 Comment(2)
#10102837Intense
See drmaciver.com/2008/07/javalangstringhashcode - lots of collisions from Scrabble words. Some statistics: "There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, [only] 274 of these strings (or about 7% of them) don’t collide with something else. Oh well. It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects."Intense
E
13

Absolutely yes. For example, Ea and FB are colliding strings, each only two characters in length! Example:

public static final void main(String[] args) {
    System.out.println("Ea".hashCode() + " " + "FB".hashCode());
}

Prints 2236 2236.


The Java String#hashCode function isn't really even close to random. It's really easy to generate collisions for short strings, and it doesn't fare much better on long strings.

In general, even if you stuck to just 6 bits per character (ASCII letters and numbers, and a couple of symbols), you'd exceed the possible values of the 32-bit hash code with only a 6-character string -- that is, you would absolutely guarantee collisions among the 2^36 6-character 6-bit strings.

Entrust answered 12/9, 2014 at 1:10 Comment(5)
+1 for a direct [counter-]example. It would be interesting to see a string/hash collision graph, which probably does exist already..Intense
And in fact, this can be used to create a denial-of-service attack that turns HashMap's O(1) behavior into O(N): stackoverflow.com/questions/8669946Corkboard
@Corkboard problem solved with Java 8, released a few months before your comment ;^)Lorrainelorrayne
@Lorrainelorrayne What made you pop on this question all of a sudden? :-PCorkboard
@Corkboard searched for some other, string hashing related stuff and this was within the search resultsLorrainelorrayne
S
5

A hash code is 32 bits in size.

A char in Java is 16 bits in size.

So in theory, all 2-character strings could have different hash codes, although some of those hash codes would have to collide with the hash code of the empty string and single-character strings. So even taking "all strings of two characters or shorter" there will be collisions. By the time you've got 10 characters, there are way more possible strings than there are hash codes available.

Collisions are still likely to be rare, but you should always assume that they can happen.

Singlet answered 12/9, 2014 at 1:7 Comment(4)
+1 for pointing out that its not (necessarily) simply java having a stupid hash function, but a fundamental issue.Excite
Also probably worth noting that "rare" doesn't mean that rare. Using the birthday paradox, a collision wouldn't be unusual with a collection of 16k items, even with a well distributed hash. (See en.wikipedia.org/wiki/Birthday_attack for accurate numbers.)Excite
@MichaelAnderson: I guess it depends on what you mean by rare. The chances of there being at least one collision may be pretty high, but the chances of any specific pair colliding are fairly small. Regardless, I'm saying that you've got to account for it anyway, so it doesn't matter too much ;)Singlet
@Lorrainelorrayne I think you're missing the point I was trying to make. The other answers show that the distribution of Javas hash values for strings are bad for simple strings. This answer points out that even with a perfect hash distribution, we should expect collisions.Excite

© 2022 - 2024 — McMap. All rights reserved.