What Exactly is Hash Collision
Asked Answered
Q

4

23

Hash Collision or Hashing Collision in HashMap is not a new topic and I've come across several blogs and discussion boards explaining how to produce Hash Collision or how to avoid it in an ambiguous and detailed way. I recently came across this question in an interview. I had lot of things to explain but I think it was really hard to precisely give the right explanation. Sorry if my questions are repeated here, please route me to the precise answer:

  1. What exactly is Hash Collision - is it a feature, or common phenomenon which is mistakenly done but good to avoid?
  2. What exactly causes Hash Collision - the bad definition of custom class' hashCode() method, OR to leave the equals() method un-overridden while imperfectly overriding the hashCode() method alone, OR is it not up to the developers and many popular java libraries also has classes which can cause Hash Collision?
  3. Does anything go wrong or unexpected when Hash Collision happens? I mean is there any reason why we should avoid Hash Collision?
  4. Does Java generate or at least try to generate unique hashCode per class during object initiation? If no, is it right to rely on Java alone to ensure that my program would not run into Hash Collision for JRE classes? If not right, then how to avoid hash collision for hashmaps with final classes like String as key?

I'll be greateful if you could please share you answers for one or all of these questions.

Quote answered 21/8, 2017 at 11:11 Comment(14)
I suggest to reformulate this question. Instead of "share a wonderful article of blog" (which would be off-topic for StackOverflow), why not ask for the correct ansers and an explanation?Cordiform
Interview questions, like homework questions, are things you should tackle yourself. If you post them as question here on StackOverflow, you have to provide your own solution, and point out where your solution is stuck or lacking. As it stands, this is merely a question showing lack of research.Counterattraction
See TAOCP: Volume 3, Chapter 6.4 Hashing.Yarkand
there are a lot of articles about hash collisions, why not look at them first?Dethrone
@Counterattraction sorry, if my question seemed like a homework. If you could please consider the fact that I didn't ask what is hash collision or how to avoid id, rather my questions were more on the crisp points which were not very clearly written in most of the blogs.Quote
You don't, in general, "avoid" hash collision, but rather it's something which you usually assume can and will happen with your hashmap. The workaround which Java collection's HashMap uses is to maintain a tree of colliding values (used to be a list of values). If you find yourself having too many collisions, you might want to consider resizing the number of buckets in your hashmap. And your question is too broad, I agree with the other comments, you should narrow down what you want to ask here.Jennee
@SarthakMittal unfortunately I didn't realize the fact from reading several blogs that whether Hash Collision is a feature that Java is capable of handling or is it a pain that a developer must avoid. This one is not really clear to me.Quote
You did specify that these were interview questions. So they are the same as homework questions. And crisp or not - they do not follow the guidelines of SO (and I'm very surprised people upvoted this question). For one, you should always post a single question in a single post.Counterattraction
@Counterattraction on a funny note, if you are asked in an interview how to make a Mars Rover, that doesn't become an interview question or homework :)Quote
Hashing is mapping a "thing" to a numeric value. A hash collision is simply where that numeric value is equal for two unequal "things".Aurangzeb
@TimBiegeleisen I made a note from your response - resizing the number of buckets in hashmap could help me avoid Hash Collision or reduce its chances. Thanks.Quote
A detailed answer to your first question will answer all your questions, except the ones that come down to "should I trust that programmers of a popular 20-year-old language know what they're doing?". A detailed answer to your first question can be found by typing "what is a hash collision" into Google.Aristotle
@Dukeling I have read most of the links appearing at the top of the search. But these are the confusions which remained unclear to me, hence I asked the questions. Sorry, if the questions bothered you or hurt - atleast my last question was not really meant to incriminate Java language and its developers. For example if I ask, "should I rely on String class alone to get a set of ready to use string functions", that doesn't mean I am questioning the credibility of the developers of String class. I am basically looking for an answer like "there exists StringUtils from Apache Commons".Quote
@EJP RealSkeptic do you want me to drop the Questions altogether? I am not interviewing you guys! Sorry I never meant to offense anyone expect to clarify few doubts which otherwise an Expert friend could have answered. I don't have an expert friend so posted it here. Was it that bad?Quote
L
12

What exactly is Hash Collision - is it a feature, or common phenomenon which is mistakenly done but good to avoid?

It's a feature. It arises out of the nature of a hashCode: a mapping from a large value space to a much smaller value space. There are going to be collisions, by design and intent.

What exactly causes Hash Collision - the bad definition of custom class' hashCode() method,

A bad design can make it worse, but it is endemic in the notion.

OR to leave the equals() method un-overridden while imperfectly overriding the hashCode() method alone,

No.

OR is it not up to the developers and many popular java libraries also has classes which can cause Hash Collision?

This doesn't really make sense. Hashes are bound to collide sooner or later, and poor algorithms can make it sooner. That's about it.

Does anything go wrong or unexpected when Hash Collision happens?

Not if the hash table is competently written. A hash collision only means that the hashCode is not unique, which puts you into calling equals(), and the more duplicates there are the worse the performance.

I mean is there any reason why we should avoid Hash Collision?

You have to trade off ease of computation against spread of values. There is no single black and white answer.

Does Java generate or atleast try to generate unique hasCode per class during object initiation?

No. 'Unique hash code' is a contradiction in terms.

If no, is it right to rely on Java alone to ensure that my program would not run into Hash Collision for JRE classes? If not right, then how to avoid hash collision for hashmaps with final classes like String as key?

The question is meaningless. If you're using String you don't have any choice about the hashing algorithm, and you are also using a class whose hashCode has been slaved over by experts for twenty or more years.

Lisandra answered 21/8, 2017 at 11:24 Comment(7)
Thanks for taking time to answer each of the questions. I am just trying to understand the answers alongside what I already knew and learned. Before accepting this one as the 'answer' I would eagerly wait to see a few more answers or perhaps a little more words from you on the question "Does anything go wrong or unexpected when Hash Collision happens? I mean is there any reason why we should avoid Hash Collision?"Quote
@Quote I don't mean to deface EJP's good answer +1, but when a hash collision happens, you run the risk of the performance of the map decreasing. Why is this? Because now when you have a key, you can't just lookup the value, each key now points to some sort of secondary data structure (e.g. a list, a tree, etc.). In the worst case, a Java 8 HashMap will perform the same as a balanced binary tree. Not terrible, but certainly not O(1) lookup, rather O(lgN) lookup.Jennee
@TimBiegeleisen thanks again, now I can match what I learned earlier. I learned Hash Collision results in degraded performance and better to be avoided always, if possible. I read in an article that Balanced Binary Tree is introduced in Java 8. Prior to that it was chaining done though Linked List. Even Java 8 also maintains Linked List and moves bucket to balanced binary tree only when the collided key (duplicates) count exceeds a certain threshold (configurable) which is set to 8 by default. Is this information correct?Quote
@Quote Yes, that sounds correct, possibly taken directly from the Javadoc. But you don't "avoid" collisions, the way you would avoid a car accident by not driving drunk or driving in a snow storm. Rather, the implementation of HashMap is already optimized to resize the number of buckets, in general, when your map ends up in a situation where collisions would be likely. Increase the buckets, decrease the chance of collisions, but then you need more space for the map. It is a tradeoff.Jennee
@Quote I already answered the questions about 'does anything go wrong' etc. I don't have anything to add; I don't see that anybody else has added anything either; and I don't know what else you're looking for.Lisandra
@EJP I have some time to wait and see if anyone else want to add. Specially I found your last answer contradicting with the fact that String hashCode() is prone to collision as stated by ΦXocę 웃 Пepeúpa ツ I would rather wait an read a little more on this and see if anyone else can help. Anyway, Thanks for your long and time taking response :)Quote
@Quote What contradiction? He said exactly what I said. Any hashCode() method is 'prone to collision'. It's in the nature of the beast, and I said so over and over again. I also did not say that String.hashCode() will never produce a collision.Lisandra
S
5

Actually I think the hash collision is Normal. Let talk about a case to think. We have 1000000 big numbers(the set S of x), say x is in 2^64. And now we want to do a map for this number set. lets map this number set S to [0,1000000] .

But how? use hash!!

Define a hash function f(x) = x mod 1000000. And now the x in S will be converted into [0,1000000), OK, But you will find that many numbers in S will convert into one number. for example. the number k * 1000000 + y will all be located in y which because (k * 1000000 + y ) % x = y. So this is a hash collision.

And how to deal with collision? In this case we talked above, it is very difficult to delimiter the collision because the math computing has some posibillity. We can find a more complex, more good hash function, but can not definitely say we eliminate the collision. We should do our effort to find a more good hash function to decrease the hash collision. Because the hash collision increase the time cost we use hash to find something.

Simplely there are two ways to deal with hash collision. the linked list is a more direct way, for example: if two numbers above get same value after the hash_function, we create a linkedlist from this value bucket, and all the same value is put the value's linkedlist. And another way is that just find a new position for the later number. for example, if number 1000005 has took the position in 5 and when 2000005 get value 5, it can not be located at position 5, it then go ahead and find a empty position to took.

For the last question : Does Java generate or at least try to generate unique hashCode per class during object initiation?

the hashcode of Object is typically implemented by converting the internal address of the object into an integer. So you can think different objects has different hashcode, if you use the Object's hashcode().

Sayres answered 21/8, 2017 at 11:43 Comment(0)
B
1

What exactly is Hash Collision - is it a feature, or common phenomenon which is mistakenly done but good to avoid?

  • a hash collision is exactly that, a collision of that field hashcode on objects...

What exactly causes Hash Collision - the bad definition of custom class' hashCode() method, OR to leave the equals() method un-overridden while imperfectly overriding the hashCode() method alone, OR is it not up to the developers and many popular java libraries also has classes which can cause Hash Collision?

  • no, collision may happen because they are ruled by math probability and in such cases the birthday paradox is the best way to explain that.

Does anything go wrong or unexpected when Hash Collision happens? I mean is there any reason why we should avoid Hash Collision?

  • no, String class in java is very well developed class, and you dont need to search too much to find a collision (check the hascode of this Strings "Aa" and "BB" -> both have a collision to 2112)

to summarize: hashcode collision is harmless is you know what is that for and why is not the same as an id used to prove equality

Blandishments answered 21/8, 2017 at 11:22 Comment(2)
sorry for my poor Java knowledge on this part, but I really didn't understand this: String class in java is very well developed class, and you dont need to search too much to find a collision (check the hascode of this Strings "Aa" and "BB" -> both have a collision to 2112) If you could please shed some light?Quote
I mean that collisions may happen even on very well developed classes... the String class is an example... 2 instances of it produce hash collision and that doesnt mean that the class is broken or not well written...Josefinejoseito
S
1

What exactly is Hash Collision - is it a feature, or common phenomenon which is mistakenly done but good to avoid?

Neither... both... it is a common phenomenon, but not mistakenly done, that is good to avoid.

What exactly causes Hash Collision - the bad definition of custom class' hashCode() method, OR to leave the equals() method un-overridden while imperfectly overriding the hashCode() method alone, OR is it not up to the developers and many popular java libraries also has classes which can cause Hash Collision?

by poorly designing your hashCode() method, you can produce too many collisions, leaving you equals method un-overridden should not directly affect the number of collisions, many popular java libraries have classes that can cause collisions (nearly all classes actually).

Does anything go wrong or unexpected when Hash Collision happens? I mean is there any reason why we should avoid Hash Collision?

There is degradation in performance, that is a reason to avoid them, but the program should continue to work.

Does Java generate or at least try to generate unique hashCode per class during object initiation? If no, is it right to rely on Java alone to ensure that my program would not run into Hash Collision for JRE classes? If not right, then how to avoid hash collision for hashmaps with final classes like String as key?

Java doesn't try to generate a unique hash code during object initialization, but it has a default implementation of hashCode() and equals(). The default implementation works to know whether two object references point to the same instance or not, and doesn't rely on the content (field values) of the objects. Therefore, the String class has its own implementation.

Stillwell answered 21/8, 2017 at 11:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.