Left Join Vs Inner Join in Hive -- internals and performance on multiple joins, map joins
Asked Answered
E

0

8

Does anyone know if there is a difference in performance for left join vs inner join in Hive, with Map Join enabled via hive.auto.convert.join=True?

The reason I ask, per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-JoinOptimization

Outer joins offer more challenges. Since a map-join operator can only stream one table, the streamed table needs to be the one from which all of the rows are required. For the left outer join, this is the table on the left side of the join; for the right outer join, the table on the right side, etc. This means that even though an inner join can be converted to a map-join, an outer join cannot be converted. An outer join can only be converted if the table(s) apart from the one that needs to be streamed can be fit in the size configuration.

It seems like this is saying (a) an outer join can't be converted to an inner join at all and (b) it can only be converted if the table that doesn't need to be streamed is the "left join" table(s). Does anyone know which one it is?

Also, is there a difference in performance for INNER JOIN vs LEFT JOIN in general, in Hive, as there is in SQL? Does that difference become more magnified (and or, start to exist in the first place) when several left joins are involved? The reason I ask is I'm considering adding several dummy entries to some left joined lookup tables to convert my joins to inner... intuitively it seems like it might make a difference, performance wise, but I can't find any documentation or discussion either way. Curious if anyone has experience with this.

Etty answered 18/6, 2014 at 21:45 Comment(3)
I am wondering the same question... I have a table that OUGHT to have all of the entries of the other table, but to be safe I've been doing a left join and later filtering out the NULLs. If someone can convince me that an inner join is always faster, I'll switch to using that instead.Maronite
Why don't you just measure? I have never seen performance difference between INNER and LEFT joins. Maybe this difference lies in measurement error from cluster loading and parallel execution.Discernible
@Evan Volgas what's the verdict of your findings ?Crabbe

© 2022 - 2024 — McMap. All rights reserved.