I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam from
big_table bt
left outer join
small_table st
on bt.ident = sm.ident
and bt.cate - sm.cate
I made an analysis based on the already given answers and I saw two of the given problems:
Left table was more than 100x bigger than the right table
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join variable was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in other answers plus some extra parameters I found here Without success.:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Lastly I found out that the left table had a really high % of null values for the join columns: bt.ident
and bt.cate
So I tried one last thing, which finally worked for me: to split the left table depending on bt.ident
and bt.cate
being null or not, to later make a union all
with both branches:
select * from
(select bt.*, sm.newparam from
select * from big_table bt where ident is not null or cate is not null
left outer join
small_table st
on bt.ident = sm.ident
and bt.cate - sm.cate
union all
select *, null as newparam from big_table nbt where ident is null and cate is null) combined