Hive query stuck at 99%
I am inserting records using a left join in Hive. When I add LIMIT 1 the query works, but for all records the query gets stuck at 99% of the reduce job.

Below query works

    Insert overwrite table tablename select a.id, b.name from a left join b on a.id = b.id limit 1;

But this does not

    Insert overwrite table tablename select table1.id, table2.name from table1 left join table2 on table1.id = table2.id;

I have increased the number of reducers, but it still doesn't work.

Lunchroom answered 21/7, 2015 at 2:28 Comment(2)
Go to the Hadoop JobTracker web dashboard and find the MapReduce jobs that failed. That may give you more details.Ingres
What is the size of the data? Check the ResourceManager/JobTracker UI and look at the job details: how many tasks were spawned for this job, how many succeeded, how many failed, and the reasons for the failures.Bungalow

If your query is getting stuck at 99%, check the following options:

  • Data skew: if your data is skewed, one reducer may end up doing all the work.
  • Duplicate keys on both sides: if you have many duplicate join keys on both sides, your output can explode and the query can get stuck.
  • If one of your tables is small, try a map join, or if possible an SMB (sort-merge-bucket) join, which is a huge performance gain over a reduce-side join.
  • Check the resource manager logs to see how much data the job is reading and writing.
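
The map-join and skew-join options above can be sketched as Hive session settings. The property names come from the Hive configuration documentation; the threshold values here are only illustrative, not recommendations:

    -- Convert the join to a map join when the small table fits in memory
    set hive.auto.convert.join=true;
    set hive.mapjoin.smalltable.filesize=25000000;  -- ~25 MB small-table threshold

    -- Handle skewed join keys at runtime
    set hive.optimize.skewjoin=true;
    set hive.skewjoin.key=100000;  -- keys with more rows than this are treated as skewed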
Amphibole answered 5/3, 2018 at 5:46 Comment(1)
One of your tables is small: how do you define small? 1:100, 1:1000, or 1:10000?Crew

Here are a few Hive settings that may help the query optimizer and reduce the overhead of data sent across the wire.

set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

However, I think there's a greater chance that the underlying problem is skew in the join key. For a full description of skew and possible workarounds, see https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization

You also mentioned that table1 is much smaller than table2. You might try a map-side join, depending on your hardware constraints (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins).
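
If you want to force the map-side join rather than rely on auto-conversion, a sketch using the MAPJOIN hint (table names taken from the question). Note that in a LEFT OUTER JOIN only the right-hand table can be hash-loaded into memory, since the left side must be streamed, so the hint names table2:

    set hive.auto.convert.join=true;
    -- or force it explicitly:
    insert overwrite table tablename
    select /*+ MAPJOIN(table2) */ table1.id, table2.name
    from table1 left join table2 on table1.id = table2.id;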

Webbed answered 28/2, 2016 at 5:15 Comment(0)

Hive automatically applies some optimizations to joins and loads one side of the join into memory if it fits the requirements. However, in some cases these jobs get stuck at 99% and never actually finish.

I have faced this multiple times, and the way I have avoided it is by explicitly specifying some settings to Hive. Try the settings below and see if they work for you.

    set hive.auto.convert.join=false;
    set mapred.compress.map.output=true;
    set hive.exec.parallel=true;
Lemire answered 21/7, 2015 at 13:42 Comment(3)
Amar, I tried all the settings, but the query still gets stuck at 99%.Lunchroom
Can you share roughly how much data you have in both tables?Lemire
table1 has 36.4 MB and table2 has 204 MB. I tried the query with LIMIT 1000 and it worked, but without the limit it just gets stuck.Lunchroom

Make sure you don't have rows with duplicate id values in one of your data tables!

I recently encountered the same issue, with a left join's MapReduce job getting stuck at 99% in Hue.

After a little snooping I discovered the root of my problem: one of my tables had rows with duplicate member_id join keys. Left joining all of the duplicate member_ids would have created a result containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop cluster.
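
A quick way to check for this condition, as a sketch; member_id comes from this answer, and my_table is a placeholder for your own table name:

    -- List join-key values that occur more than once
    select member_id, count(*) as cnt
    from my_table
    group by member_id
    having count(*) > 1
    order by cnt desc
    limit 20;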

Applicant answered 4/3, 2021 at 18:21 Comment(0)

Use these configuration settings and try again:

    set mapreduce.map.memory.mb=9000;
    set mapreduce.map.java.opts=-Xmx7200m;
    set mapreduce.reduce.memory.mb=9000;
    set mapreduce.reduce.java.opts=-Xmx7200m;

Obsolete answered 4/3, 2018 at 15:35 Comment(0)

I faced the same problem with a left outer join similar to:

select bt.*, sm.newparam
from big_table bt
left outer join small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate

I did an analysis based on the answers already given and saw two of the described problems:

The left table was more than 100x bigger than the right table:

select count(*) from big_table   -- returned 130M
select count(*) from small_table -- returned 1.3M

I also detected that one of the join columns was rather skewed in the right table:

select count(*), cate 
from small_table 
group by cate 

-- returned
-- A    70K
-- B   1.1M
-- C   120K

I tried most of the solutions given in the other answers, plus some extra parameters I found, without success:

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;

Lastly, I found that the left table had a really high percentage of NULL values in the join columns bt.ident and bt.cate.

So I tried one last thing, which finally worked for me: split the left table depending on whether bt.ident and bt.cate are NULL, and then UNION ALL both branches:

select * from
(select bt.*, sm.newparam
from (select * from big_table where ident is not null or cate is not null) bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
    union all
select nbt.*, null as newparam from big_table nbt where ident is null and cate is null) combined
Helmand answered 11/4, 2019 at 10:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.