Is Hive faster than Spark?

3

6

After reading "What is hive, Is it a database?", I heard from a colleague yesterday that he was able to filter a 15B table, do a "group by", and then join it with another table, producing 6B records, in only 10 minutes! I wonder whether this would be slower in Spark. With DataFrames the two may now be comparable, but I am not sure, hence the question.

Is Hive faster than Spark? Or is this question meaningless? Sorry for my ignorance.

He uses the latest Hive, which seems to be using Tez.

Subscribe answered 9/9, 2016 at 16:30 Comment(7)
Put them on equivalent hardware and run comparable workloads. You'll know the answer. :)Angi
Correct @SergioTulentsev, but wouldn't that be data-specific? What I am trying to ask here is something like "is Spark faster than Hadoop?". Even if I ran the experiment, I still wouldn't know why. I am trying to understand theoretically what would happen. :)Subscribe
Facebook has successfully ported a massive batch job from Hive to Spark. It took them several months of debugging (and 13 Spark JIRAs) and tuning. But now their job runs much faster. Are you up to the challenge?? code.facebook.com/posts/1671373793181703/…Grating
IBM tried to run a TPC-DS benchmark with Spark 2.0 at scale. But in the end they had to tweak a lot of configuration properties, both documented and undocumented, to make it through. Are you up to the challenge?? slideshare.net/jcmia1/apache-spark-20-tuning-guide/2Grating
@SamsonScharfrichter those are some really cool links, thank you! I can relate to what the first one says, from when I tried to scale a pipeline we had to 15T. Thank you!Subscribe
Sorry to add to your confusion, but you can run Hive on top of Spark as well (i.e., use Spark as the data processing engine for your queries). That approach will yield query latency in the same ballpark as that of Hive-on-Tez (while offering the opportunity to consolidate all your data processing onto the Spark API). Generally speaking, Hive and Spark SQL are intended for two different things and IMO they shouldn't be compared on a "performance" basis.Goldstein
@JustinKestelyn you did the right thing to comment, thank you, I see your point, makes sense! :)Subscribe
4

Hive is just a framework that gives SQL functionality to MapReduce-type workloads.

These workloads can run on MapReduce or Tez (with YARN managing the cluster resources).

So the real comparison is Hive on Tez vs. Hive on Spark. The article "When to go with ETL on Hive using Tez VS When to go with Spark ETL?" discusses this nicely (gist: use Hive on Spark if not sure).

Benchmark information (chart not reproduced here; lower is better)

Counterblow answered 9/9, 2016 at 16:50 Comment(6)
Krishna thank you very much. Stackoverflow appreciates links, but sometimes these links die and the future users can't be helped. Would you be so kind as to update your answer with the gist/intuition/basic idea of the article? :)Subscribe
@Subscribe thanks for the feedback. I will edit this answer.Counterblow
The chart needs to be updated, as we now have Spark 2.0 with a lot of optimizations: some queries run about 100x faster, and most queries about 10x faster, than in Spark 1.x :)Fairy
@T.Gawęda good point! Should you find something better, please post an answer! :)Subscribe
@Subscribe Yes, I will write a longer answer focusing on how Spark supports Hive, but tomorrow; it is night in Poland now ;)Fairy
Can you change the line from "can run in mapreduce or yarn" to "can run on mapreduce or tez"?Unconcern
4

Spark is convenient, but its SQL performance does not hold up all that well at scale.

Hive has amazing support for co-partitioned joins. When the tables you are joining have hundreds of millions to billions of rows, you will really appreciate the fine-grained join support via:

  • similar distribute by and sort by (or cluster by)
  • bucketed joins
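To make the bucketed-join idea concrete, here is a minimal pure-Python sketch (not Hive's actual implementation). It assumes both tables are bucketed on the join key with the same bucket count; `NUM_BUCKETS`, `bucket_of`, `bucketize` and `bucketed_join` are hypothetical names, and Python's `hash()` merely stands in for whatever hash the engine uses. The point is that each bucket only has to be matched against the same-numbered bucket of the other table, so no global shuffle is needed.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count; both tables must use the same value


def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign a key to a bucket. hash() is a stand-in for the engine's bucketing hash."""
    return hash(key) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows by bucket, mimicking how a bucketed table is laid out."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Inner-join two identically bucketed tables, one bucket pair at a time."""
    joined = []
    for bucket_id, left_rows in left_buckets.items():
        # Only the same-numbered bucket of the other table is consulted.
        lookup = defaultdict(list)
        for key, value in right_buckets.get(bucket_id, []):
            lookup[key].append(value)
        for key, left_value in left_rows:
            for right_value in lookup.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Toy data: (join key, payload)
orders = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (1, "order-d")]
users = [(1, "alice"), (2, "bob"), (3, "carol")]

result = sorted(bucketed_join(bucketize(orders), bucketize(users)))
print(result)
```

In a real cluster each bucket pair could be handled by a separate task, which is where the benefit on billion-row tables would come from.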

Hive has extensive support for metadata-only queries; Spark has only had a glimmer of it since 2.1.
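As a rough illustration of what "metadata-only" means, the toy sketch below answers a `MAX` over a partition column in two ways: by scanning every data file, and by looking only at the list of partitions. Everything here is hypothetical (the `CATALOG` layout, the `dt` partition column, the simulated `read_file`); it only shows why the second approach does no data I/O, not how Hive or Spark implement it.

```python
# Hypothetical catalog: partition value (a date string) -> data files in that partition.
CATALOG = {
    "2016-09-07": ["part-0000", "part-0001"],
    "2016-09-08": ["part-0000"],
    "2016-09-09": ["part-0000", "part-0001", "part-0002"],
}

files_read = 0  # counts simulated file reads


def read_file(partition, filename):
    """Simulate scanning a data file; every row carries its partition value."""
    global files_read
    files_read += 1
    return [{"dt": partition, "payload": f"{filename}-row{i}"} for i in range(3)]


def max_dt_full_scan(catalog):
    """Answer MAX(dt) the slow way: read every file and inspect every row."""
    return max(
        row["dt"]
        for partition, files in catalog.items()
        for filename in files
        for row in read_file(partition, filename)
    )


def max_dt_metadata_only(catalog):
    """Answer MAX(dt) from the partition list alone; no data file is opened."""
    return max(catalog)


scan_answer = max_dt_full_scan(CATALOG)
reads_for_scan = files_read

files_read = 0  # reset the counter before the metadata-only run
metadata_answer = max_dt_metadata_only(CATALOG)
reads_for_metadata = files_read

print(scan_answer, reads_for_scan)
print(metadata_answer, reads_for_metadata)
```

Both approaches return the same answer, but only the full scan touches the data files.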

Spark runs out of steam quickly once the number of partitions exceeds roughly 10K. Hive does not suffer from this limitation.

Geomancy answered 20/9, 2017 at 5:18 Comment(0)
1

Fast forward to 2018, Hive is much faster (and more stable) than SparkSQL, especially in concurrent environments, according to the following article:

https://mr3.postech.ac.kr/blog/2018/10/31/performance-evaluation-0.4/

The article compares several SQL-on-Hadoop systems using the TPC-DS benchmark (1TB, 3TB, 10TB) on three clusters (11 nodes, 21 nodes, 42 nodes):

  • Hive-LLAP included in HDP (Hortonworks Data Platform) 2.6.4
  • Hive-LLAP included in HDP 3.0.1
  • Presto 0.203e (with cost-based optimization enabled)
  • Presto 0.208e (with cost-based optimization enabled)
  • SparkSQL 2.2.0 included in HDP 2.6.4
  • SparkSQL 2.3.1 included in HDP 3.0.1
  • Hive 3.1.0 running on top of Tez
  • Hive on Tez included in HDP 3.0.1
  • Hive 3.1.0 running on top of MR3 0.4
  • Hive 2.3.3 running on top of MR3 0.4

So, in comparison with Hive-based systems and Presto, SparkSQL is very slow and does not scale in concurrent environments. (Note that the experiment uses SparkSQL running on vanilla Spark.)

Garibull answered 2/11, 2018 at 1:51 Comment(1)
I don't have an installation to check that now, so I can't say more, but others might find that useful.Subscribe
