Will Spark SQL completely replace Apache Impala or Apache Hive? [closed]

Asked 25/10, 2016 at 9:37 Answered 6/3 at 19:52

I need to deploy a big data cluster on our servers, but I only know Apache Spark. I need to know whether Spark SQL can completely replace Apache Impala or Apache Hive.

I need your help. Thanks.

Kaiulani answered 25/10, 2016 at 9:37 Comment(0)

I would like to explain this with real-world scenarios.

In real-world production projects:

Hive is used mostly for storing data/tables and running ad-hoc queries. If an organisation's data is growing day by day and it currently queries that data in an RDBMS, it can use Hive.

Impala is used for business intelligence projects where the reporting is done through a front-end tool such as Tableau or Pentaho.

Spark is mostly used for analytics, where developers lean more towards statistics; they can also use the R language with Spark to build their initial data frames.

So the answer to your question is no: Spark will not replace Hive or Impala, because all three have their own use cases and benefits. Also, how easy each of these query engines is to implement depends on your Hadoop cluster setup.

Here are some links which will help you understand more clearly:

http://db-engines.com/en/system/Hive%3BImpala%3BSpark+SQL

http://www.infoworld.com/article/3131058/analytics/big-data-face-off-spark-vs-impala-vs-hive-vs-presto.html

https://www.dezyre.com/article/impala-vs-hive-difference-between-sql-on-hadoop-components/180

Tease answered 25/10, 2016 at 10:16 Comment(0)

No. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Impala - open source, distributed SQL query engine for Apache Hadoop.

Hive - an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Refer: Differences between Hive and Impala

Apache Spark has connectors to various data sources and processes the data it reads from them. Hive provides a query engine which, when integrated with Spark, can help speed up querying.

Spark SQL can use the Hive metastore to get the metadata of the data stored in HDFS. This metadata enables Spark SQL to better optimize the queries that it executes. Here Spark is the query processor.
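As a rough illustration of the setup described above, here is a minimal PySpark sketch. It assumes a cluster where Spark can reach the Hive metastore (typically through a hive-site.xml visible to Spark). The helper names (`hive_session`, `row_count_query`, `count_rows`) and any table name passed in are hypothetical; `enableHiveSupport()` and `spark.sql()` are standard SparkSession calls.

```python
def hive_session(app_name="metastore-demo"):
    """Build a SparkSession that resolves table names through the Hive metastore."""
    # Imported lazily so this file still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # use the Hive metastore as Spark's catalog
        .getOrCreate()
    )


def row_count_query(table):
    """Return a Spark SQL statement that counts the rows of a metastore table."""
    return f"SELECT COUNT(*) AS n FROM {table}"


def count_rows(spark, table):
    """Run the count on Spark; the metastore only supplies schema and location."""
    return spark.sql(row_count_query(table)).collect()[0]["n"]
```

With a live session, `count_rows(hive_session(), "sales.orders")` would be planned and executed by Spark itself; `sales.orders` is a made-up table name.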

Refer: Databricks blog

Jagatai answered 25/10, 2016 at 10:10 Comment(0)

Apache Impala provides low-latency access to data and is generally used with front-end business intelligence applications.

Apache Hive is more suitable for batch processing where query latency isn’t a concern, e.g. data processing for financial applications based on end-of-day attributes (like the value of a stock at close of business).

While Apache Spark has varied applications, from streaming to machine learning, it is also being used for batch ETL processing. The enhanced Dataset-based Spark SQL API available in Spark 2+ has improved components in the form of the Catalyst query optimizer and WholeStageCodeGen. I have observed improvements in the order of 50-90% faster execution time for some Hive scripts that were translated from HiveQL to Scala on Spark.
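To make the kind of translation mentioned above concrete, here is a small sketch of one aggregation written both as a HiveQL string and with the DataFrame API. It uses PySpark rather than Scala for brevity, and the `stock_prices` table with its `symbol` and `close_price` columns is a made-up example, not taken from the scripts mentioned above. It says nothing about performance by itself.

```python
# Hypothetical HiveQL script: highest closing price per stock symbol.
HIVEQL = """
SELECT symbol, MAX(close_price) AS max_close
FROM stock_prices
GROUP BY symbol
"""


def max_close_sql(spark):
    """Run the HiveQL text unchanged through Spark SQL."""
    return spark.sql(HIVEQL)


def max_close_df(spark):
    """Express the same aggregation with the DataFrame API."""
    # Imported lazily so this file still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    return (
        spark.table("stock_prices")
        .groupBy("symbol")
        .agg(F.max("close_price").alias("max_close"))
    )
```

Both functions expect an existing SparkSession and return a DataFrame with the columns `symbol` and `max_close`. Running the original query text through `spark.sql` first is often the lowest-effort starting point for a migration.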

A few challenges in moving from HiveQL to dataset-based Spark API:

Lack of a sweet SQL-like syntax present in Hive.
Incomplete integration of the dataset API with Scala language constructs.
Lack of compile-time error reporting in some dataset operations.

Winifield answered 4/3, 2019 at 12:54 Comment(0)

This is a good question. I think it will not. Even though Spark is faster than the other two, each of them has its own purpose and way of working. For example, Hive and Impala will be easier to use for those who are familiar with a query language, and Spark can use the Hive metastore for better optimization. So I think it will not completely replace them.

Clemens answered 25/10, 2016 at 10:3 Comment(0)

Things have changed a lot since the question was originally asked.

It now seems advisable to move from Hive to Spark SQL, for three reasons:

Speed of results
Compute efficiency
Total Cost

Some studies report speed improvements of more than 80% and cost savings of 50% with Spark SQL.

I'm not sure, but Impala doesn't seem to be used much any more anyway.

Lastly, the open-source community is giving much better support to Spark than to Hive.

Playbook answered 6/3 at 19:52 Comment(0)
