Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)

I want to do some "near real-time" data analysis (OLAP-like) on data stored in HDFS.
My research showed that the three frameworks mentioned report significant performance gains compared to Apache Hive. Does anyone have practical experience with any of them? Not only concerning performance, but also with respect to stability?

Sensibility answered 25/6, 2013 at 6:18 Comment(0)

Comparing Hive with Impala, Spark or Drill sometimes seems inappropriate to me. The goals behind developing Hive and these tools were different. Hive was never developed for real-time, in-memory processing and is based on MapReduce. It was built for offline batch processing. It is best suited to long-running jobs that perform data-heavy operations, like joins on very large datasets.

On the other hand, these tools were developed with real-time use in mind. Go for them when you need to query, in real time, data that is not very large and can fit into memory. I'm not saying you can't run queries on your big data using these tools, but you would be pushing the limits if you ran real-time queries on PBs of data, IMHO.

Quite often you will have seen (or read) that a particular company has several PBs of data and successfully caters to the real-time needs of its customers. But these companies are actually not querying their entire data most of the time. So the important thing is proper planning: when to use what. I hope you get the point I'm trying to make.

Coming back to your actual question, in my view it is hard to provide a reasonable comparison at this time, since most of these projects are far from complete. They are not production ready yet, unless you are willing to do some (or maybe a lot of) work on your own. And each of these projects has certain goals that are very specific to that particular project.

For example, Impala was developed to take advantage of existing Hive infrastructure so that you don't have to start from scratch. It uses the same metadata that Hive uses. Its goal was to run real-time queries on top of your existing Hadoop warehouse. Drill, by contrast, was developed to be not only a Hadoop project, and to provide distributed query capabilities across multiple big data platforms, including MongoDB, Cassandra, Riak and Splunk. Shark is compatible with Apache Hive, which means that you can query it using the same HiveQL statements as you would through Hive. The difference is that Shark can return results up to 30 times faster than the same queries run on Hive.

Impala is doing well at present and some folks have been using it, but I'm not that confident about the other two. All these tools are good, but a fair comparison can be made only after you try them on your data and for your processing needs. In my experience, Impala would be the best bet at this moment. I am not saying the other tools are not good, but they are not yet mature enough. If you wish to use Impala with your already running Hadoop cluster (plain Apache Hadoop, for example), you might have to do some additional work, as almost everybody uses Impala as a CDH feature.

Note: All of this is based solely on my experience. If you find something wrong or inappropriate, please do let me know. Comments and suggestions are welcome. And I hope this answers some of your queries.

Maximilianus answered 25/6, 2013 at 18:7 Comment(6)

Thx for the comprehensive answer. It seems to confirm the results of my research on most points. Right now I am POCing some of my use cases in Spark to get some hands-on experience. To me it looks way better documented than Impala (all the academic papers about it are available) and the API is clean and concise. But we will see. Also, I compared Hive to the real-time frameworks because they tend to compare themselves to it instead of to each other. Probably to show off the nice performance gains. – Sensibility 26/6, 2013 at 8:8

Oh, absolutely..You got the point :)..Good luck with your POC. – Maximilianus 26/6, 2013 at 12:7

One thing to keep in mind - Impala has a major limitation: your intermediate query result must fit in memory. So if the result of your GROUP BY query exceeds 30GB (your machine's RAM, for example) before the HAVING clause is applied, which effectively trims it to 1MB of data, the query will fail. This is not the case in other MPP engines like Apache Drill. – Rep 6/5, 2014 at 19:39
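
To illustrate the comment above: a toy Python sketch of why the intermediate GROUP BY state can be far larger than the final result. This is not Impala code and says nothing about how any engine is implemented; the function name `group_by_having` and the data are made up for the example. It only shows that every distinct key has to be held until aggregation finishes, and that the HAVING-style filter can only trim the result afterwards.

```python
from collections import defaultdict


def group_by_having(rows, min_total):
    """Aggregate (key, value) rows, then apply a HAVING-style filter.

    Returns the number of groups held before filtering and the filtered result.
    """
    totals = defaultdict(int)
    for key, value in rows:
        # Intermediate state: one entry per distinct key, all kept in memory.
        totals[key] += value

    groups_before = len(totals)

    # HAVING SUM(value) >= min_total: runs only after all rows are aggregated.
    result = {k: v for k, v in totals.items() if v >= min_total}
    return groups_before, result


# Hypothetical data: 100,000 distinct keys, only one of which passes the filter.
rows = [(f"user{i}", 1) for i in range(100_000)]
rows.append(("user0", 999))

groups_before, result = group_by_having(rows, min_total=1000)
print(groups_before, len(result))
```

The final result here is a single row, but the aggregation still had to track all 100,000 groups first. An engine that cannot spill that intermediate state to disk is bounded by available memory, regardless of how small the final answer is.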

"your existing Hadoop warehouse" - If you want to query MongoDB, you can use a SerDe to do so via an external table in Hive, right? So Apache Drill doesn't have any advantage over Impala on this pluggable format aspect. – Rep 9/5, 2014 at 6:57

I don't think "they are not yet mature enough" is a useful thing to say. Could you point out some verifiable facts instead? I'm not even sure what is implied. Too many bugs? Incompatibilities? Small community? I only use Spark from the list, but wouldn't say I experienced any of these. – Mandalay 10/6, 2014 at 13:38

Spark SQL, Drill, and the others now have supported releases and lots of interesting use cases, so I agree that it is time to explain appropriate use cases for each. – Undeniable 23/5, 2015 at 5:14

Here is an answer to "How does Impala compare to Shark?" from Reynold Xin, who leads the Shark development effort at UC Berkeley AMPLab.

Bertrand answered 31/10, 2013 at 9:11 Comment(0)
