Performance of Apache Drill

Asked 22/8, 2015 at 6:44 Answered 7/10, 2016 at 9:44

Solved hadoop hive impala apache-drill apache-tez

Are there any genuine performance benchmarks that compare Stinger, Impala, and Drill? Also, which is preferred? My use case is mainly ad-hoc interactive queries on top of Hive. Thanks.

Metopic answered 22/8, 2015 at 6:44 Comment(0)

There are some performance numbers on the site http://allegro.tech/fast-data-hackathon.html.

In general, we see that Drill and Impala are comparable in performance for interactive queries. Drill's differentiators are its ability to query data without metadata definitions and its ease of use with JSON data.

Note that these tests were run on much older versions of Drill, such as 0.8/0.9, which were also not configured appropriately for data locality. Drill is now at 1.1, with many improvements in SQL support (window functions, etc.) and in performance.

Pauper answered 26/8, 2015 at 18:16 Comment(2)

Thanks for your reply. What are your views on Stinger.next? How does it compare against Drill? Are there any benchmarks to determine which is faster? – Metopic 27/8, 2015 at 3:4

Also, can Drill perform well on datasets that run to terabytes? I read that Impala and Presto are not suitable for complicated queries on huge datasets. – Metopic 27/8, 2015 at 3:18

You cannot benchmark like this; it makes no sense, and you should never trust such a benchmark.

Everything depends on your own data. If you have JSON files, prefer Drill. If you want to query more than 1 TB, prefer Hive, and so on.

Also consider the storage format: JSON, Kudu, Parquet, or ORC.

Then comes optimization. Hive+Tez seems better for parallel queries but is very slow for a single query, whereas Impala is the opposite (MapReduce versus massively parallel processing).

You should also consider hardware resources: whether the disks are SSDs, and so on.

I recommend starting with Apache Drill on JSON files, then trying Apache Drill with Parquet or ORC.
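To make the JSON-then-Parquet suggestion concrete, here is a minimal sketch in Python that only builds the Drill SQL statements and the request body for Drill's REST endpoint (`POST /query.json`). The file path `/data/events.json`, the table name `events_parquet`, and the helper function names are hypothetical; it assumes the default `dfs` storage plugin with its writable `dfs.tmp` workspace, and that `store.format` is left at its default of `parquet`. Nothing is sent over the network, so adapt it to your own setup before relying on it.

```python
import json


def json_query(path, limit=10):
    """SQL that queries a raw JSON file in place (no table definition needed)."""
    return f"SELECT * FROM dfs.`{path}` LIMIT {limit}"


def parquet_ctas(table, path):
    """SQL that copies the JSON data into a new table in dfs.tmp.

    Assumes `store.format` is still at its default (parquet), so the
    CREATE TABLE AS output is written as Parquet files.
    """
    return f"CREATE TABLE dfs.tmp.`{table}` AS SELECT * FROM dfs.`{path}`"


def rest_payload(sql):
    """JSON body for a POST to Drill's REST endpoint /query.json."""
    return json.dumps({"queryType": "SQL", "query": sql})


if __name__ == "__main__":
    path = "/data/events.json"  # hypothetical sample file
    steps = [
        json_query(path),                      # step 1: query the JSON directly
        parquet_ctas("events_parquet", path),  # step 2: convert to Parquet
        "SELECT * FROM dfs.tmp.`events_parquet` LIMIT 10",  # step 3: query Parquet
    ]
    for sql in steps:
        print(rest_payload(sql))
```

Running the same query against the JSON file and against the Parquet copy, on your own data and hardware, is a more trustworthy comparison than any published benchmark.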

If you want help, describe exactly what you have (data + hardware) and what you want.

Sedulous answered 7/10, 2016 at 9:44 Comment(4)

Hi Thomas, I am trying to run large Drill queries on a single node with 512 GB RAM and 48 CPUs. The query takes too long for around 30 GB of data: more than 1 hour to finish aggregating all records. Are there any tuning parameters I should check for this? – Mayers 16/1, 2017 at 9:59

One node? You need to understand what Drill is: like PrestoDB and Impala, it's an MPP (massively parallel processing) engine, so it's better to have several nodes ^^ – Sedulous 16/1, 2017 at 10:38

Since we have 48 CPUs, can we parallelize across them? – Mayers 16/1, 2017 at 11:51

I guess what he could have said is that the point of Drill is to distribute the work among many small, cheap workers to process huge amounts of data. If all your data fits in memory, you might be better off using something else; there are some great in-memory databases. – Usanis 7/10, 2018 at 15:16
