Presto vs Impala: architecture, performance, functionality

Could you highlight the major differences between the two in architecture and functionality as of 2019? And how do those differences affect performance?

For some reason this excellent question was tagged as opinion-based.

Extra question: why did Amazon decide to go with Presto as the engine for Athena? Is it in any way better than Impala?

UPD

If PrestoDB and Impala are the same, why do they differ so much in hardware requirements? Presto asks for 16 GB+ of RAM while Impala asks for 128 GB+ of RAM.

Lathe answered 10/12, 2019 at 21:38 Comment(2)
That 128 is not for heap... If you read further down in the Impala docs, it says only 8 for heap. - Newfangled
Probably for the same reason that it recommends nodes with 12 or more disks. The Impala requirements appear to be a peak-performance recommendation, whereas the Presto figure is more like a minimum acceptable level. I've played around with Presto and had it working on nodes with 2 GB of RAM. I wouldn't recommend it, and even with 8 GB nodes I ran out of memory when doing aggregate queries (count, avg, etc.) on large data sets. Impala may just be MUCH more conservative in its minimum recommendations. Also, Presto has support for querying S3 files directly, which may be one reason they chose it for AWS. - Lebbie

Technical architecture, performance and functionality could each be a very detailed subject, but here are some key highlights, based on how both engines have evolved over the years:

  1. Presto has always been tested at the PB scale of data-driven companies such as Facebook, Netflix, Airbnb, Pinterest and Lyft. Impala probably has not had deployments of that size (there may well have been some, but those stories are not widely known publicly).
  2. Because of that, Presto has always had a diverse and fast-moving community that helped build a robust engine.
  3. Presto is very close to ANSI SQL compliance, which helps its adoption by the traditional data community.

- Ashish Dubey (Qubole)

Gutsy answered 11/12, 2019 at 0:48 Comment(1)
I would add that Impala supports more than just Hive-like connections. - Newfangled

I only came across this recently but want to clarify a misconception.

The Apache Impala minimum memory requirements are not a hard minimum - all functionality works fine with 4-8GB of memory (I use this every day). I would actually guess that, at least for the last few years, Impala is more tolerant of lower memory levels because it has a much more mature memory management and spill-to-disk implementation.
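
To make "spill-to-disk" concrete, here is a minimal toy sketch in Python of a group-by count that writes partial results to temporary files once an in-memory budget is reached. This is only an illustration of the general idea under simplifying assumptions; the function name `count_by_key` and the `max_keys_in_memory` budget are hypothetical, and it is not how Impala or Presto actually implement spilling.

```python
import os
import pickle
import tempfile
from collections import Counter


def count_by_key(keys, max_keys_in_memory):
    """Toy GROUP BY key / COUNT(*) with a crude memory budget.

    Hypothetical illustration only; not how Impala or Presto implement it.
    Returns (counts, number_of_spills).
    """
    partial = Counter()   # in-memory partial aggregate
    spill_paths = []      # files holding partial aggregates that were evicted
    total = Counter()

    with tempfile.TemporaryDirectory() as spill_dir:
        for key in keys:
            partial[key] += 1
            if len(partial) >= max_keys_in_memory:
                # Budget reached: write the partial counts out and free memory.
                path = os.path.join(spill_dir, f"spill_{len(spill_paths)}.pkl")
                with open(path, "wb") as f:
                    pickle.dump(dict(partial), f)
                spill_paths.append(path)
                partial.clear()

        # Merge what is left in memory with everything that was spilled.
        # Simplification: a real engine would partition spilled data so that
        # each merge step also fits in the budget; this toy merges it all at once.
        total.update(partial)
        for path in spill_paths:
            with open(path, "rb") as f:
                total.update(pickle.load(f))

    return dict(total), len(spill_paths)
```

With a small budget the query still completes, just with extra disk I/O; with a large budget nothing is spilled. That trade-off is why a low-memory node can work but run slower.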

The 128GB recommendation is based on our experience with what you would want for a heavily used production cluster with a demanding workload - one of the worst mistakes people make when planning a deployment is trying to squeeze the memory requirements. It may be a little conservative but we really don't want to recommend something that would be under-resourced and lead to a bad experience.

As for the architectural differences: the Impala dev team at Cloudera has been focused on building a product that works for our 1000s of customers, rather than building software for our own use. What I've learned is that it's actually harder to build things that scale to 1000s of customers than it is to build things that scale to 1000s of nodes in specific deployments.

That means that every feature has to be built robustly and generally enough to handle being put through the paces by all of our customers - if there are any issues, it always comes back to us. We like to say that our customers are going to "use it in anger" - i.e. they are going to push everything to the limit.

We also have a heavy focus on security features that are critical to enterprise customers - authentication, column-level authorization, auditing, etc.

I don't want to get too much into benchmark debates, but I'll say that using the MPP architecture and technologies like LLVM has always given Impala a performance edge, and I think we stack up well in any apples-to-apples comparison, particularly on concurrent workloads. I fairly often hear about migrations from Presto-based technologies to Impala leading to dramatic performance improvements.

One disadvantage Impala has had in benchmarks is that we focused more on CPU efficiency and horizontal scaling than vertical scaling (i.e. using all of the CPUs on a node for a single query). That was the right call for many production workloads but is a disadvantage in some benchmarks. We've been addressing that over the last 8-9 months and we're also about to release some multithreading improvements that lead to 2-4x speedups on query latency on standard benchmarks in the upcoming Impala 4.0.

Reveal answered 15/8, 2020 at 19:54 Comment(0)

Most of the answers here smell of marketing, particularly for Presto. Having used both at large scale in production, I can comfortably say the following:

  • For OLAP, Presto is a dog. Sorry, this is just a fact. It is not a data warehouse made for high-performance OLAP queries. It is a query engine made for data fusion, and it should be used for that. "Similar architectures" means less than nothing; the specifics of the implementation matter the most, and any DB engineer will tell you the same. Frankly, the use of Java should tell you everything you need to know.
  • Lots of crap gets written and released by large companies. Real engineers stop stanning and start benchmarking at some point. When you see a product like Presto that's super light on benchmarks and comparisons, you should be very suspicious.
Hatch answered 22/3, 2022 at 17:21 Comment(0)

Presto and Impala are very similar technologies with quite similar architectures. If you go by the benchmarks available on the internet, you can find results favoring either one, depending on who wrote them.

So it comes down to how many communities back each technology, and Presto has an edge there: e.g. Teradata, Qubole, Starburst, AWS Athena, etc.

Just to highlight: Presto is very versatile in the use cases it covers. It supports sources like Hive, S3/Blob/GCS, many RDBMSs, NoSQL DBs, etc.; a single query can fetch data from multiple sources; and the architecture is simple, with less tuning required.
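
As a rough sketch of what a single query over multiple sources looks like: Presto addresses tables as `catalog.schema.table`, so one statement can join tables that live in different systems. The small Python helper below just builds such a statement as a string. The catalog names (`hive`, `mysql`), the schemas, tables and the `customer_id` column are all made-up placeholders; real catalog names depend on how the cluster is configured.

```python
def qualified(catalog, schema, table):
    """Build a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql(left_table, right_table, join_key):
    """Return a SQL string joining two tables that may sit in different catalogs.

    Illustration only: identifiers are inserted as-is, with no quoting or
    validation, so do not feed it untrusted input.
    """
    return (
        f"SELECT l.*, r.* "
        f"FROM {left_table} AS l "
        f"JOIN {right_table} AS r ON l.{join_key} = r.{join_key}"
    )


# Hypothetical example: orders stored in Hive, customers stored in MySQL.
orders = qualified("hive", "sales", "orders")
customers = qualified("mysql", "crm", "customers")
query = federated_join_sql(orders, customers, "customer_id")
```

The resulting string can be submitted through whatever Presto client you already use; the point is only that both sources appear in the same statement.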

Thievery answered 12/12, 2019 at 16:8 Comment(2)
If Presto and Impala are very similar technologies, then why do their minimal RAM requirements differ by almost 10 times? Please take a look at the UPD section of my question. - Lathe
@Lathe Both technologies are memory intensive, and there is no hard and fast rule behind the 128 GB RAM figure for Impala because it depends entirely on the size of the data and the kind of queries. One point to note: Impala has supported spill-to-disk for a long time (so lower memory would also work, at a cost in performance), while Presto only recently started on that feature, which may take some time to mature. - Thievery
