Athena vs Redshift Spectrum

Asked 9/5, 2018 at 9:38 Answered 6/1, 2021 at 21:17

amazon-web-services amazon-redshift amazon-athena amazon-redshift-spectrum

I am kind of evaluating Athena & Redshift Spectrum. Both serve the same purpose, Spectrum needs a Redshift cluster in place whereas Athena is pure serverless. Athena uses Presto and Spectrum uses its Redshift's engine

Are there any specific disadvantages for Athena or Redshift spectrum? Any limitations on using Athena or Spectrum ?

Promissory answered 9/5, 2018 at 9:38 Comment(1)

Good Read: upsolver.com/blog/…. – Copilot 8/9, 2020 at 15:58

I have used both across a few different use cases and conclude:

Advantages of Redshift Spectrum:

Allows creation of Redshift tables
Able to join Redshift tables with Redshift spectrum tables efficiently

If you do not need those things then you should consider Athena as well

Athena differences from Redshift spectrum:

Billing. This is the major difference and depending on your use case you may find one much cheaper than the other
Performance. I found Athena slightly faster.
SQL syntax and features. Athena is derived from presto and is a bit different to Redshift which has its roots in postgres.
Connectivity. Its easy enough to connect to Athena using API,JDBC or ODBC but many more products offer "standard out of the box" connection to Redshift

Also, for either solution, make sure you use the AWS Glue metadata, rather than Athena as there are fewer limitations.

Limes answered 9/5, 2018 at 9:56 Comment(5)

Thanks a lot Mr. Jon Scott, – Promissory 9/5, 2018 at 10:0

hello, so athena costs less than spectrum ? also does using redshift spectrum increase the load on existing redshift usage – Coriander 2/8, 2020 at 2:44

In the 2 years since this post, there have been some improvements to redshift spectrum query planning however in general the findings still hold true. Redshift spectrum alone does not use much redshift resources, as it is just planning the query and pushing it down to Athena. – Limes 3/8, 2020 at 7:57

@JonScott Please can you expand on your assertion that Redshift spectrum "is just planning the query and pushing it down to Athena"? I don't have 1st hand evidence to the contrary but it seems at least a little surprising – Brewage 3/1, 2021 at 17:7

It is certainly NOT using your redshift cluster in a meaningful way. however you may be correct in that it is somehow separate from athena. The tests I did seems to show close alignment of athena and redshift however I cannot find any documentation off AWS to prove this either way. Since I posted the above answer, athena has more and more suffered from query queueing, I have not seen the same happen in Redshift spectrum. – Limes 5/1, 2021 at 10:9

This question has been up for quite a time, but still, I think I can contribute something to the discussion.

What is Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. (From the Doc)

Pretty straight forward, right?

Then comes the question of what is Redshift Spectrum and why Amazon folks made it when Athena was pretty much a solution for external table queries?

So, AWS folks wanted to create an extension to Redshift (which is pretty popular as a managed columnar datastore at this time) and give it the capability to talk to external tables(typically S3). But they wanted to make life easier for Redshift users, mostly analytics people. Many analytics tools don't support Athena but support Redshift at this time. But creating your Reshift cluster and storing data was a bottleneck. Again Redshift isn't that horizontally scalable and it takes some downtime in case of adding new machines. If you are a Redshift user, making your storage cheaper makes your life so much easier basically.

I suggest you use Redshift spectrum in the following cases:

You are an existing Redshift user and you want to store more data in Redshift.
You want to move colder data to an external table but still, want to join with Redshift tables in some cases.
Spark unloading of your data and if you just want to import data to Pandas or any other tools for analyzing.

And Athena can be useful when:

You are a new user and don't have Redshift cluster. Access to Spectrum requires an active, running Redshift instance. So Redshift Spectrum is not an option without Redshift.
As Spectrum is still a developing tool and they are kind of adding some features like transactions to make it more efficient.
BTW Athena comes with a nice REST API , so go for it you want that.

All to say Redshift + Redshift Spectrum is indeed powerful with lots of promises. But it has still a long way to go to be mature.

Eustis answered 28/1, 2019 at 19:46 Comment(1)

hello, i had a question, which approach is cost effective : using athena to join s3 tables and redshift tables ( using glue crawlers ) OR using redshift spectrum to join redshift tables with s3 external tables? does using redshift spectrum increase the load on existing redshift usage – Coriander 2/8, 2020 at 2:41

If you are using Redshift database then it will be wise to use Spectrum along with redshift to get the required performance.

However, if you are beginning to explore options then we can consider Athena as a tool to go ahead.

Excommunicative answered 15/5, 2018 at 15:45 Comment(3)

The data scan charges for spectrum & athena are same. Even if I use Redshift, why would I use Spectrum and add more burden on the Redshift cluster. If the cluster is already being used for reporting, then it will be an overkill to use Spectrum. Instead, I can leverage Athena and transfer the load to Athena. – Promissory 16/5, 2018 at 9:54

Does Spectrum uses red-shift cluster ? I thought its similar architecture to add more servers virtually to assist the redshift cluster. Only thing we do here is create external tables ? Isn't it ? – Excommunicative 24/5, 2018 at 22:19

@Ramakrishna there are some additional workers between the Redshift cluster and S3, and if you craft queries carefully there can be some push-down-predicate optimization at that level, but then data is read into the Redshift cluster and there is concurrency contention with existing Redshift usage. You can see it in the explain plans. The ideal use case is when you have a lot of data - store it cheaply on s3 - but don't need to scale your compute. With plain old Redshift, storage is tied to compute so both must scale together. – Mages 30/7, 2019 at 15:16

I had learned (from Adrian Cantril's/LA's 2019 SA Pro course) that Redshift Spectrum would use one's own Redshift cluster to provide more consistent performance than is available by leveraging the shared capacity which AWS makes available to Athena queries. I appreciate this information might only be useful for the exam, I didn't find his argument convincing.

I wrote this answer because I wasn't satisfied with the leading answer's treatment of Athena outperforming Redshift Spectrum. The rest of that answer is good and I do not mean to directly copy any of that here (without references it hadn't registered with me when I wrote this).

I (again, based solely on my hands-off research) would choose Spectrum when the majority of my data is in S3, which would typically be for the larger data sets. The recent RA3 instances seem to overlap this niche though. So I say Spectrum is most suited to where we have long term Redshift clusters that, being OLAP nodes, have spare capacity to query S3.

Why would you use your own estate to perform the queries that Athena would do without such an investment from you? Caching, where it fits. And consistent performance, if I am to believe Adrian Cantrill more than Jon Scott. This made me suspect RA3 might be edging Spectrum out; that and the lack of decent literature on Spectrum. Why would Amazon offer a serverless product in Athena that outperforms Redshift Spectrum which is more expensive? This is how they are choosing to deprecate RRS. I can't believe Spectrum is deprecated so must offer this answer to contest this. Just look at https://aws.amazon.com/redshift/whats-new/.

I think the picture below (from https://d1.awsstatic.com/events/Summits/AMER2020/May13SummitOnline/Modernize_your_data_warehouse.pdf) is fairly clear that compute nodes are influential here, and perhaps contrary to @JonScott's valuable insights above.

One final big difference is Athena is limited to IAM for authentication, as depicted in this reinvent 2018 (ANT201-R1) slide:

Brewage answered 6/1, 2021 at 21:17 Comment(0)

One big limitation and differing factor is the ability to use structured data. Athena supports it for both JSON and Parquet file formats while Redshift Spectrum only accepts flat data.

Another is the availability of GIS functions that Athena has and also lambdas, which do come in handy sometimes.

Now if you ran a standalone new Postgres then that does everything and more, but as far as comparison between Redshift (and Spectrum) goes - it's a tool that has stopped evolving.

Crypt answered 9/5, 2018 at 19:30 Comment(4)

Spectrum supported formats: JSON, ION, Parquet, RCFile, SequenceFile,TextFile, ORC, Avro – Socage 9/5, 2018 at 20:27

Yes it supports those formats, but only flat data written in those formats. You can write a struct and array in Parquet and can query it with Athena, but not through Spectrum. – Crypt 11/5, 2018 at 5:2

I think at the time this was posted, LauriK made a fair point on some of the above. As a data engineer, having the data nested in some use cases is highly paramount in order to store the data efficiently. Redshift Spectrum now supports nested data though aws.amazon.com/about-aws/whats-new/2018/08/… – Longevous 14/3, 2019 at 8:12

Spectrum still has a lot of limitations with nested data docs.aws.amazon.com/redshift/latest/dg/… Storing data efficiently is not the ideal use case for nested data. The common case is querying raw json and flattening it out to something useful like partitioned parquet or orc. – Mages 30/7, 2019 at 15:12

Recommended topics

Hot tags