why does AWS Athena needs 'spill-bucket' when it dumps results in target S3 location
Asked Answered
D

2

5

why does AWS Athena needs 'spill-bucket' when it dumps results in target S3 location

WITH
( format = 'Parquet', 
parquet_compression = 'SNAPPY', 
external_location = '**s3://target_bucket_name/my_data**' 
) 
AS
WITH my_data_2 
AS 
    (SELECT * FROM existing_tablegenerated_data" limit 10)
SELECT *
FROM my_data_2;

Since it already has the bucket to store the data , why does Athena need the spill-bucket and what does it store there ?

Dikmen answered 24/2, 2021 at 8:54 Comment(0)
E
8

Trino/Presto developer here who was directly involved in Spill development.

In Trino (formerly known as Presto SQL) the term "spill" refers to dumping on disk data that does not fit into memory. It is an opt-in feature allowing you to process larger queries. Of course, if all your queries require spilling, it's more efficient to simply provision a bigger cluster with more memory, but the functionality is useful when larger queries are rare.

Spilling involves saving temporary data, not the final query results. The spilled data is re-read back and deleted before the query completes execution.

Emelda answered 24/2, 2021 at 13:33 Comment(0)
P
2

Athena uses Lambda functions to connect to External Hive data stores Because of the limit on Lambda function response sizes, responses larger than the threshold spill into an Amazon S3 location that you specify when you create your Lambda function. Athena reads these responses from Amazon S3 directly.

https://docs.aws.amazon.com/athena/latest/ug/connect-to-data-source-hive.html

Patron answered 2/6, 2021 at 7:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.