Aggregate functions in S3 Select
Asked Answered
W

1

5

Amazon's S3 Select allows a user to write SQL against S3 objects, but there's insufficient documentation around what standard SQL functionality is supported.

According to the documentation, Amazon S3 Select supports AVG, COUNT, MAX, MIN, and SUM. But when I run any aggregate query I get an error like

botocore.exceptions.ClientError: An error occurred (UnsupportedSqlOperation) when calling the SelectObjectContent operation: Unsupported SQL operation GROUP BY. Please check the service documentation for supported operations.

Here's some code to reproduce:

import boto3
client = boto3.client('s3')

response = client.select_object_content(Bucket='my-bucket', Key='object.csv', 
    ExpressionType='SQL', Expression="select ID, count(*) from s3object group by s.ID ",
    InputSerialization = {'CSV':{"FileHeaderInfo": "Use"}, 'CompressionType': 'GZIP'},
    OutputSerialization = {'CSV':{}})

event_stream = response['Payload']
with open('output', 'wb') as f:
    for event in event_stream:
        if 'Records' in event:
            data = event['Records']['Payload']
            f.write(data)
Weiler answered 7/5, 2018 at 16:15 Comment(1)
If possibly supports the Aggregate functions without a GROUP BY. This would aggregate all the results into just one line. (I haven't tried it.)Embryologist
W
9

Per John's comment, aggregate functions are only supported without a GROUP BY.

Here are a few caveats and pointers

  • You can obviously filter the set you want to perform your aggregation across, using a WHERE or LIMIT clause.
  • Everything is a string, so any numeric operation requires a CAST, e.g. sum(cast(my_col as float))
  • Mixed results using MAX and MIN on interpreting numerical columns as strings, i.e. without a cast; it initially succeeded but after further processing I couldn't reproduce that success
  • Can't perform aggregate functions in the same query as non-aggregate functions, and the error message is cryptic: A column name or a path provided in the query expression does not exist
Weiler answered 8/5, 2018 at 16:34 Comment(2)
Did you find a way to mimic a group by?Rosaliarosalie
No, I don't think it exists. Stream the data into an aggregation routine, or just use Athena.Weiler

© 2022 - 2024 — McMap. All rights reserved.