Athena with partition projection returns no results
Asked Answered
C

1

6

While doing a proof of concept for our new ETL pipeline, I figured out some problems using partition projection in AWS Athena. Created the following table in glue:

CREATE EXTERNAL TABLE `test_interactions`(
  `id` string, 
  `created_at` timestamp, 
  `created_by` string, 
  `type` string, 
  `entity` string)
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'projection.dt.format'='yyyy-MM-dd-HH', 
  'projection.dt.interval'='1', 
  'projection.dt.interval.unit'='HOURS', 
  'projection.dt.range'='2020-12-01-00,NOW', 
  'projection.dt.type'='date', 
  'projection.enabled'='true', 
  'storage.location.template'='s3://test-aggs/test-interactions/dt=${dt}') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://test-aggs/test-interactions/'
TBLPROPERTIES (
  'classification'='parquet')

On S3, there are the matching .parquet files coming from Kinesis Data Firehose:

test-aggs/test-interactions/dt=2020-12-03-22/file1.parquet
test-aggs/test-interactions/dt=2020-12-03-22/file2.parquet

Trying to query the data by:

SELECT * FROM "test_aggs"."test_interactions"
WHERE dt >= '2020-12-02-00' 
AND dt < '2020-12-04-01'

or by

SELECT * FROM "test_aggs"."test_interactions"
WHERE dt = '2020-12-03-22' 

returns zero results.

Running

MSCK REPAIR TABLE;

makes the data queryable but for using this slow command, I don't have to enable partition projection.

Any ideas why this is not working?

Cheers!

Calpac answered 4/12, 2020 at 0:12 Comment(4)
Athena cannot detect new partitions by itself. You may use scheduled Glue Crawlers to add new partitions. – Tonguelashing
@Tonguelashing it can. Check out partition projection at the AWS docs πŸ™Œ But you still need the matching hive metadata for querying the data outside of athena. – Calpac
Did you manage to solve the problem @maxgr? – Tonguelashing
@Tonguelashing Not now. Contacted the AWS business support with this issue, will give an update as soon as I've solved this ✌️ – Calpac
C
3

Solved the issue. The problem was that i've added the projection configuration under SERDE PROPERTIES and not under TBLPROPERTIES.

CREATE EXTERNAL TABLE `test_interactions`(
  `id` string, 
  `created_at` timestamp, 
  `created_by` string, 
  `type` string, 
  `entity` string)
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://test-aggs/test-interactions/'
TBLPROPERTIES (
  'classification'='parquet', 
  'projection.dt.format'='yyyy-MM-dd-HH', 
  'projection.dt.interval'='1', 
  'projection.dt.interval.unit'='HOURS', 
  'projection.dt.range'='2020-12-01-00,NOW', 
  'projection.dt.type'='date', 
  'projection.enabled'='true', 
  'storage.location.template'='s3://test-aggs/test-interactions/dt=${dt}')
Calpac answered 5/12, 2020 at 12:11 Comment(0)

© 2022 - 2024 β€” McMap. All rights reserved.