AWS Athena - duplicate columns due to partitionning
Asked Answered
R

3

12

We have a glue crawler that read avro files in S3 and create a table in glue catalog accordingly. The thing is that we have a column named 'foo' that came from the avro schema and we also have something like 'foo=XXXX' in the s3 bucket path, to have Hive partitions.

What we did not know is that the crawler will then create a table which now has two columns with the same name, thus our issue while querying the table:

HIVE_INVALID_METADATA: Hive metadata for table mytable is invalid: Table descriptor contains duplicate columns

Is there a way to tell glue to map the partition 'foo' to another column name like 'bar' ? That way we would avoid having to reprocess our data by specifying a new partition name in the s3 bucket path..

Or any other suggestions ?

Resplendent answered 10/12, 2019 at 13:47 Comment(2)
How much data in terms of size and S3 objects do you have? Would a simple renaming of foo to bar in S3 prefix be a viable solution?Tranquil
Quite a long time to reprocess, yeah we thought about renaming the s3 partition path but we don't know if the hive metastore behind glue will be in kind of inconsistent state or notResplendent
S
16

Glue Crawlers are pretty terrible, this is just one of the many ways where it creates unusable tables. I think you're better off just creating the tables and partitions with a simple script. Create the table without the foo column, and then write a script that lists your files on S3 do the Glue API calls (BatchCreatePartition), or execute ALTER TABLE … ADD PARTITION … calls in Athena.

Whenever new data is added on S3, just add the new partitions with the API call or Athena query. There is no need to do all the work that Glue Crawlers do if you know when and how data is added. If you don't, you can use S3 notificatons to run Lambda functions that do the Glue API calls instead. Almost all solutions are better than Glue Crawlers.

The beauty of Athena and Glue Catalog is that it's all just metadata, it's very cheap to throw it all away and recreate it. You can also create as many tables as you want that use the same location, to try out different schemas. In your case there is no need to move any objects on S3, you just need a different table and a different mechanism to add partitions to it.

Sweptwing answered 3/4, 2020 at 16:18 Comment(0)
A
3

You can fix this by updating the schema of the glue table and rename the duplicate column:

  1. Open the AWS Glue console.
  2. Choose the table name from the list, and then choose Edit schema.
  3. Choose the column name foo (not the partitioned column foo), enter a new name, and then choose Save.

Reference: Resolve HIVE_INVALID_METADATA error

Apostasy answered 13/4, 2022 at 19:22 Comment(0)
D
1

I was facing the same issue with the AWS glue crawler . I solved this issue by instead directly creating glue tables in the glue catalog using AWS CLI command - aws glue create-table . With this command you can define your own schema and get rid of the duplicate column .Also creating table this way will refresh it everytime the datastore gets updated with new data.

More reference - https://docs.aws.amazon.com/cli/latest/reference/glue/create-table.html

Dulcle answered 10/5, 2023 at 14:35 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.