How to change column names of autodetected partitions created by Glue Crawler?
Asked Answered
A

2

8

I have bucket which is used as destination for a Kinesis Firehose stream.

Firehose automatically creates date-based prefixes on that bucket using the yyyy/mm/dd/HH format.

Then I created a crawler that will search for data into this bucket and configured it as follow:

Crawler configuration

After running the crawler, it creates a table with the following schema:

| #   | Column name   | Data type | Key           |
| --- | -----------   | --------- | ------------- |
| 1   | numberissues  | int       |               |
| 2   | group         | string    |               |
| 3   | createdat     | string    |               |
| 4   | companyunitid | string    |               |
| 5   | partition_0   | string    | Partition (0) |
| 6   | partition_1   | string    | Partition (1) |
| 7   | partition_2   | string    | Partition (2) |
| 8   | partition_3   | string    | Partition (3) |

If I rename the partition-* to their right counterparts year, month, day and hour, the table is ready for me to use.

However, if the crawler runs again, the schema revets the column names to the original partition-*.

I know this would work for Hive partition schemas year=2018/month=04..., but I want to know if it's possible to "hint" Glue about the partition field names.

Another alternative would be trying to change the Firehose prefixing, but I couldn't find anything that suggests this is even possible.

Arielle answered 6/4, 2018 at 19:22 Comment(1)
I understand your situation. Unfortunately Glue Follows Hive style of partitioning but kinesis firehose Doesn't write in such format. At this point in time i think we cant do any thing.Lighting
P
15

In this case you can set the "Ignore the change and don't update the data catalog" option.

Then you can rename the columns. This will allow the crawler to detect new partitions on the next run but keep therenamed names.

Puentes answered 10/4, 2018 at 23:51 Comment(1)
It seems so obvious now that this would work that I'm embarassed. Thanks a lot, @RicardoMayerhoferArielle
V
2

It is now possible to specify a custom format for the S3 prefixes written by Firehose. To comply with the Hive style of partitions, you can use this syntax in your prefix:

beginning_of_prefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/

Example output:

beginning_of_prefix/year=2021/month=09/day=03/hour=16/

This will enable your Glue crawler to recognize the names of the partitions.

In a bit more details, the !{namespace:value} syntax AWS introduced allows accessing the timestamp that Firehose uses for partitioning and printing it into the prefix. This is done by specifying timestamp as a namespace, and a valid Java DateTimeFormatter string as the value. Note that:

When evaluating timestamps, Kinesis Data Firehose uses the approximate arrival timestamp of the oldest record that's contained in the Amazon S3 object being written.

And that:

If you specify a prefix that doesn't contain a timestamp namespace expression, Kinesis Data Firehose appends the expression !{timestamp:yyyy/MM/dd/HH/}to the value in the Prefix field.

(So if you don't use the timestamp namespace, the old way of partitioning is used)

Other namespaces also enable partitioning by firehose error type for the error output prefix for example.

Source

Docs

Viscountess answered 3/9, 2021 at 15:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.