Partition Kinesis firehose S3 records by event time

Firehose->S3 uses the current date as a prefix when creating keys in S3, so the data is partitioned by the time the record is written. My Firehose stream contains events which have a specific event time.

Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?

The event time could be in the partition key or I could use a Lambda function to parse it from the record.

Skep answered 9/2, 2017 at 6:22 Comment(0)

Kinesis Firehose doesn't (yet) allow clients to control how the date portion of the final S3 object keys is generated.

Your only option is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all the files written in the last hour and publishes them to the correct S3 destinations.
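
As a rough illustration of such a post-processing step, here is a minimal sketch in Python with boto3 (a standalone script rather than an EMR job). It assumes newline-delimited JSON records that carry an ISO-8601 event_time field; the bucket name and prefixes are hypothetical:

```python
import json
from datetime import datetime

import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")

BUCKET = "my-firehose-bucket"          # hypothetical bucket name
ARRIVAL_PREFIX = "raw/2017/02/09/06/"  # arrival hour written by Firehose's default prefix
EVENT_PREFIX = "by-event-time/"        # destination, partitioned by event time

def repartition_hour():
    """Read every object Firehose wrote for one arrival hour and
    rewrite its records under event-time-based key prefixes."""
    paginator = s3.get_paginator("list_objects_v2")
    records_by_hour = {}  # event-hour prefix -> list of raw record lines

    for page in paginator.paginate(Bucket=BUCKET, Prefix=ARRIVAL_PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            for line in body.decode("utf-8").splitlines():
                if not line.strip():
                    continue
                event = json.loads(line)
                # assumes each record carries an ISO-8601 "event_time" field
                ts = datetime.fromisoformat(event["event_time"].replace("Z", "+00:00"))
                hour_prefix = ts.strftime(f"{EVENT_PREFIX}%Y/%m/%d/%H/")
                records_by_hour.setdefault(hour_prefix, []).append(line)

    for hour_prefix, lines in records_by_hour.items():
        s3.put_object(
            Bucket=BUCKET,
            Key=hour_prefix + "part-0000.json",
            Body=("\n".join(lines) + "\n").encode("utf-8"),
        )

if __name__ == "__main__":
    repartition_hour()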

Ducharme answered 13/2, 2017 at 18:33 Comment(0)

This isn't an answer to the question, but I'd like to explain a little of the idea behind storing records according to event arrival time.

First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming. One can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number which identifies a position in the stream; by specifying this number, one can start reading the stream from a certain event.

Now back to the default S3 Firehose setup... Since the capacity of a Kinesis stream is quite limited, one most probably needs to store the data from Kinesis somewhere to analyze it later. The Firehose-to-S3 setup does this right out of the box: it just stores raw data from the stream to S3 buckets. But logically this data is still the same stream of records, and to be able to reliably consume (read) this stream one needs those sequential numbers for checkpoints. And those numbers are the records' arrival times.

What if I want to read records by creation time? It looks like the proper way to accomplish this is to read the S3 stream sequentially, dump it into some [time series] database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some bunches of events while reading the S3 (stream). So I would not suggest reordering the S3 objects at all.

Disseminate answered 20/12, 2017 at 14:48 Comment(0)

You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.

We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.

First, you stream into an Athena table, events, which can optionally be partitioned by arrival time.

Then, you define another Athena table, say, events_by_event_time, which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.

Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and repartitions them into events_by_event_time. Your events are now partitioned by event_time without requiring EMR, Data Pipeline, or any other infrastructure.
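
As a rough sketch of that scheduled step, run from a small boto3 script: the database, table and column names and the result bucket below are illustrative, and the exact SELECT list depends on your schema (Athena expects the destination table's partition column last):

```python
import boto3  # assumes AWS credentials and permission to run Athena queries

athena = boto3.client("athena")

# Hypothetical names; substitute your own database, tables, and result bucket.
DATABASE = "analytics"
RESULT_LOCATION = "s3://my-athena-results/repartition/"

# Copies one arrival-time partition of `events` into the table partitioned by
# event time. The partition column (event_time) is listed last in the SELECT.
QUERY = """
INSERT INTO events_by_event_time
SELECT payload, source, event_time
FROM events
WHERE arrival_date = '2021-06-14'
"""

def run_repartition_query() -> str:
    """Start the INSERT INTO query and return its execution id."""
    response = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULT_LOCATION},
    )
    return response["QueryExecutionId"]

if __name__ == "__main__":
    print(run_repartition_query())
```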

You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.

I actually wrote more about this in a blog post here.

Shaven answered 14/6, 2021 at 18:9 Comment(0)

For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects

https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html

Matsu answered 9/9, 2020 at 7:2 Comment(1)
This does not answer the question. The question is about "event time", i.e., a time field in the event. Firehose only supports "processing time"; see: "Kinesis Data Firehose uses the approximate arrival timestamp of the oldest record that's contained in the Amazon S3 object being written." – Durban

AWS started offering "Dynamic Partitioning" in Aug 2021:

Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.

https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
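
As a rough sketch of what enabling this can look like when creating a delivery stream with boto3, using inline JQ parsing to derive an hour key from a hypothetical event_time field (the names, ARNs, and prefix layout are all illustrative):

```python
import boto3  # assumes AWS credentials, an S3 bucket, and an IAM role already exist

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="events-by-event-time",  # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # hypothetical
        "BucketARN": "arn:aws:s3:::my-event-bucket",                 # hypothetical
        "DynamicPartitioningConfiguration": {"Enabled": True},
        # Keys extracted below become available to the prefix expression.
        "Prefix": "events/!{partitionKeyFromQuery:event_hour}/",
        "ErrorOutputPrefix": "errors/",
        # Dynamic partitioning requires a larger buffer than the default.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "MetadataExtraction",
                    "Parameters": [
                        # JQ expression over each JSON record; assumes an
                        # ISO-8601 event_time like "2021-12-10T09:46:00Z",
                        # whose first 13 characters give "YYYY-MM-DDTHH".
                        {"ParameterName": "MetadataExtractionQuery",
                         "ParameterValue": "{event_hour: .event_time[0:13]}"},
                        {"ParameterName": "JsonParsingEngine",
                         "ParameterValue": "JQ-1.6"},
                    ],
                },
            ],
        },
    },
)
```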

Watersick answered 10/12, 2021 at 9:46 Comment(0)

Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function which takes your records, processes them, attaches the partition key, and returns them to Firehose. You would also have to change the Firehose stream to enable this partitioning and define your custom partition key/prefix/suffix.
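
A minimal sketch of such a transformation Lambda, assuming base64-encoded JSON records with an ISO-8601 event_time field; the partition-key name event_hour is illustrative and must match the delivery stream's prefix expression (e.g. !{partitionKeyFromLambda:event_hour}):

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler that attaches a partition key
    derived from each record's event time."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # assumes event_time looks like "2023-01-05T14:47:00Z";
        # the first 13 characters give "YYYY-MM-DDTHH"
        event_hour = payload["event_time"][:13]
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": record["data"],  # pass the payload through unchanged
            "metadata": {"partitionKeys": {"event_hour": event_hour}},
        })
    return {"records": output}
```

The record data is passed through unchanged here; only the partitionKeys metadata drives which S3 prefix Firehose writes to.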

Caterwaul answered 5/1, 2023 at 14:47 Comment(0)
