Concatenate s3 files when using AWS Firehose

I have an AWS Kinesis Firehose stream putting data in s3 with the following config:

S3 buffer size (MB)*       2
S3 buffer interval (sec)*  60

Everything works fine. The only problem is that Firehose creates one s3 file for every chunk of data. (In my case, one file every minute, as in the screenshot). Over time, this is a lot of files: 1440 files per day, 525k files per year.

[Screenshot: S3 bucket listing showing one Firehose output file per minute]

This is hard to manage (for example, copying the bucket to another one means copying every single file one by one, which takes a long time).

Two questions:

  • Is there a way to tell Kinesis to group/concatenate old files together? (E.g., files older than 24 hours are grouped into one chunk per day.)
  • How is Redshift COPY performance affected when COPYing from a plethora of S3 files versus just a few? I haven't measured this precisely, but in my experience performance with a lot of small files is noticeably worse. From what I can recall, with big files a COPY of about 2M rows takes roughly 1 minute, whereas the same 2M rows spread over lots of small files (~11k files) can take up to 30 minutes.

My two main concerns are:

  • Better redshift COPY performance (from s3)
  • Easier overall s3 file management (backup, manipulation of any kind)
Hinz answered 28/4, 2016 at 17:9 Comment(0)

The easiest fix is going to be to increase the Firehose buffer size and time limit - you can go up to 15 minutes, which will cut your 1440 files per day down to 96 files a day (unless you hit the buffer size limit first, of course).
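
If you want to change this programmatically rather than through the console, a minimal boto3 sketch might look like the following (the stream name is a placeholder, and it assumes the stream delivers to an Extended S3 destination):

    import boto3

    firehose = boto3.client("firehose")

    # update_destination needs the current version and destination id
    desc = firehose.describe_delivery_stream(
        DeliveryStreamName="my-stream"              # placeholder stream name
    )["DeliveryStreamDescription"]

    # Raise the buffering hints to the maximums so Firehose writes fewer, larger objects
    firehose.update_destination(
        DeliveryStreamName="my-stream",
        CurrentDeliveryStreamVersionId=desc["VersionId"],
        DestinationId=desc["Destinations"][0]["DestinationId"],
        ExtendedS3DestinationUpdate={
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900}
        },
    )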

Beyond that, there is nothing in Kinesis that will concatenate the files for you, but you could set up an S3 event notification that fires each time Firehose writes a new file, and run some of your own code (on EC2, or go serverless with Lambda) to do the concatenation yourself.
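
The answer doesn't include code, but a rough sketch of that concatenation step could look like this (bucket, prefixes, and the hourly grouping are assumptions; for very large outputs you would want multipart uploads rather than holding everything in memory):

    import boto3

    s3 = boto3.client("s3")

    def concatenate_prefix(bucket, src_prefix, dest_key):
        """Read every small object under src_prefix and write them back out as one object."""
        parts = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                parts.append(body)
        s3.put_object(Bucket=bucket, Key=dest_key, Body=b"".join(parts))

    # e.g. collapse one hour of Firehose output (default YYYY/MM/DD/HH prefix) into one object
    concatenate_prefix("my-firehose-bucket", "2016/04/28/17/", "hourly/2016-04-28-17.raw")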

Can't comment on the Redshift loading performance, but I suspect it's not a huge deal - and if it is, or becomes one, I expect AWS will do something about it, since this is the usage pattern they set up.

Adversaria answered 28/4, 2016 at 20:48 Comment(0)

Kinesis Firehose is designed to allow near-real-time processing of events. It is optimized for such use cases, which is why its settings favor smaller, more frequent files. This way you get the data into Redshift for querying faster, or more frequent invocations of Lambda functions on the smaller files.

It is very common for customers of the service to also prepare the data for longer historical queries. Even though it is possible to run these long-term queries on Redshift, it might make sense to use EMR for them. You can then keep your Redshift cluster tuned for the more popular recent events (for example, a "hot" cluster holding 3 months of data on SSD, and a "cold" cluster holding 1 year on HDD).

It makes sense to take the smaller (possibly uncompressed) files from the Firehose output S3 bucket and transform them into a format better optimized for EMR (Hadoop/Spark/Presto). You can use tools such as S3DistCp, or a similar job that takes the smaller files, concatenates them, and converts them to Parquet.
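
As an illustration only (the cluster id, buckets, and grouping pattern are all assumptions), an S3DistCp step that concatenates a day's worth of small Firehose objects could be submitted to an existing EMR cluster like this; converting to Parquet would be a separate Spark or Hive job rather than S3DistCp itself:

    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster id
        Steps=[{
            "Name": "Concatenate Firehose output for 2016/04/28",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src=s3://my-firehose-bucket/2016/04/28/",
                    "--dest=s3://my-archive-bucket/2016/04/28/",
                    "--groupBy=.*(2016/04/28).*",   # group the whole day's files together
                    "--targetSize=128",             # aim for ~128 MB output files
                    "--outputCodec=gz",
                ],
            },
        }],
    )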

Regarding optimization of the Redshift COPY, there is a balance between the time you spend aggregating events and the time it takes to COPY them. It is true that larger files are better when you copy to Redshift, as there is a small per-file overhead. But on the other hand, if you COPY the data only every 15 minutes, you may have "quiet" periods in which you are not utilizing the network or the cluster's capacity to ingest events between these COPY commands. You should find the balance that is right for the business (how fresh do you need your events to be?) and the technical aspects (how many events can you ingest per hour/day into your Redshift cluster?).
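
For illustration, a minimal sketch (connection details, table, prefix, and IAM role are placeholders, and JSON records are assumed): a single COPY pointed at a common S3 prefix loads every object under that prefix in one command, however many small files there are.

    import psycopg2  # or any Postgres-compatible driver

    conn = psycopg2.connect(
        host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="mydb", user="loader", password="...",
    )

    copy_sql = """
        COPY events
        FROM 's3://my-firehose-bucket/2016/04/28/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """

    # psycopg2's connection context manager commits the transaction on success
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)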

Autocratic answered 1/5, 2016 at 2:59 Comment(0)

I faced a similar problem where the number of files was too large to handle. Here's a solution that can be useful:

i) Increase the buffer size to the maximum (128 MB).

ii) Increase the buffer interval to the maximum (900 seconds).

iii) Instead of publishing one record at a time, club multiple records into one (separated by newlines) to make a single Kinesis Firehose record (the maximum size of a Firehose record is 1,000 KB).

iv) Also, group multiple Kinesis Firehose records into a batch and then do a batch put (http://docs.aws.amazon.com/firehose/latest/APIReference/API_PutRecordBatch.html); there is a sketch of iii) and iv) below.

This way, each S3 object that gets published contains as many batched records as the Kinesis Firehose buffer can hold.
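
A minimal boto3 sketch of steps iii) and iv) (the stream name and the record source are made up; real code should also check FailedPutCount in the response and retry failed records):

    import json
    import boto3

    firehose = boto3.client("firehose")
    STREAM = "my-delivery-stream"   # placeholder stream name

    def club(records, max_bytes=1000 * 1024):
        """Step iii): pack newline-separated JSON records into blobs under the ~1,000 KB record limit."""
        blobs, blob = [], b""
        for rec in records:
            line = (json.dumps(rec) + "\n").encode()
            if blob and len(blob) + len(line) > max_bytes:
                blobs.append(blob)
                blob = b""
            blob += line
        if blob:
            blobs.append(blob)
        return blobs

    def put_batches(blobs, max_records=500, max_batch_bytes=4 * 1024 * 1024):
        """Step iv): send clubbed blobs with PutRecordBatch, respecting the 500-record / 4 MB call limits."""
        batch, batch_bytes = [], 0
        for blob in blobs:
            if batch and (len(batch) == max_records or batch_bytes + len(blob) > max_batch_bytes):
                firehose.put_record_batch(DeliveryStreamName=STREAM, Records=batch)
                batch, batch_bytes = [], 0
            batch.append({"Data": blob})
            batch_bytes += len(blob)
        if batch:
            firehose.put_record_batch(DeliveryStreamName=STREAM, Records=batch)

    events = [{"id": i, "value": i * i} for i in range(10000)]  # stand-in data
    put_batches(club(events))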

Hope this helps.

Raw answered 19/5, 2016 at 8:35 Comment(2)
Yes. Unfortunately, in my case I want the records to be delivered quickly. I cannot afford to wait 900 secs because I want fresh data in semi-realtime. So I'm considering a solution where I load all the data into Redshift, then unload everything at once into a single (or just a few) S3 files. – Hinz
One more idea suiting the use case: i) Set up an AWS Lambda trigger on your S3 bucket. ii) Keep the Kinesis Firehose stream settings as you want them. iii) Consequently, there will be too many files, as stated in the question. iv) Now, whenever an object is published to the bucket, the Lambda function triggers, clubs multiple files into one, and puts the result into a different bucket. If you don't want to use a different bucket, you can write it back to the same one under a different prefix so that it doesn't trigger the Lambda function again. This would be simpler. – Raw

I really like this solution by @psychorama. In fact, I could do the same in my project, where I was about to give up on the Firehose approach. Since I am reading data from DynamoDB and putting it into Kinesis Firehose, I can club a whole batch of DynamoDB data into one record (within the size limit) and then send it to Firehose. But I'm not sure how easy this would be to implement. Maybe in the 2nd version.

Abattoir answered 5/1, 2023 at 14:44 Comment(0)
