I have an AWS Kinesis Firehose stream putting data into S3 with the following config:
- S3 buffer size (MB)*: 2
- S3 buffer interval (sec)*: 60
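For context, the equivalent buffering configuration when creating the stream through boto3 would look roughly like this (stream name, bucket ARN and role ARN are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-stream",  # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-firehose-bucket",                      # placeholder
        "BufferingHints": {
            "SizeInMBs": 2,           # S3 buffer size (MB)
            "IntervalInSeconds": 60,  # S3 buffer interval (sec)
        },
    },
)
```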
Everything works fine. The only problem is that Firehose creates one S3 file for every chunk of data (in my case, one file every minute, as in the screenshot). Over time this adds up to a lot of files: 1,440 files per day, about 525,000 files per year.
This is hard to manage: for example, if I want to copy the bucket to another one, I would have to copy every single file one by one, which takes a long time.
Two questions:
- Is there a way to tell Kinesis to group/concatenate old files together? (E.g., files older than 24 hours get grouped into one-day chunks. See the sketch after this list for the kind of batch job I have in mind if Firehose can't do it.)
- How is Redshift COPY performance affected when COPYing from a plethora of S3 files versus just a few? I haven't measured this precisely, but in my experience performance with lots of small files is noticeably worse. From what I can recall, with big files a COPY of about 2M rows takes roughly a minute; the same 2M rows spread over lots of small files (~11k files) takes up to 30 minutes. (My COPY invocation is roughly the one sketched at the end of this post.)
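Concretely, this is roughly the kind of daily compaction job I'm picturing (bucket, prefix and destination key are placeholders; for large volumes you'd stream or use a multipart upload rather than concatenating in memory):

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")

BUCKET = "my-firehose-bucket"      # placeholder
SOURCE_PREFIX = "2016/01/01/"      # one Firehose date prefix, placeholder
DEST_KEY = "daily/2016-01-01.log"  # the combined object, placeholder

cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

# Collect the small Firehose chunks that are older than the cutoff.
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            keys.append(obj["Key"])

# Concatenate their contents and write a single combined object back to S3.
combined = b"".join(
    s3.get_object(Bucket=BUCKET, Key=key)["Body"].read() for key in sorted(keys)
)
s3.put_object(Bucket=BUCKET, Key=DEST_KEY, Body=combined)

# Once the combined object is verified, the originals could be deleted, e.g.:
# s3.delete_objects(Bucket=BUCKET, Delete={"Objects": [{"Key": k} for k in keys]})
```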
My two main concerns are:
- Better Redshift COPY performance (from S3)
- Easier overall S3 file management (backup, manipulation of any kind)
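For reference, my COPY invocation is roughly the following, loading every object under a date prefix; connection details, table name, IAM role and format options are placeholders:

```python
import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="REPLACE_ME",
)

# COPY picks up every object under the given S3 prefix, i.e. all the small
# Firehose chunks for that day. Table, role and format are placeholders.
copy_sql = """
    COPY my_table
    FROM 's3://my-firehose-bucket/2016/01/01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
```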