Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?

I have Firehose streaming data into S3 with the longest possible buffer interval (15 minutes or 128 MB), so I end up with 96 data files per day. I want to aggregate all of it into a single daily data file for the fastest possible reads later in Spark (EMR).

I created a solution where a Lambda function is invoked whenever Firehose delivers a new file to S3. The function reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file from the destination bucket (creating it if it does not exist yet), decodes both response bodies to strings, appends the new data, and writes the result back to the destination bucket with s3.PutObject (overwriting the previous aggregated file).
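
For reference, a rough sketch of that function in Python (boto3) is below; the bucket names and the daily key scheme are placeholders, not my actual setup:

    # Minimal sketch of the approach described above (Python/boto3).
    # Bucket names and the daily key naming are placeholders; the real
    # function is triggered by the S3 object-created event from Firehose.
    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3")

    SOURCE_BUCKET = "firehose-output"    # placeholder
    DEST_BUCKET = "daily-aggregates"     # placeholder

    def handler(event, context):
        # Key of the object Firehose just delivered
        new_key = event["Records"][0]["s3"]["object"]["key"]
        daily_key = datetime.now(timezone.utc).strftime("%Y-%m-%d") + ".json"

        new_part = s3.get_object(Bucket=SOURCE_BUCKET, Key=new_key)["Body"].read()

        try:
            existing = s3.get_object(Bucket=DEST_BUCKET, Key=daily_key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            existing = b""

        # Both objects are held in memory at once, which is what eventually
        # exceeds the Lambda memory limit as the daily file grows.
        s3.put_object(Bucket=DEST_BUCKET, Key=daily_key, Body=existing + new_part)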

The problem is that once the aggregated file reaches 150+ MB, the Lambda function hits its ~1500 MB memory limit while reading the two files and fails.

Currently I have a minimal amount of data, a few hundred MB per day, but it will grow exponentially in the future. It seems strange that Lambda has such low limits and that they are already reached with such small files.

What are the alternatives for concatenating S3 data, ideally triggered by an S3 object-created event or by a scheduled job, for example one that runs daily?

Hypha answered 21/9, 2016 at 8:1 Comment(0)

I would reconsider whether you actually want to do this:

  • The S3 costs will go up.
  • The pipeline complexity will go up.
  • The latency from Firehose input to Spark input will go up.
  • If a single file injection into Spark fails (and this will happen in a distributed system), you have to shuffle around a huge file, possibly slice it if the injection is not atomic, and upload it again, all of which could take a very long time for large amounts of data. At that point you may find that the time to recover is so long that you have to postpone the next injection…

Instead, unless it is impossible in your situation, make the Firehose files as small as possible and send them to Spark immediately:

  • You can archive S3 objects almost immediately, lowering costs.
  • Data is available in Spark as soon as possible.
  • If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
  • The only added cost is a tiny amount of latency from establishing TCP connections and authentication.

I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:

  • A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
  • An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this (see the sketch after this list).
  • A live Spark/EMR/underlying data service instance ready to receive the data.
  • In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
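
Purely as an illustration (again, I'm not a Spark expert), that injector/transformer step could be a small PySpark job that reads the small Firehose JSON objects and rewrites them as Parquet. The bucket names, prefixes and partitioning below are assumptions, not part of your setup:

    # Hypothetical injector step as a PySpark job: read the small Firehose
    # JSON objects under one prefix and rewrite them as Parquet so later
    # reads are faster. Paths and partitioning are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("firehose-to-parquet").getOrCreate()

    # Read all small JSON objects under one Firehose prefix (e.g. one day).
    df = spark.read.json("s3://firehose-output/2016/09/21/*")

    # Repartition to a handful of reasonably sized files and write as Parquet.
    (df.repartition(8)
       .write
       .mode("append")
       .parquet("s3://spark-input/events/date=2016-09-21/"))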

Of course, if it is not possible to keep the Spark data ready (but not yet queryable) for a reasonable amount of money, this may not be an option. It may also be that injecting small chunks of data is extremely time consuming, but that seems unlikely for a production-ready system.


If you really need to chunk the data into daily dumps, you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.

Nightshade answered 21/9, 2016 at 8:32 Comment(2)
Well, my goal of creating larger files was based exactly on the fact that we had previously set up Firehose to write a new file every minute (or 1 MB), and reading that data into Spark took a very long time. Using read.json on ~3 GB of data took ~15 s as a single file and about 10 minutes as many small files (I don't remember the instance types at the moment). I want to use Spark for two things: statistics and eventually machine learning. For statistics, I want to filter and group the data and get different counts. I thought that keeping a Spark EMR cluster alive 24/7 would produce higher costs than S3 would. – Hypha
I think I am still missing something from the bigger picture. Since the Lambda solution failed, I thought about creating a scheduled Spark job that reads the previous day's data from S3 and writes it to another bucket in Parquet format (which is faster to read than plain JSON). But I still don't know what you meant by sending the data to Spark immediately. Did you mean Spark Streaming? Would that still mean running an EMR cluster 24/7? I planned to keep all the data in S3 and, for every statistics query, start a new cluster, read in the data, run the calculations and write the results back to S3. – Hypha

You can create a Lambda function that is invoked only once a day using Scheduled Events, and in that function use Upload Part - Copy, which does not require downloading the files to the Lambda function. There is already an example of this in this thread.
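
For illustration, a rough Python (boto3) sketch of such a scheduled, copy-based function is below. The bucket names and key layout are assumptions; note also that every part except the last must be at least 5 MB, which the 15-minute/128 MB Firehose objects satisfy:

    # Sketch of a once-a-day Lambda that stitches the previous day's
    # Firehose objects into a single object using server-side
    # UploadPartCopy, so nothing is downloaded into the function.
    # Bucket names and the key layout are placeholders.
    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")

    SOURCE_BUCKET = "firehose-output"    # placeholder
    DEST_BUCKET = "daily-aggregates"     # placeholder

    def handler(event, context):
        day = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y/%m/%d")
        dest_key = "daily/" + day.replace("/", "-") + ".json"

        upload = s3.create_multipart_upload(Bucket=DEST_BUCKET, Key=dest_key)
        parts = []
        part_number = 1

        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=day):
            for obj in page.get("Contents", []):
                # Server-side copy of each source object into the next part;
                # each part except the last must be at least 5 MB.
                result = s3.upload_part_copy(
                    Bucket=DEST_BUCKET,
                    Key=dest_key,
                    UploadId=upload["UploadId"],
                    PartNumber=part_number,
                    CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
                )
                parts.append({"ETag": result["CopyPartResult"]["ETag"],
                              "PartNumber": part_number})
                part_number += 1

        s3.complete_multipart_upload(
            Bucket=DEST_BUCKET,
            Key=dest_key,
            UploadId=upload["UploadId"],
            MultipartUpload={"Parts": parts},
        )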

Gerhan answered 31/5, 2017 at 12:39 Comment(0)
