ETL pipeline in AWS with S3 as data lake: how to handle incremental updates

I have set up an ETL pipeline in AWS as follows:

input_rawdata -> S3 -> Lambda -> trigger Spark ETL script (via AWS Glue) -> output (S3, Parquet files)

My question: assuming the above is the initial load of the data, how do I set up incremental batches that arrive every day (or every hour) and add new rows or update existing records?

a) How do I keep appending to the same S3 Parquet files, so that subsequent Presto queries return the latest data?

b) How do I handle duplicate records so queries return only the row with the latest timestamp?

In the Spark script, do I need to create a Hive external table backed by S3 and use it in Presto?

Thanks for any input.

Reubenreuchlin answered 6/9, 2017 at 4:23

Apache Hudi would be a great tool for that: https://hudi.incubator.apache.org/ It lets you do upserts to tables stored in S3 in Parquet format, and Presto is compatible with it. For example, on EMR 5.28 Hudi comes pre-installed, and you can query Hudi datasets with Hive, Spark, and Presto.
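
For illustration, a Hudi upsert from a Spark job could look roughly like the sketch below; the table name, the record_id / updated_at / event_date columns, and the S3 paths are made-up placeholders.

    # Sketch of a Hudi upsert from Spark; table name, key/timestamp/partition
    # columns and S3 paths are assumptions for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-upsert-example").getOrCreate()

    # Incremental batch containing new and updated rows (hypothetical schema).
    updates_df = spark.read.parquet("s3://my-bucket/raw/2019-12-21/")

    hudi_options = {
        "hoodie.table.name": "events",                              # assumed table name
        "hoodie.datasource.write.recordkey.field": "record_id",     # assumed primary key
        "hoodie.datasource.write.precombine.field": "updated_at",   # latest timestamp wins
        "hoodie.datasource.write.partitionpath.field": "event_date",
        "hoodie.datasource.write.operation": "upsert",
    }

    # Hudi merges these rows into the existing dataset instead of appending duplicates.
    (updates_df.write
        .format("org.apache.hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/hudi/events/"))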

Rexanna answered 21/12, 2019 at 21:25

You can enable a job bookmark on the Glue ETL job.

The bookmark keeps track of which S3 files have been processed, so once your historical load is done and you drop new files into S3, only the new files are processed by the ETL job, and they are then marked as processed internally.

This is how you can handle incremental data.
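
As a rough sketch (database, table, and output path below are placeholders), a bookmarked Glue job needs to be created with --job-bookmark-option job-bookmark-enable, and the script has to call Job.init / job.commit and tag each source with a transformation_ctx:

    # Minimal Glue ETL skeleton showing the pieces job bookmarks rely on;
    # database/table names and the output path are placeholders.
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # picks up the bookmark state for this job

    # transformation_ctx is what Glue uses to remember which S3 files were already processed.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db",
        table_name="input_rawdata",
        transformation_ctx="source",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/events/"},
        format="parquet",
        transformation_ctx="sink",
    )

    job.commit()  # persists the bookmark so the next run only sees new files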

Dannie answered 21/11, 2017 at 14:34

First of all, don't try to append to the files that already exist in S3; instead, write new files for each batch, each containing multiple records.

To query S3 for analytics, you can first describe the data in AWS Athena, whose data catalog is Hive-metastore compatible.

To remove duplicates, you can write a SQL-like query through Athena that returns only the unique (latest) record per key.
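
For example, a deduplication query of the following shape keeps only the row with the latest timestamp per key; the record_id and updated_at columns, the table name, and the paths are assumptions, and the same SQL can be run directly in Athena or, as sketched here, through Spark SQL.

    # Deduplication sketch: keep only the latest row per key, using a window function.
    # Column names (record_id, updated_at), table name, and paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup-example").getOrCreate()
    spark.read.parquet("s3://my-bucket/curated/events/").createOrReplaceTempView("events")

    latest = spark.sql("""
        SELECT *
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY record_id
                                      ORDER BY updated_at DESC) AS rn
            FROM events
        ) AS ranked
        WHERE rn = 1
    """)

    latest.drop("rn").write.mode("overwrite").parquet("s3://my-bucket/curated/events_deduped/")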

Schutz answered 6/9, 2017 at 9:32

Since you are already using Lambda and Glue, you could use Kinesis and the KCL to capture your data as a stream, or use Spark Streaming, since you already have a Spark script. Either option gives you the incremental input you are looking for, and there is less risk of data corruption when streaming in real time into an existing data store.
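
As a rough illustration, a producer (for example your existing Lambda) could push each incoming record onto a Kinesis stream with boto3; the stream name and record layout below are assumptions.

    # Sketch of pushing incoming records onto a Kinesis stream (e.g. from Lambda).
    # Stream name and record layout are assumptions for illustration.
    import json

    import boto3

    kinesis = boto3.client("kinesis")

    def send_record(record: dict) -> None:
        kinesis.put_record(
            StreamName="rawdata-stream",              # assumed stream name
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=str(record["record_id"]),    # assumed key field
        )

    send_record({"record_id": 42, "updated_at": "2017-11-27T09:07:00Z", "value": 1.5})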

Then you can pass the data through your ETL process with Glue. You can schedule or chain your ETL jobs in Glue, and it can load the transformed data into an S3 bucket. Glue is batch oriented, but its minimum scheduling interval is 5 minutes, so with Kinesis performing the first steps and then handing the data off to Glue you can still get incremental updates. AWS's reference material on ETL architecture has additional resources and ideas.
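
One way to schedule the Glue job is a scheduled trigger created with boto3; the job name, trigger name, and cron expression below are placeholders.

    # Sketch: create a scheduled Glue trigger that runs an ETL job every hour.
    # Job name, trigger name, and cron expression are placeholders.
    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="hourly-incremental-etl",
        Type="SCHEDULED",
        Schedule="cron(0 * * * ? *)",   # every hour, on the hour
        Actions=[{"JobName": "incremental-etl-job"}],
        StartOnCreation=True,
    )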

For any duplicate data, you can run a SQL-ish deduplication query on your completed datasets if necessary.

Penley answered 27/11, 2017 at 9:7

You can now use Delta Lake to do upserts, appends, and deletes on your data with Spark. It writes data in the "delta" format (Parquet data files plus a transaction log of metadata files), and you can even restore or query the data as of a point in time. Be careful: it recently wasn't fully working with Athena/Presto in the open-source version, since you needed to generate a manifest file (but a fix was on its way).
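
A minimal sketch of a Delta merge (upsert), assuming a record_id key column and hypothetical S3 paths (and that the delta-core package is on the Spark classpath):

    # Sketch of a Delta Lake merge (upsert); the table path and the key column are assumptions.
    # Assumes the delta-core package is available to the Spark session.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("delta-upsert-example")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # Incremental batch with new and updated rows (hypothetical path).
    updates_df = spark.read.parquet("s3://my-bucket/raw/2019-12-30/")

    target = DeltaTable.forPath(spark, "s3://my-bucket/delta/events/")

    # Update matching rows and insert new ones in a single atomic operation.
    (target.alias("t")
        .merge(updates_df.alias("u"), "t.record_id = u.record_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())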

Linsk answered 30/12, 2019 at 16:19
