I have set up an ETL pipeline in AWS as follows:
input_rawdata -> S3 -> Lambda -> trigger Spark ETL script (via AWS Glue) -> output (S3, Parquet files)
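For context, the Lambda is roughly along these lines; the job name and argument key are just placeholders, not my exact setup:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Fired by the S3 put notification; hand the new object's location to the Glue job
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="my-etl-job",  # placeholder job name
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
```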
My question: assuming the above is the initial load of the data, how do I set things up to run incremental batches that arrive every day (or every hour) and add new rows or update existing records?
a.) How do I keep appending to the same S3 Parquet files, so that subsequent Presto queries return the latest data? (rough sketch of what I mean after this list)
b.) How do I handle duplicate records, so that queries pick up the record with the latest timestamp?
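To make (a) and (b) concrete, this is the kind of logic I imagine in the Glue/Spark script; the paths and the id/updated_at columns are just examples, not my real schema:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("incremental-etl").getOrCreate()

# New batch from the latest run (path and format are examples; assume it ends
# up with the same columns, in the same order, as the existing output)
new_batch = spark.read.json("s3://my-bucket/raw/2019-01-01/")

# Output written by previous runs
existing = spark.read.parquet("s3://my-bucket/curated/table/")

# Combine old and new, then keep only the latest row per business key
combined = existing.union(new_batch)
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
deduped = (combined
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Write to a fresh location rather than appending: a plain .mode("append")
# would keep the old versions of updated records around, and Spark will
# typically refuse to overwrite a path it is also reading from.
deduped.write.mode("overwrite").parquet("s3://my-bucket/curated/table_v2/")
```

The fresh output location is only there to avoid the overwrite-while-reading restriction; how to then point Presto at the latest data is part of what I'm unsure about.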
In the Spark script, do I need to create a Hive external table with S3 as the source and then query that table from Presto?
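i.e. is something along these lines (database/table/column names and the location are placeholders) the intended way to expose the output to Presto via the metastore / Glue Data Catalog?

```python
# Register the Parquet output as an external table so the metastore
# (Glue Data Catalog in my case) knows about it; assumes the SparkSession
# was created with Hive support enabled.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS curated_db.my_table (
        id         STRING,
        updated_at TIMESTAMP,
        payload    STRING
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/curated/table/'
""")
```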
Thanks for any input.