I am looking for something in Spark like the AWS Glue "bookmark" feature. I know Spark has checkpointing, which works well for an individual data source. In Glue, a single bookmark can keep track of all the files across the different tables involved in a job.
Is there anything in Spark like Glue's "bookmark" feature that tracks progress at the job level?
You can use Spark Structured Streaming in combination with Trigger.Once() for that.
The stream will run a single micro-batch, which is effectively the same as one batch job, while leveraging the checkpointing capability, which keeps track of the files that have already been processed.
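Here is a minimal PySpark sketch of that idea, not a definitive implementation. Since a checkpoint belongs to a single streaming query, it approximates a job-level bookmark by giving each source its own checkpoint directory under one shared root. The helper names (`checkpoint_path`, `ingest_new_files`, `run_job`), the paths, the Parquet sink, and the layout of the `sources` mapping are all assumptions made for illustration; `spark` is assumed to be an existing `SparkSession`.

```python
import posixpath


def checkpoint_path(checkpoint_root, source_name):
    """Hypothetical helper: one checkpoint directory per source under a shared root."""
    return posixpath.join(checkpoint_root, source_name)


def ingest_new_files(spark, source_name, input_path, output_path,
                     checkpoint_root, schema, file_format="parquet"):
    """Process the files not yet recorded in this source's checkpoint, then stop."""
    stream = (
        spark.readStream
        .format(file_format)
        .schema(schema)  # streaming file sources need an explicit schema by default
        .load(input_path)
    )
    query = (
        stream.writeStream
        .format("parquet")  # assumed sink format, swap in whatever the job writes
        .option("checkpointLocation", checkpoint_path(checkpoint_root, source_name))
        .trigger(once=True)  # PySpark spelling of Trigger.Once()
        .start(output_path)
    )
    query.awaitTermination()  # returns once the single micro-batch has finished
    return query


def run_job(spark, sources, checkpoint_root):
    """Run every source of the job in turn, all sharing one checkpoint root.

    `sources` maps a source name to a dict with the keys
    "input", "output" and "schema" (an assumed layout).
    """
    for name, cfg in sources.items():
        ingest_new_files(spark, name, cfg["input"], cfg["output"],
                         checkpoint_root, cfg["schema"])
```

Running `run_job` again with the same `checkpoint_root` should pick up only files that arrived since the previous run, because each query resumes from its own checkpoint. On recent Spark versions, `trigger(availableNow=True)` may be a better fit than `once=True`; check the documentation for the version you run.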