What is the best way to migrate delta tables created in spark streaming jobs to a new location?

I have two pyspark streaming jobs:

  1. streaming_job_a reads from kafka, writes a dataframe containing the raw data in one column and a timestamp in another column to location A in s3, and creates the unmanaged delta table table_a on top of location A.
  2. streaming_job_b reads from delta table table_a, extracts the raw data into separate columns, writes to location B in s3, and creates the unmanaged delta table table_b. (Both jobs are sketched below.)
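
To make that concrete, the two jobs roughly look like this today (a simplified sketch, not the exact code; the kafka options, parsing logic, column names, and actual s3 paths are placeholders):

from pyspark.sql.functions import col, json_tuple

# streaming_job_a (simplified): raw payload in one column, timestamp in another, written to location A
raw_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<brokers>")  # placeholder
    .option("subscribe", "<topic>")                  # placeholder
    .load()
    .select(
        col("value").cast("string").alias("raw_data"),
        col("timestamp").alias("event_timestamp"),
    )
)

(
    raw_df.writeStream.format("delta")
    .option("checkpointLocation", "A/_checkpoints")  # checkpoint for location A (placeholder path)
    .option("path", "A")                             # location A in s3 (placeholder path)
    .start()
)
# table_a is then registered as an unmanaged table on top of location A:
# spark.sql("create table table_a using delta location 'A'")

# streaming_job_b (simplified): parse raw_data into separate columns, written to location B
parsed_df = (
    spark.readStream.format("delta").table("table_a")
    .select(
        json_tuple("raw_data", "field1", "field2").alias("field1", "field2"),  # placeholder parsing
        "event_timestamp",
    )
)

(
    parsed_df.writeStream.format("delta")
    .option("checkpointLocation", "B/_checkpoints")  # checkpoint for location B (placeholder path)
    .option("path", "B")                             # location B in s3 (placeholder path)
    .start()
)
# spark.sql("create table table_b using delta location 'B'")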

If I want to change the locations and table names used by both of these jobs, how do I do so in a way that preserves the data, doesn't cause problems with the checkpoints, and takes the least amount of time? Both tables have to be preserved because other teams read from both of them. The end result would ideally look like this:

  1. streaming_job_a reads from kafka, writes to location A_new in s3, and creates delta table table_a_new
  2. streaming_job_b reads from delta table table_a_new, writes to location B_new in s3, and creates delta table table_b_new.

I know I can read from the old location and write to the new location like this:

incoming_df = spark.readStream.format("delta").table("table_a")

writer_df = (
    incoming_df
    .writeStream.format("delta")
    .option("checkpointLocation", "A_new/_checkpoints")  # fresh checkpoint under the new location
    .option("path", "A_new")                             # new location in s3
    .trigger(once=True)                                  # process everything currently in table_a, then stop
)

writer_df.start()

and then create the new table:

spark.sql("create table table_a_new using delta location 'A_new'")

and then do something similar for streaming_job_b (roughly sketched below), but with this approach I'm concerned about missing new data that gets written to location A while the migration for streaming_job_b takes place. I'm fairly new to spark streaming in general, so any advice is greatly appreciated!
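
For concreteness, I assume the analogous one-off copy for streaming_job_b would look roughly like this (table_b_new, B_new, and its checkpoint path are placeholders for whatever we end up choosing):

incoming_df_b = spark.readStream.format("delta").table("table_b")

writer_df_b = (
    incoming_df_b
    .writeStream.format("delta")
    .option("checkpointLocation", "B_new/_checkpoints")
    .option("path", "B_new")
    .trigger(once=True)  # copy whatever is currently in table_b, then stop
)

writer_df_b.start()

spark.sql("create table table_b_new using delta location 'B_new'")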

Zaragoza answered 12/9, 2022 at 17:54 Comment(2)
What kind of data transformation happens in your streaming pipelines? Do you do any aggregations, or something else that requires storing actual data in state? Also, how do you read from location A? – Nb
streaming_job_a reads from kafka and essentially writes a dataframe containing the raw data in one column and a timestamp in another column to location A. Then streaming_job_b reads from table_a and extracts the raw data into separate columns before writing it to location B. We don't read from location A explicitly; we read from table_a, an unmanaged delta table in databricks. Other teams currently read from both table_a and table_b, so we have to preserve both tables for now. – Zaragoza
