Retaining Delta Lake transaction log data forever

I am confused about the Delta Lake transaction log. The documentation says that the default log retention period is 30 days and that it can be changed with the table property delta.logRetentionDuration=interval-string. What I don't understand is when the log files are actually deleted from the _delta_log folder. Does it happen when we run some operation, or perhaps during VACUUM? The documentation says that VACUUM deletes only data files, not log files. But will it also delete logs older than the configured log retention duration?

Reference: https://docs.databricks.com/delta/delta-batch.html#data-retention

Longsighted answered 29/12, 2020 at 3:14 Comment(1)
Just to add to this: how can we retain the Delta transaction log data forever? - Longsighted

delta-io/delta PROTOCOL.md:

By default, the reference implementation creates a checkpoint every 10 commits.

There is an async process that runs on every 10th commit to the _delta_log folder. It creates a checkpoint file and cleans up the .crc and .json files that are older than delta.logRetentionDuration.

In the source, Checkpoints.scala has the call chain checkpoint > checkpointAndCleanupDeltaLog > doLogCleanup, and MetadataCleanup.scala has doLogCleanup > cleanUpExpiredLogs.
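To make the timing concrete, here is a rough Python sketch of the idea described in this answer: a checkpoint is triggered on every 10th commit, and log files older than the retention window become candidates for cleanup at that point. This is not the Delta implementation. The function names, the file list, and the rule that version 0 does not trigger a checkpoint are all assumptions made for illustration, and the real code applies additional safeguards that are not modelled here.

```python
from datetime import datetime, timedelta

CHECKPOINT_INTERVAL = 10  # default quoted from PROTOCOL.md in this answer


def is_checkpoint_commit(version: int, interval: int = CHECKPOINT_INTERVAL) -> bool:
    """True when a commit version would trigger a checkpoint (assumed rule)."""
    return version > 0 and version % interval == 0


def expired_log_files(files, now, retention):
    """Return names of log files last modified before (now - retention).

    `files` is a list of (name, modified_at) pairs; hypothetical input shape.
    """
    cutoff = now - retention
    return [name for name, modified_at in files if modified_at < cutoff]


# Hypothetical _delta_log contents with made-up modification times.
now = datetime(2022, 9, 23)
files = [
    ("00000000000000000000.json", datetime(2022, 7, 1)),
    ("00000000000000000000.crc", datetime(2022, 7, 1)),
    ("00000000000000000010.json", datetime(2022, 9, 20)),
]

print(is_checkpoint_commit(10))
print(expired_log_files(files, now, timedelta(days=30)))
```

With the default 30-day retention, only the two files from July fall before the cutoff in this toy example. The point is that cleanup is tied to commits and checkpoints, not to VACUUM.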

Tamarisk answered 23/9, 2022 at 20:13 Comment(0)

The value of the option is an interval literal. There is no way to specify an infinite interval, and months and years are not allowed for this particular option (for a reason). However, nothing stops you from specifying interval 1000000000 weeks; roughly 19 million years is effectively infinite.
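As a minimal sketch of how that could look in practice, the Python below builds an ALTER TABLE ... SET TBLPROPERTIES statement for the property and checks the "19 million years" arithmetic. The table name my_table and both helper functions are hypothetical, and I have not verified here that every Delta version accepts an interval this large.

```python
def log_retention_sql(table_name: str, weeks: int) -> str:
    """Build a statement that sets delta.logRetentionDuration in weeks."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval {weeks} weeks')"
    )


def weeks_to_years(weeks: int) -> float:
    """Convert weeks to years, assuming 365.25 days per year."""
    return weeks * 7 / 365.25


# my_table is a hypothetical table name used only for illustration.
statement = log_retention_sql("my_table", 1_000_000_000)
print(statement)
print(round(weeks_to_years(1_000_000_000) / 1e6, 1), "million years")
```

In a Spark session with Delta configured, the generated string would be run with spark.sql(statement). Weeks are used because, as noted, months and years are not accepted for this option.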

Aquaplane answered 27/9, 2022 at 8:39 Comment(1)
I would clarify here that the "reason" applies only to specifying months or years, not to the fact that it must be finite. I really, really wish there was an option for infinite log (and data) retention and still haven't found a reason against it. - Killoran
