I have large datasets with millions of records in XML format. These datasets are full data dumps of a database up to a certain point in time.
Between two dumps new entries might have been added and existing ones might have been modified or deleted. Assume the schema remains unchanged and that every entry has a unique ID.
What would be the best way to determine the delta between two of these datasets (including deletions and updates)?
My plan is to load everything into an RDBMS and go from there.
First, load the older dump. Then, load the newer dump into a separate schema, checking along the way whether each entry is new or an update to an existing one. In either case, I'll log the ID in a new "changes" table.
Once that's done, I'll go through all entries of the old dump and check whether each has a matching record (i.e., same ID) in the new dump. If not, I'll log the ID to "changes" as a deletion. A rough sketch of this is below.
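To make the plan concrete, here is a minimal sketch of that two-pass approach using SQLite via Python's standard library. The table layout, the `<entry id="..." last_modified="...">` record format, and the `iter_entries()` helper are assumptions for illustration only:

```python
import sqlite3
import xml.etree.ElementTree as ET

def iter_entries(xml_path):
    """Yield (id, last_modified) for each record in a dump.
    Assumes records look like <entry id="..." last_modified="..."/>."""
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "entry":
            yield elem.get("id"), elem.get("last_modified")
            elem.clear()  # keep memory bounded while streaming

def build_delta(old_dump, new_dump, db_path="delta.db"):
    con = sqlite3.connect(db_path)
    con.executescript("""
        CREATE TABLE old     (id TEXT PRIMARY KEY, last_modified TEXT);
        CREATE TABLE new     (id TEXT PRIMARY KEY, last_modified TEXT);
        CREATE TABLE changes (id TEXT PRIMARY KEY, kind TEXT);
    """)

    # Pass 1: load the older dump; the primary-key index gives O(log n) lookups.
    con.executemany("INSERT INTO old VALUES (?, ?)", iter_entries(old_dump))

    # Pass 2: load the newer dump, classifying each entry on the way in.
    for entry_id, modified in iter_entries(new_dump):
        con.execute("INSERT INTO new VALUES (?, ?)", (entry_id, modified))
        row = con.execute("SELECT last_modified FROM old WHERE id = ?",
                          (entry_id,)).fetchone()
        if row is None:
            con.execute("INSERT INTO changes VALUES (?, 'added')", (entry_id,))
        elif row[0] != modified:
            con.execute("INSERT INTO changes VALUES (?, 'updated')", (entry_id,))

    # Pass 3: anything in the old dump with no counterpart in the new one was deleted.
    con.execute("""
        INSERT INTO changes
        SELECT o.id, 'deleted' FROM old o
        LEFT JOIN new n ON n.id = o.id
        WHERE n.id IS NULL
    """)
    con.commit()
    return con
```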
Assuming looking up a record by ID is an O(log n) operation, this should allow me to do everything in O(n log n) time.
Since the difference can be determined from just each record's ID and last modification date, I could also load everything into main memory. The time complexity would be the same, but with far less disk I/O, which should make it faster by orders of magnitude. A sketch of that variant follows.
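Here is a minimal sketch of the in-memory variant, assuming the same hypothetical `iter_entries()` parser as above and that one (ID, last modification date) pair per record fits comfortably in RAM:

```python
def diff_in_memory(old_dump, new_dump):
    """Diff two dumps using {id: last_modified} dictionaries and set operations."""
    old = dict(iter_entries(old_dump))
    new = dict(iter_entries(new_dump))

    added   = new.keys() - old.keys()
    deleted = old.keys() - new.keys()
    updated = {i for i in old.keys() & new.keys() if old[i] != new[i]}
    return added, updated, deleted
```

One note on the complexity: with hash-based dictionaries each lookup is O(1) on average, so the comparison itself is roughly O(n); the O(n log n) figure above corresponds to tree-based indexes like those typically used by an RDBMS.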
Suggestions? (Note: this is more of a performance question than anything else.)