I have large datasets with millions of records in XML format. These datasets are full data dumps of a database up to a certain point in time.
Between two dumps new entries might have been added and existing ones might have been modified or deleted. Assume the schema remains unchanged and that every entry has a unique ID.
What would be the best way to determine the delta between two of these datasets (including deletions and updates)?
My plan is to load everything into an RDBMS and go from there.
First, load the older dump. Then, load the newer dump into a separate schema, checking along the way whether each entry is new or an update to an existing one. In either case, I'll log the ID in a new "changes" table.
Once that's done, I'll go through all entries of the old dump and check whether each has a matching record (i.e., same ID) in the new dump. If not, I'll log the ID to "changes" as a deletion. A rough sketch of this is below.
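To make the plan concrete, here is a minimal sketch of that two-pass approach using SQLite via Python's standard library. The table layout, the `<entry id="..." last_modified="...">` record format, and the `iter_entries()` helper are assumptions for illustration only:

```python
import sqlite3
import xml.etree.ElementTree as ET

def iter_entries(xml_path):
    """Yield (id, last_modified) for each record in a dump.
    Assumes records look like <entry id="..." last_modified="..."/>."""
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "entry":
            yield elem.get("id"), elem.get("last_modified")
            elem.clear()  # keep memory bounded while streaming

def build_delta(old_dump, new_dump, db_path="delta.db"):
    con = sqlite3.connect(db_path)
    con.executescript("""
        CREATE TABLE old     (id TEXT PRIMARY KEY, last_modified TEXT);
        CREATE TABLE new     (id TEXT PRIMARY KEY, last_modified TEXT);
        CREATE TABLE changes (id TEXT PRIMARY KEY, kind TEXT);
    """)

    # Pass 1: load the older dump; the primary-key index gives O(log n) lookups.
    con.executemany("INSERT INTO old VALUES (?, ?)", iter_entries(old_dump))

    # Pass 2: load the newer dump, classifying each entry on the way in.
    for entry_id, modified in iter_entries(new_dump):
        con.execute("INSERT INTO new VALUES (?, ?)", (entry_id, modified))
        row = con.execute("SELECT last_modified FROM old WHERE id = ?",
                          (entry_id,)).fetchone()
        if row is None:
            con.execute("INSERT INTO changes VALUES (?, 'added')", (entry_id,))
        elif row[0] != modified:
            con.execute("INSERT INTO changes VALUES (?, 'updated')", (entry_id,))

    # Pass 3: anything in the old dump with no counterpart in the new one was deleted.
    con.execute("""
        INSERT INTO changes
        SELECT o.id, 'deleted' FROM old o
        LEFT JOIN new n ON n.id = o.id
        WHERE n.id IS NULL
    """)
    con.commit()
    return con
```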
Assuming looking up a record by ID is an O(log n) operation, this should allow me to do everything in O(n log n) time.
Since the difference can be determined from just each record's ID and last modification date, I could also load everything into main memory. The time complexity would be the same, but with far less disk I/O, which should make it faster by orders of magnitude. A sketch of that variant follows.
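Here is a minimal sketch of the in-memory variant, assuming the same hypothetical `iter_entries()` parser as above and that one (ID, last modification date) pair per record fits comfortably in RAM:

```python
def diff_in_memory(old_dump, new_dump):
    """Diff two dumps using {id: last_modified} dictionaries and set operations."""
    old = dict(iter_entries(old_dump))
    new = dict(iter_entries(new_dump))

    added   = new.keys() - old.keys()
    deleted = old.keys() - new.keys()
    updated = {i for i in old.keys() & new.keys() if old[i] != new[i]}
    return added, updated, deleted
```

One note on the complexity: with hash-based dictionaries each lookup is O(1) on average, so the comparison itself is roughly O(n); the O(n log n) figure above corresponds to tree-based indexes like those typically used by an RDBMS.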
Suggestions? (Note: this is more of a performance question than anything else.)