Compare two DataTables to determine rows in one but not the other
Asked Answered
K

13

18

I have two DataTables, A and B, produced from CSV files. I need to be able to check which rows exist in B that do not exist in A.

Is there a way to do some sort of query to show the different rows or would I have to iterate through each row on each DataTable to check if they are the same? The latter option seems to be very intensive if the tables become large.

Kania answered 2/10, 2008 at 19:33 Comment(0)
C
10

would I have to iterate through each row on each DataTable to check if they are the same.

Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.

Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:

1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:

  • Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
  • Step through both tables at once, skipping over any gaps in ID's in either table. Matched ID's mean duplicated records.

This allows you to do it in (sort time * 2 ) + one pass, so if my big-O-notation is correct, it'd be (whatever-sort-time) + O(m+n) which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes )

2: An alternative approach, which may be more or less efficient depending on how big your data is:

  • Run through table 1, and for each row, stick it's ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
  • Run through table 2, and for each row, see if the ID (or hashcode etc) is present in the dictionary. You're exploiting the fact that dictionaries have really fast - O(1) I think? lookup. This step will be really fast, but you'll have paid the price doing all those dictionary inserts.

I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)

Count answered 2/10, 2008 at 19:55 Comment(0)
A
21

Assuming you have an ID column which is of an appropriate type (i.e. gives a hashcode and implements equality) - string in this example, which is slightly pseudocode because I'm not that familiar with DataTables and don't have time to look it all up just now :)

IEnumerable<string> idsInA = tableA.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> idsInB = tableB.AsEnumerable().Select(row => (string)row["ID"]);
IEnumerable<string> bNotA = idsInB.Except(idsInA);
Authorization answered 2/10, 2008 at 20:19 Comment(1)
I changed all the references to string to be references to int as my ID column was an integer primary keyPhilae
C
10

would I have to iterate through each row on each DataTable to check if they are the same.

Seeing as you've loaded the data from a CSV file, you're not going to have any indexes or anything, so at some point, something is going to have to iterate through every row, whether it be your code, or a library, or whatever.

Anyway, this is an algorithms question, which is not my specialty, but my naive approach would be as follows:

1: Can you exploit any properties of the data? Are all the rows in each table unique, and can you sort them both by the same criteria? If so, you can do this:

  • Sort both tables by their ID (using some useful thing like a quicksort). If they're already sorted then you win big.
  • Step through both tables at once, skipping over any gaps in ID's in either table. Matched ID's mean duplicated records.

This allows you to do it in (sort time * 2 ) + one pass, so if my big-O-notation is correct, it'd be (whatever-sort-time) + O(m+n) which is pretty good.
(Revision: this is the approach that ΤΖΩΤΖΙΟΥ describes )

2: An alternative approach, which may be more or less efficient depending on how big your data is:

  • Run through table 1, and for each row, stick it's ID (or computed hashcode, or some other unique ID for that row) into a dictionary (or hashtable if you prefer to call it that).
  • Run through table 2, and for each row, see if the ID (or hashcode etc) is present in the dictionary. You're exploiting the fact that dictionaries have really fast - O(1) I think? lookup. This step will be really fast, but you'll have paid the price doing all those dictionary inserts.

I'd be really interested to see what people with better knowledge of algorithms than myself come up with for this one :-)

Count answered 2/10, 2008 at 19:55 Comment(0)
P
7

You can use the Merge and GetChanges methods on the DataTable to do this:

A.Merge(B); // this will add to A any records that are in B but not A
return A.GetChanges(); // returns records originally only in B
Possie answered 2/10, 2008 at 19:46 Comment(5)
I've been trying this and getting a null result, even though A and B were different (A was empty, B had data)Thyroid
@Chad: In your case, A may be empty of both data and columns/fields. A and B have to have the same columns for this method to work.Possie
When using the Merge it does not flag the DataRowStateAnthraquinone
If you do not set the PrimaryKey value in each datatable B will simply be added to A and the Merge will not work. Use something like this A.PrimaryKey = new DataColumn[] { A.Columns["ID"] }; from this SO answer: #3622221Philae
Also, when you add the primary key, just before invoking A.GetChanges(), you should invoke A.AcceptChanges(); The value now returned will now only contain the rows that are different between the 2 DataTables.Philae
M
4

The answers so far assume that you're simply looking for duplicate primary keys. That's a pretty easy problem - you can use the Merge() method, for instance.

But I understand your question to mean that you're looking for duplicate DataRows. (From your description of the problem, with both tables being imported from CSV files, I'd even assume that the original rows didn't have primary key values, and that any primary keys are being assigned via AutoNumber during the import.)

The naive implementation (for each row in A, compare its ItemArray with that of each row in B) is indeed going to be computationally expensive.

A much less expensive way to do this is with a hashing algorithm. For each DataRow, concatenate the string values of its columns into a single string, and then call GetHashCode() on that string to get an int value. Create a Dictionary<int, DataRow> that contains an entry, keyed on the hash code, for each DataRow in DataTable B. Then, for each DataRow in DataTable A, calculate the hash code, and see if it's contained in the dictionary. If it's not, you know that the DataRow doesn't exist in DataTable B.

This approach has two weaknesses that both emerge from the fact that two strings can be unequal but produce the same hash code. If you find a row in A whose hash is in the dictionary, you then need to check the DataRow in the dictionary to verify that the two rows are really equal.

The second weakness is more serious: it's unlikely, but possible, that two different DataRows in B could hash to the same key value. For this reason, the dictionary should really be a Dictionary<int, List<DataRow>>, and you should perform the check described in the previous paragraph against each DataRow in the list.

It takes a fair amount of work to get this working, but it's an O(m+n) algorithm, which I think is going to be as good as it gets.

Mature answered 2/10, 2008 at 20:29 Comment(0)
W
1

Just FYI:

Generally speaking about algorithms, comparing two sets of sortable (as ids typically are) is not an O(M*N/2) operation, but O(M+N) if the two sets are ordered. So you scan one table with a pointer to the start of the other, and:

other_item= A.first()
only_in_B= empty_list()
for item in B:
    while other_item > item:
        other_item= A.next()
        if A.eof():
             only_in_B.add( all the remaining B items)
             return only_in_B
    if item < other_item:
         empty_list.append(item)
return only_in_B

The code above is obviously pseudocode, but should give you the general gist if you decide to code it yourself.

Wittenberg answered 2/10, 2008 at 19:47 Comment(0)
K
1

Thanks for all the feedback.

I do not have any index's unfortunately. I will give a little more information about my situation.

We have a reporting program (replaced Crystal reports) that is installed in 7 Servers across EU. These servers have many reports on them (not all the same for each country). They are invoked by a commandline application that uses XML files for their configuration. So One XML file can call multiple reports.

The commandline application is scheduled and controlled by our overnight process. So the XML file could be called from multiple places.

The goal of the CSV is to produce a list of all the reports that are being used and where they are being called from.

I am going through the XML files for all references, querying the scheduling program and producing a list of all the reports. (this is not too bad).

The problem I have is I have to keep a list of all the reports that might have been removed from production. So I need to compare the old CSV with the new data. For this I thought it best to put it into DataTables and compare the information, (this could be the wrong approach. I suppose I could create an object that holds it and compares the difference then create iterate through them).

The data I have about each report is as follows:

String - Task Name String - Action Name Int - ActionID (the Action ID can be in multiple records as a single action can call many reports, i.e. an XML file). String - XML File called String - Report Name

I will try the Merge idea given by MusiGenesis (thanks). (rereading some of the posts not sure if the Merge will work, but worth trying as I have not heard about it before so something new to learn).

The HashCode Idea sounds interesting as well.

Thanks for all the advice.

Kania answered 3/10, 2008 at 10:36 Comment(2)
Using Merge and GetChanges will work (I just tested it). It's not the most efficient solution, but unless your CSV files are huge it won't be a problem.Possie
Thank you Musi... I would have couple of hundred rows at a time so Merge sounds good.Kania
M
1
public DataTable compareDataTables(DataTable First, DataTable Second)
{
        First.TableName = "FirstTable";
        Second.TableName = "SecondTable";

        //Create Empty Table
        DataTable table = new DataTable("Difference");
        DataTable table1 = new DataTable();
        try
        {
            //Must use a Dataset to make use of a DataRelation object
            using (DataSet ds4 = new DataSet())
            {
                //Add tables
                ds4.Tables.AddRange(new DataTable[] { First.Copy(), Second.Copy() });

                //Get Columns for DataRelation
                DataColumn[] firstcolumns = new DataColumn[ds4.Tables[0].Columns.Count];
                for (int i = 0; i < firstcolumns.Length; i++)
                {
                    firstcolumns[i] = ds4.Tables[0].Columns[i];
                }
                DataColumn[] secondcolumns = new DataColumn[ds4.Tables[1].Columns.Count];
                for (int i = 0; i < secondcolumns.Length; i++)
                {
                    secondcolumns[i] = ds4.Tables[1].Columns[i];
                }
                //Create DataRelation
                DataRelation r = new DataRelation(string.Empty, firstcolumns, secondcolumns, false);
                ds4.Relations.Add(r);
                //Create columns for return table
                for (int i = 0; i < First.Columns.Count; i++)
                {
                    table.Columns.Add(First.Columns[i].ColumnName, First.Columns[i].DataType);
                }
                //If First Row not in Second, Add to return table.
                table.BeginLoadData();
                foreach (DataRow parentrow in ds4.Tables[0].Rows)
                { 
                    DataRow[] childrows = parentrow.GetChildRows(r);

                    if (childrows == null || childrows.Length == 0)
                        table.LoadDataRow(parentrow.ItemArray, true);
                    table1.LoadDataRow(childrows, false);

                }
                table.EndLoadData();
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        return table;
}
Megdal answered 18/3, 2009 at 5:0 Comment(0)
R
1

I found an easy way to solve this. Unlike previous "except method" answers, I use the except method twice. This not only tells you what rows were deleted but what rows were added. If you only use one except method - it will only tell you one difference and not both. This code is tested and works. See below

//Pass in your two datatables into your method

        //build the queries based on id.
        var qry1 = datatable1.AsEnumerable().Select(a => new { ID = a["ID"].ToString() });
        var qry2 = datatable2.AsEnumerable().Select(b => new { ID = b["ID"].ToString() });


        //detect row deletes - a row is in datatable1 except missing from datatable2
        var exceptAB = qry1.Except(qry2);

        //detect row inserts - a row is in datatable2 except missing from datatable1
        var exceptAB2 = qry2.Except(qry1);

then execute your code against the results

        if (exceptAB.Any())
        {
            foreach (var id in exceptAB)
            {
   //execute code here
            }


        }
        if (exceptAB2.Any())
        {
            foreach (var id in exceptAB2)
            {
//execute code here
            }



        }
Resoluble answered 9/7, 2013 at 22:48 Comment(0)
R
1

Could you not simply compare the CSV files before loading them into DataTables?

string[] a = System.IO.File.ReadAllLines(@"cvs_a.txt");
string[] b = System.IO.File.ReadAllLines(@"csv_b.txt");

// get the lines from b that are not in a
IEnumerable<string> diff = b.Except(a);

//... parse b into DataTable ...
Renz answered 30/6, 2016 at 18:28 Comment(0)
M
0
        try
        {
            if (ds.Tables[0].Columns.Count == ds1.Tables[0].Columns.Count)
            {
               for (int i = 0; i < ds.Tables[0].Rows.Count; i++)
               {
                    for (int j = 0; j < ds.Tables[0].Columns.Count; j++)
                   {
       if (ds.Tables[0].Rows[i][j].ToString() == ds1.Tables[0].Rows[i][j].ToString())
                       {


                        }
                        else
                        {

                           MessageBox.Show(i.ToString() + "," + j.ToString());


                       }

                                               }

                }

            }
            else
            {
               MessageBox.Show("Table has different columns ");
            }
        }
        catch (Exception)
        {
           MessageBox.Show("Please select The Table");
        }
Megdal answered 18/3, 2009 at 5:3 Comment(0)
S
0

I'm continuing tzot's idea ...

If you have two sortable sets, then you can just use:

List<string> diffList = new List<string>(sortedListA.Except(sortedListB));

If you need more complicated objects, you can define a comparator yourself and still use it.

Stillhunt answered 29/1, 2012 at 18:33 Comment(0)
T
0

The usual usage scenario considers a user that has a DataTable in hand and changes it by Adding, Deleting or Modifying some of the DataRows.

After the changes are performed, the DataTable is aware of the proper DataRowState for each row, and also keeps track of the Original DataRowVersion for any rows that were changed.

In this usual scenario, one can Merge the changes back into a source table (in which all rows are Unchanged). After merging, one can get a nice summary of only the changed rows with a call to GetChanges().

In a more unusual scenario, a user has two DataTables with the same schema (or perhaps only the same columns and lacking primary keys). These two DataTables consist of only Unchanged rows. The user may want to find out what changes does he need to apply to one of the two tables in order to get to the other one. That is, which rows need to be Added, Deleted, or Modified.

We define here a function called GetDelta() which does the job:

using System;
using System.Data;
using System.Xml;
using System.Linq;
using System.Collections.Generic;
using System.Data.DataSetExtensions;

public class Program
{
    private static DataTable GetDelta(DataTable table1, DataTable table2)
    {
        // Modified2 : row1 keys match rowOther keys AND row1 does not match row2:
        IEnumerable<DataRow> modified2 = (
            from row1 in table1.AsEnumerable()
            from row2 in table2.AsEnumerable()
            where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
                  && !row1.ItemArray.SequenceEqual(row2.ItemArray)
            select row2);

        // Modified1 :
        IEnumerable<DataRow> modified1 = (
            from row1 in table1.AsEnumerable()
            from row2 in table2.AsEnumerable()
            where table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row2[keycol.Ordinal]))
                  && !row1.ItemArray.SequenceEqual(row2.ItemArray)
            select row1);

        // Added : row2 not in table1 AND row2 not in modified2
        IEnumerable<DataRow> added = table2.AsEnumerable().Except(modified2, DataRowComparer.Default).Except(table1.AsEnumerable(), DataRowComparer.Default);

        // Deleted : row1 not in row2 AND row1 not in modified1
        IEnumerable<DataRow> deleted = table1.AsEnumerable().Except(modified1, DataRowComparer.Default).Except(table2.AsEnumerable(), DataRowComparer.Default);


        Console.WriteLine();
        Console.WriteLine("modified count =" + modified1.Count());
        Console.WriteLine("added count =" + added.Count());
        Console.WriteLine("deleted count =" + deleted.Count());

        DataTable deltas = table1.Clone();

        foreach (DataRow row in modified2)
        {
            // Match the unmodified version of the row via the PrimaryKey
            DataRow matchIn1 = modified1.Where(row1 =>  table1.PrimaryKey.Aggregate(true, (boolAggregate, keycol) => boolAggregate & row1[keycol].Equals(row[keycol.Ordinal]))).First();
            DataRow newRow = deltas.NewRow();

            // Set the row with the original values
            foreach(DataColumn dc in deltas.Columns)
                newRow[dc.ColumnName] = matchIn1[dc.ColumnName];
            deltas.Rows.Add(newRow);
            newRow.AcceptChanges();

            // Set the modified values
            foreach (DataColumn dc in deltas.Columns)
                newRow[dc.ColumnName] = row[dc.ColumnName];
            // At this point newRow.DataRowState should be : Modified
        }

        foreach (DataRow row in added)
        {
            DataRow newRow = deltas.NewRow();
            foreach (DataColumn dc in deltas.Columns)
                newRow[dc.ColumnName] = row[dc.ColumnName];
            deltas.Rows.Add(newRow);
            // At this point newRow.DataRowState should be : Added
        }


        foreach (DataRow row in deleted)
        {
            DataRow newRow = deltas.NewRow();
            foreach (DataColumn dc in deltas.Columns)
                newRow[dc.ColumnName] = row[dc.ColumnName];
            deltas.Rows.Add(newRow);
            newRow.AcceptChanges();
            newRow.Delete();
            // At this point newRow.DataRowState should be : Deleted
        }

        return deltas;
    }

    private static void DemonstrateGetDelta()
    {
        DataTable table1 = new DataTable("Items");

        // Add columns
        DataColumn column1 = new DataColumn("id1", typeof(System.Int32));
        DataColumn column2 = new DataColumn("id2", typeof(System.Int32));
        DataColumn column3 = new DataColumn("item", typeof(System.Int32));
        table1.Columns.Add(column1);
        table1.Columns.Add(column2);
        table1.Columns.Add(column3);

        // Set the primary key column.
        table1.PrimaryKey = new DataColumn[] { column1, column2 };


        // Add some rows.
        DataRow row;
        for (int i = 0; i <= 4; i++)
        {
            row = table1.NewRow();
            row["id1"] = i;
            row["id2"] = i*i;
            row["item"] = i;
            table1.Rows.Add(row);
        }

        // Accept changes.
        table1.AcceptChanges();
        PrintValues(table1, "table1:");

        // Create a second DataTable identical to the first.
        DataTable table2 = table1.Clone();

        // Add a row that exists in table1:
        row = table2.NewRow();
        row["id1"] = 0;
        row["id2"] = 0; 
        row["item"] = 0;
        table2.Rows.Add(row);

        // Modify the values of a row that exists in table1:
        row = table2.NewRow();
        row["id1"] = 1;
        row["id2"] = 1;
        row["item"] = 455;
        table2.Rows.Add(row);

        // Modify the values of a row that exists in table1:
        row = table2.NewRow();
        row["id1"] = 2;
        row["id2"] = 4;
        row["item"] = 555;
        table2.Rows.Add(row);

        // Add a row that does not exist in table1:
        row = table2.NewRow();
        row["id1"] = 13;
        row["id2"] = 169;
        row["item"] = 655;
        table2.Rows.Add(row);

        table2.AcceptChanges();

        Console.WriteLine();
        PrintValues(table2, "table2:");

        DataTable delta = GetDelta(table1,table2);

        Console.WriteLine();
        PrintValues(delta,"delta:");

        // Verify that the deltas DataTable contains the adequate Original DataRowVersions:
        DataTable originals = table1.Clone();
        foreach (DataRow drow in delta.Rows)
        {
            if (drow.RowState != DataRowState.Added)
            {
                DataRow originalRow = originals.NewRow();
                foreach (DataColumn dc in originals.Columns)
                    originalRow[dc.ColumnName] = drow[dc.ColumnName, DataRowVersion.Original];
                originals.Rows.Add(originalRow);
            }
        }
        originals.AcceptChanges();

        Console.WriteLine();
        PrintValues(originals,"delta original values:");
    }

    private static void Row_Changed(object sender, 
        DataRowChangeEventArgs e)
    {
        Console.WriteLine("Row changed {0}\t{1}", 
            e.Action, e.Row.ItemArray[0]);
    }

    private static void PrintValues(DataTable table, string label)
    {
        // Display the values in the supplied DataTable:
        Console.WriteLine(label);
        foreach (DataRow row in table.Rows)
        {
            foreach (DataColumn col in table.Columns)
            {
                Console.Write("\t " + row[col, row.RowState == DataRowState.Deleted ? DataRowVersion.Original : DataRowVersion.Current].ToString());
            }
            Console.Write("\t DataRowState =" + row.RowState);
            Console.WriteLine();
        }
    }

    public static void Main()
    {
        DemonstrateGetDelta();
    }
}

The code above can be tested in https://dotnetfiddle.net/. The resulting output is shown below:

table1:
     0     0     0     DataRowState =Unchanged
     1     1     1     DataRowState =Unchanged
     2     4     2     DataRowState =Unchanged
     3     9     3     DataRowState =Unchanged
     4     16     4     DataRowState =Unchanged

table2:
     0     0     0     DataRowState =Unchanged
     1     1     455     DataRowState =Unchanged
     2     4     555     DataRowState =Unchanged
     13     169     655     DataRowState =Unchanged

modified count =2
added count =1
deleted count =2

delta:
     1     1     455     DataRowState =Modified
     2     4     555     DataRowState =Modified
     13     169     655     DataRowState =Added
     3     9     3     DataRowState =Deleted
     4     16     4     DataRowState =Deleted

delta original values:
     1     1     1     DataRowState =Unchanged
     2     4     2     DataRowState =Unchanged
     3     9     3     DataRowState =Unchanged
     4     16     4     DataRowState =Unchanged

Note that if your tables don't have a PrimaryKey, the where clause in the LINQ queries gets simplified a little bit. I'll let you figure that out on your own.

Tims answered 13/8, 2015 at 9:11 Comment(0)
C
0

Achieve it simply using linq.

private DataTable CompareDT(DataTable TableA, DataTable TableB)
    {
        DataTable TableC = new DataTable();
        try
        {

            var idsNotInB = TableA.AsEnumerable().Select(r => r.Field<string>(Keyfield))
            .Except(TableB.AsEnumerable().Select(r => r.Field<string>(Keyfield)));
            TableC = (from row in TableA.AsEnumerable()
                      join id in idsNotInB
                      on row.Field<string>(ddlColumn.SelectedItem.ToString()) equals id
                      select row).CopyToDataTable();
        }
        catch (Exception ex)
        {
            lblresult.Text = ex.Message;
            ex = null;
         }
        return TableC;

    }
Carious answered 26/5, 2016 at 8:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.