efficient way to compare two tables in bigquery
Asked Answered
A

3

36

I am interested in comparing, whether two tables contain the same data.

I could do it like this:

#standardSQL
SELECT
    key1, key2
FROM
(
    SELECT 
    table1.key1,
    table1.key2,
    table1.column1 - table2.column1 as col1,
    table1.col2 - table2.col2 as col2
    FROM
        `table1` AS table1
    LEFT JOIN
        `table2` AS table2
    ON
        table1.key1 = table2.key1
    AND
        table1.key2 = table2.key2
)
WHERE 
    col1 != 0
OR
    col2 != 0

But when I want to compare all numerical columns, this is kind of hard, especially if I want to do it for multiple table combinations.

Therefore my question: Is someone aware of a possibility to iterate over all numerical columns and restrict the result set to those keys where any of these differences where not zero?

Aschim answered 12/7, 2018 at 17:52 Comment(0)
L
13

First, I want to bring up issues with your original query

The main issues are 1) using LEFT JOIN ; 2) using col != 0

Below is how it should be modified to really capture ALL differences from both tables
Run your original query and below one - and hopefully you will see the difference

#standardSQL
SELECT key1, key2
FROM
(
    SELECT 
    IFNULL(table1.key1, table2.key1) key1,
    IFNULL(table1.key2, table2.key2) key2,
    table1.column1 - table2.column1 AS col1,
    table1.col2 - table2.col2 AS col2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR    IFNULL(col2, 1) != 0

or you can just try to run your original and above version against dummy data to see the difference

#standardSQL
WITH `table1` AS (
  SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 3, 3, 5, 6
), `table2` AS (
  SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 4, 4, 7, 8
)
SELECT key1, key2
FROM
(
    SELECT 
    IFNULL(table1.key1, table2.key1) key1,
    IFNULL(table1.key2, table2.key2) key2,
    table1.column1 - table2.column1 AS col1,
    table1.col2 - table2.col2 AS col2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR    IFNULL(col2, 1) != 0   

Secondly, below will highly simplify your overall query

#standardSQL
SELECT 
  IFNULL(table1.key1, table2.key1) key1,
  IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING(table1) != TO_JSON_STRING(table2)  

You can test it with the same dummy data example as above
Note: in this solution you don't need to pick specific columns - it just compare all columns! but if you need to compare only specific columns - you still will need to cherry-pick them like in below example

#standardSQL
SELECT 
  IFNULL(table1.key1, table2.key1) key1,
  IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING((table1.column1, table1.col2)) != TO_JSON_STRING((table2.column1, table2.col2))
Longlegged answered 12/7, 2018 at 18:34 Comment(2)
Thanks, this answers my problem (even if I still would be interested in only comparing numerical ones, but indeed this is a better approach)!Aschim
I found a Medium post on this, BigQuery Table Comparison, which discusses both full table comparison, using hashing, and key-wise comparisons. The full table compare is similar to the solution presented here.Intertype
P
68

In Standard SQL, we found using a UNION ALL of two EXCEPT DISTINCT's works for our use cases:

(
  SELECT * FROM table1
  EXCEPT DISTINCT
  SELECT * FROM table2
)

UNION ALL

(
  SELECT * FROM table2
  EXCEPT DISTINCT
  SELECT * FROM table1
)

This will produce differences in both directions:

  • rows in table1 that are not in table2
  • rows in table2 that are not in table1

Notes and caveats:

  • table1 and table2 must be of the same width and have columns in the same order and type.
  • this does not work directly with STRUCT or ARRAY data types. You should either UNNEST, or use TO_JSON_STRING to convert the these data types first.
  • this does not directly work with GEOGRAPHY either, you must cast to text first using ST_AsText
Poundal answered 12/9, 2019 at 18:37 Comment(2)
What's a good way to add a column to this which tells which table the data is from?Dumah
I modified to SELECT 'table1' AS Tbl, * FROM (SELECT ... EXCEPT DISTINCT ...) UNION ALL SELECT 'table2' AS Tbl, * FROM (SELECT ... EXCEPT DISTINCT ...) this gets the table source in the results too. BTW -- love this approach overallDumah
L
13

First, I want to bring up issues with your original query

The main issues are 1) using LEFT JOIN ; 2) using col != 0

Below is how it should be modified to really capture ALL differences from both tables
Run your original query and below one - and hopefully you will see the difference

#standardSQL
SELECT key1, key2
FROM
(
    SELECT 
    IFNULL(table1.key1, table2.key1) key1,
    IFNULL(table1.key2, table2.key2) key2,
    table1.column1 - table2.column1 AS col1,
    table1.col2 - table2.col2 AS col2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR    IFNULL(col2, 1) != 0

or you can just try to run your original and above version against dummy data to see the difference

#standardSQL
WITH `table1` AS (
  SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 3, 3, 5, 6
), `table2` AS (
  SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
  SELECT 2, 2, 3, 4 UNION ALL
  SELECT 4, 4, 7, 8
)
SELECT key1, key2
FROM
(
    SELECT 
    IFNULL(table1.key1, table2.key1) key1,
    IFNULL(table1.key2, table2.key2) key2,
    table1.column1 - table2.column1 AS col1,
    table1.col2 - table2.col2 AS col2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR    IFNULL(col2, 1) != 0   

Secondly, below will highly simplify your overall query

#standardSQL
SELECT 
  IFNULL(table1.key1, table2.key1) key1,
  IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING(table1) != TO_JSON_STRING(table2)  

You can test it with the same dummy data example as above
Note: in this solution you don't need to pick specific columns - it just compare all columns! but if you need to compare only specific columns - you still will need to cherry-pick them like in below example

#standardSQL
SELECT 
  IFNULL(table1.key1, table2.key1) key1,
  IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING((table1.column1, table1.col2)) != TO_JSON_STRING((table2.column1, table2.col2))
Longlegged answered 12/7, 2018 at 18:34 Comment(2)
Thanks, this answers my problem (even if I still would be interested in only comparing numerical ones, but indeed this is a better approach)!Aschim
I found a Medium post on this, BigQuery Table Comparison, which discusses both full table comparison, using hashing, and key-wise comparisons. The full table compare is similar to the solution presented here.Intertype
N
3

You will need to specify which are the numerical columns, but looking at a representation of all of them will do the fast compare:

#standardSQL
WITH table_a AS (
  SELECT 1 id, 2 n1, 3 n2
), table_b AS  (
  SELECT 1 id, 2 n1, 4 n2
)


SELECT id
FROM table_a a
JOIN table_b b
USING(id)
WHERE TO_JSON_STRING([a.n1, a.n2]) != TO_JSON_STRING([b.n1, b.n2])
Needlework answered 12/7, 2018 at 18:10 Comment(3)
so by fast compare do you mean this speeds the join up? But I still have to determine the numerical ones manually?Aschim
yeah, "fast" as in you'll have to write way less code, but SQL doesn't have a way to introspect into the type of the columns AFAIKNeedlework
unfortunately that's what I wanted... even if it doesn't answer my question, thanks for showing me this optimization possibility!Aschim

© 2022 - 2024 — McMap. All rights reserved.