sqlansi-sqlgoogle-bigquery

efficient way to compare two tables in bigquery


I am interested in comparing, whether two tables contain the same data.

I could do it like this:

#standardSQL
SELECT
    key1, key2
FROM
(
    SELECT 
    table1.key1,
    table1.key2,
    table1.column1 - table2.column1 as col1,
    table1.col2 - table2.col2 as col2
    FROM
        `table1` AS table1
    LEFT JOIN
        `table2` AS table2
    ON
        table1.key1 = table2.key1
    AND
        table1.key2 = table2.key2
)
WHERE 
    col1 != 0
OR
    col2 != 0

But when I want to compare all numerical columns, this is kind of hard, especially if I want to do it for multiple table combinations.

Therefore my question: Is someone aware of a possibility to iterate over all numerical columns and restrict the result set to those keys where any of these differences where not zero?


Solution

  • First, I want to bring up issues with your original query

    The main issues are 1) using LEFT JOIN ; 2) using col != 0

    Below is how it should be modified to really capture ALL differences from both tables
    Run your original query and below one - and hopefully you will see the difference

    #standardSQL
    SELECT key1, key2
    FROM
    (
        SELECT 
        IFNULL(table1.key1, table2.key1) key1,
        IFNULL(table1.key2, table2.key2) key2,
        table1.column1 - table2.column1 AS col1,
        table1.col2 - table2.col2 AS col2
        FROM `table1` AS table1
        FULL OUTER JOIN `table2` AS table2
        ON table1.key1 = table2.key1
        AND table1.key2 = table2.key2
    )
    WHERE IFNULL(col1, 1) != 0
    OR    IFNULL(col2, 1) != 0
    

    or you can just try to run your original and above version against dummy data to see the difference

    #standardSQL
    WITH `table1` AS (
      SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
      SELECT 2, 2, 3, 4 UNION ALL
      SELECT 3, 3, 5, 6
    ), `table2` AS (
      SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
      SELECT 2, 2, 3, 4 UNION ALL
      SELECT 4, 4, 7, 8
    )
    SELECT key1, key2
    FROM
    (
        SELECT 
        IFNULL(table1.key1, table2.key1) key1,
        IFNULL(table1.key2, table2.key2) key2,
        table1.column1 - table2.column1 AS col1,
        table1.col2 - table2.col2 AS col2
        FROM `table1` AS table1
        FULL OUTER JOIN `table2` AS table2
        ON table1.key1 = table2.key1
        AND table1.key2 = table2.key2
    )
    WHERE IFNULL(col1, 1) != 0
    OR    IFNULL(col2, 1) != 0   
    

    Secondly, below will highly simplify your overall query

    #standardSQL
    SELECT 
      IFNULL(table1.key1, table2.key1) key1,
      IFNULL(table1.key2, table2.key2) key2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
    WHERE TO_JSON_STRING(table1) != TO_JSON_STRING(table2)  
    

    You can test it with the same dummy data example as above
    Note: in this solution you don't need to pick specific columns - it just compare all columns! but if you need to compare only specific columns - you still will need to cherry-pick them like in below example

    #standardSQL
    SELECT 
      IFNULL(table1.key1, table2.key1) key1,
      IFNULL(table1.key2, table2.key2) key2
    FROM `table1` AS table1
    FULL OUTER JOIN `table2` AS table2
    ON table1.key1 = table2.key1
    AND table1.key2 = table2.key2
    WHERE TO_JSON_STRING((table1.column1, table1.col2)) != TO_JSON_STRING((table2.column1, table2.col2))