mysqlsqlmultiple-tables

Data inconsistencies between two tables


I have an SQL question in which I am struggling to understand and find relevant resources to help me.

The question is: "Write an SQL query to identify data inconsistencies between two tables."

I need to compare the following tables of data:

AssetManager

AssetManagerName
John Doe
Joe Smith
Dave Grey
Lisa Sparks
Kate Green

Trip

PropertyCode AssetManagerName Date
P001 John Doe 2022-01-22
P001 Joe Smith 2022-01-19
P002 Dave Grey 2022-02-25
P002 John Doe 2022-04-23
P003 Kate Greens 2022-02-25
P004 Joe Smith 2022-05-29
P002 Dave Grey 2022-01-25
P001 John Doe 2022-02-24

Image translated to text from Original Source

What are the inconsistencies in this case? Is it maybe that "Kate Green" is in the AssetManager table, and you have "Kate Greens" in the Trip table? That's the only thing I can see.

What MySQL commands could I use that would help me to achieve this query?


Solution

  • In SQL, when we talk about inconsistencies, we are generally referring to data that would not correctly translate into a normalised form, when we try to join between tables this would result in missing data or orphaned rows. Commonly inconsistencies arise when there is no referential constraints in a schema to maintain consistency. In such cases simple spelling mistakes can easily creep into the dataset, but entirely wrong values could also be used. In this case, If there is a table that represents all the possible Asset Managers, then we would expect that in other tables that refer to Asset Managers that only values from the Asset Managers table would be used, spelling mistakes and entirely missing names will be treated the same.

    In the Trip Table we can identify inconsistency with the AssetManager table by looking for any records in Trip that do not have a match in AssetManager using the AssetManagerName column.

    One simple way to do this is to use an OUTER JOIN and to exclude all the matches:

    SELECT Trip.*
    FROM Trip
    LEFT OUTER JOIN AssetManager ON Trip.AssetManagerName = AssetManager.AssetManagerName
    WHERE AssetManager.AssetManagerName IS NULL
    

    This returns the following result: (See db-fiddle)

    PropertyCode AssetManagerName Date
    P003 Kate Greens 2022-02-25

    The LEFT OUTER JOIN (or LEFT JOIN) will return all the rows from the Trip table, even if there is no corresponding match in the AssetManager table on the AssetManagerName column. For the rows that do not match, all the values for the AssetManager table in the result set will be NULL.

    We can then use a WHERE clause to exclude all the matches data records and only return those records that DO NOT MATCH, we do this by only allowing where AssetManager.AssetManagerName has a null value.

    There are no records in Trip with a legitimate null value in the AssetManagerName, the null only exists in the recordset at a result of the LEFT OUTER JOIN evaluation.

    You could also use a NOT EXISTS Clause, this syntax is sometimes easier to read and identify the intent, we want to find the records that DO NOT MATCH. But specifically in MySQL it's execution plan generally less efficient than the LEFT OUTER JOIN expression above.

    SELECT Trip.*
    FROM Trip
    WHERE NOT EXISTS (
        SELECT AssetManager.AssetManagerName
        FROM AssetManager
        WHERE AssetManager.AssetManagerName = Trip.AssetManagerName
    )
    

    Another variation of this is to use NOT IN. For this query we first evaluate a list of possible values for AssetManagerName and use that to identify the values that do not match.

    This is helpful when there might be some legitimate null values in either of the tables for AssetManagerName as IN handles NULL values differently to EXISTS

    SELECT Trip.*
    FROM Trip
    WHERE Trip.AssetManagerName NOT IN (
        SELECT AssetManager.AssetManagerName
        FROM AssetManager
    )
    

    For an interesting analysis of these options and performace considerations have a read over this article: NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL