sqlsnowflake-cloud-data-platform

Removing 'Duplicate' records where some columns data is null


In my Snowflake SQL query I have an output like below, all other fields contain the same data when grouped by Column A:

Column A Column B Column C Column D Tag
A NULL AAAA NULL 1
A BBBB AAAA CCCC
B DDDD NULL EEEE 3
B FFFF NULL GGGG 4
B HHHH NULL IIII 2
B HHHH JJJJ IIII

The query uses both SELECT DISTINCT and GROUP BY clauses to remove any duplication where it exists. I'm aware that these are not true 'duplicates' but I'd like to remove the rows i've tagged as 1 and 2 as they have NULLS, so my output should look like this:

Column A Column B Column C Column D
A BBBB AAAA CCCC
B DDDD NULL EEEE
B FFFF NULL GGGG
B HHHH JJJJ IIII

I've already tried using ROW_NUMBER(), PARTITION BY, RANK() etc to eliminate the rows i don't require, but as rows tagged 3 and 4 legitimately contain NULLS it also removes them. There are no other fields (for example datetime) that i could use to order by and remove based on them.

A slimmed down example of my current query is:

SELECT
COLUMN A,
COLUMN B,
COLUMN C,
COLUMN D,
COLUMN E,
COLUMN F
FROM
(
    WITH CTE1 AS
    (
        SELECT DISTINCT
        COLUMN A,
        COLUMN B,
        COLUMN C,
        COLUMN D,
        COLUMN E,
        COLUMN F
        FROM
        SOURCE_TABLE WHERE SOURCE = '1'
    ),
    CTE2 AS
    (
        SELECT DISTINCT
        COLUMN A,
        COLUMN B,
        COLUMN C,
        COLUMN D,
        COLUMN G,
        COLUMN H
        FROM
        SOURCE_TABLE WHERE SOURCE = '2'
    ),   
    CTE3 AS
    (
        SELECT DISTINCT
        COLUMN A,
        COLUMN B,
        COLUMN C,
        COLUMN D,
        COLUMN E,
        COLUMN F
        FROM
        SOURCE_TABLE WHERE SOURCE = '3'
    ),  
    BASE_CTE AS
    (
        SELECT DISTINCT
        COLUMN A,
        COLUMN B,
        COLUMN C,
        COLUMN D,
        MAX(COLUMN E) AS COLUMN E,
        LISTAGG(COLUMN F, ', ') AS COLUMN F,
        FROM
        (
            SELECT
            COLUMN A,
            COLUMN B,
            COLUMN C,
            COLUMN D,
            COLUMN E,
            COLUMN F
            FROM
            CTE1
            UNION
            (
                SELECT
                COLUMN A,
                COLUMN B,
                COLUMN C,
                COLUMN D,
                COLUMN E,
                COLUMN F
                FROM
                CTE2
            )
            UNION
            (
                SELECT
                COLUMN A,
                COLUMN B,
                COLUMN C,
                COLUMN D,
                COLUMN E,
                COLUMN F
                FROM
                CTE3
            )
        ) GROUP BY ALL
    )
    SELECT * FROM BASE_CTE
);

Solution

  • You can use the following approach:

    1. Do self left join on Column A (plus add a condition here where Columns B, C, D are not equal).
    2. Use the CASE WHEN to mark rows with 0 when left side of join is a duplicate of the right side; and with 0 otherwise.
    3. For each row, calculate if it's a duplicate at least once (MAX(dedup) OVER(...)), and filter by it (QUALIFY ...)
    4. Finally select only DISTINCT rows to avoid duplications.

    Two notes:

    1. This doesn't work if B, C, D are NULLs and it's the only row for the value of Column A. If that's the problem for you, it's possible to adjust the query. I didn't do it here, because I believe it will make code harder to understand and distract from the main idea.
    2. If you meant some other/additional conditions to consider a row the duplicate, feel free to adjust definition of dedup field.
    WITH data AS (
      SELECT 'A' AS col_a, NULL AS col_b, 'AAAA' AS col_c, NULL AS col_d
      UNION ALL
      SELECT 'A' AS col_a, 'BBBB' AS col_b, 'AAAA' AS col_c, 'CCCC' AS col_d
      UNION ALL
      SELECT 'B' AS col_a, 'DDDD' AS col_b, NULL AS col_c, 'EEEE' AS col_d
      UNION ALL
      SELECT 'B' AS col_a, 'FFFF' AS col_b, NULL AS col_c, 'GGGG' AS col_d
      UNION ALL
      SELECT 'B' AS col_a, 'HHHH' AS col_b, NULL AS col_c, 'IIII' AS col_d
      UNION ALL
      SELECT 'B' AS col_a, 'HHHH' AS col_b, 'JJJJ' AS col_c, 'IIII' AS col_d
      UNION ALL
      SELECT 'C' AS col_a, 'AAAA' AS col_b, 'BBBB' AS col_c, 'CCCC' AS col_d
      UNION ALL
      SELECT 'D' AS col_a, NULL AS col_b, NULL AS col_c, NULL AS col_d
      UNION ALL
      SELECT 'E' AS col_a, NULL AS col_b, NULL AS col_c, NULL AS col_d
      UNION ALL
      SELECT 'E' AS col_a, 'AAAA' AS col_b, 'BBBB' AS col_c, 'CCCC' AS col_d
      ),
    
    processed AS (
        SELECT
          d1.col_a,
          d1.col_b,
          d1.col_c,
          d1.col_d,
          CASE
            WHEN (
                (d1.col_b IS NULL OR d1.col_b = d2.col_b)
                AND (d1.col_c IS NULL OR d1.col_c = d2.col_c)
                AND (d1.col_d IS NULL OR d1.col_d = d2.col_d)
            )
            THEN 1
            ELSE 0
          END AS dedup
          FROM data d1
        LEFT JOIN data d2 ON d1.col_a = d2.col_a AND hash(d1.col_b, d1.col_c, d1.col_d) != hash(d2.col_b, d2.col_c, d2.col_d)
        QUALIFY MAX(dedup) OVER(PARTITION BY d1.col_a, d1.col_b, d1.col_c, d1.col_d) = 0
    )
    
    SELECT DISTINCT
        col_a,
        col_b,
        col_c,
        col_d
    FROM processed
    ORDER BY 1,2,3,4