sqlduplicatessnowflake-cloud-data-platformwindow-functions

Deduplicate the data using ROW_NUMBER in Snowflake when all the columns have the same data


project_id calendar_date has_sow sow_activity_status sow_review_activity_actual_date
P1 2024-03-01 null Not Completed null
P1 2024-03-01 null Not Completed null
P1 2024-03-01 null Not Completed null
P1 2024-03-01 null Not Completed null
P1 2024-04-01 null Not Completed null
P1 2024-04-01 null Not Completed null
P1 2024-04-01 null Not Completed null
P1 2024-04-01 null Not Completed null
P2 2024-03-01 null Completed 2023-12-20
P2 2024-03-01 null Completed 2023-12-20
P2 2024-03-01 null Completed 2023-12-20
P2 2024-03-01 null Completed 2023-12-20
P2 2024-04-01 null Completed 2023-12-20
P2 2024-04-01 null Completed 2023-12-20
P2 2024-04-01 null Completed 2023-12-20
P2 2024-04-01 null Completed 2023-12-20

The above table is the source data, the final expected output is :

project_id calendar_date has_sow sow_activity_status sow_review_activity_actual_date
P1 2024-03-01 null Not Completed null
P1 2024-04-01 null Not Completed null
P2 2024-03-01 null Completed 2023-12-20
P2 2024-04-01 null Completed 2023-12-20

The expected output can be achieved using distinct query, i.e.

select 
    distinct project_id, calendar_date, has_sow, sow_activity_status, sow_review_activity_actual_date
from source ;

Would like to achieve the same using ROW_NUMBER since the data is around a Million records. So will be partitioning the ROW_NUMBER window function by project_id, calendar_date but don't know on which column it should be ordered by because all the remaining columns will have the same data at the grain of the project_id and calendar_date ?

Need advice in figuring out the order by clause for ROW_NUMBER function.


Solution

  • Using DISTINCT here seems like the right approach to me. I don't recommend using ROW_NUMBER to remove your duplicates. As an alternative, you could use GROUP BY:

    SELECT project_id, calendar_date, has_sow, sow_activity_status, sow_review_activity_actual_date
    FROM source
    GROUP BY project_id, calendar_date, has_sow, sow_activity_status, sow_review_activity_actual_date;