project_id | calendar_date | has_sow | sow_activity_status | sow_review_activity_actual_date |
---|---|---|---|---|
P1 | 2024-03-01 | null | Not Completed | null |
P1 | 2024-03-01 | null | Not Completed | null |
P1 | 2024-03-01 | null | Not Completed | null |
P1 | 2024-03-01 | null | Not Completed | null |
P1 | 2024-04-01 | null | Not Completed | null |
P1 | 2024-04-01 | null | Not Completed | null |
P1 | 2024-04-01 | null | Not Completed | null |
P1 | 2024-04-01 | null | Not Completed | null |
P2 | 2024-03-01 | null | Completed | 2023-12-20 |
P2 | 2024-03-01 | null | Completed | 2023-12-20 |
P2 | 2024-03-01 | null | Completed | 2023-12-20 |
P2 | 2024-03-01 | null | Completed | 2023-12-20 |
P2 | 2024-04-01 | null | Completed | 2023-12-20 |
P2 | 2024-04-01 | null | Completed | 2023-12-20 |
P2 | 2024-04-01 | null | Completed | 2023-12-20 |
P2 | 2024-04-01 | null | Completed | 2023-12-20 |
The above table is the source data, the final expected output is :
project_id | calendar_date | has_sow | sow_activity_status | sow_review_activity_actual_date |
---|---|---|---|---|
P1 | 2024-03-01 | null | Not Completed | null |
P1 | 2024-04-01 | null | Not Completed | null |
P2 | 2024-03-01 | null | Completed | 2023-12-20 |
P2 | 2024-04-01 | null | Completed | 2023-12-20 |
The expected output can be achieved using distinct query, i.e.
select
distinct project_id, calendar_date, has_sow, sow_activity_status, sow_review_activity_actual_date
from source ;
Would like to achieve the same using ROW_NUMBER since the data is around a Million records. So will be partitioning the ROW_NUMBER window function by project_id, calendar_date but don't know on which column it should be ordered by because all the remaining columns will have the same data at the grain of the project_id and calendar_date ?
Need advice in figuring out the order by clause for ROW_NUMBER function.
Using DISTINCT
here seems like the right approach to me. I don't recommend using ROW_NUMBER
to remove your duplicates. As an alternative, you could use GROUP BY
:
SELECT project_id, calendar_date, has_sow, sow_activity_status, sow_review_activity_actual_date
FROM source
GROUP BY project_id, calendar_date, has_sow, sow_activity_status, sow_review_activity_actual_date;