I've got a table that has duplicate data that needs to be cleaned up. Consider the following example:
CREATE TABLE #StackOverFlow
(
[ctrc_num] int,
[Ctrc_name] varchar(6),
[docu] bit,
[adj] bit,
new bit,
[some_date] datetime
);
INSERT INTO #StackOverFlow
([ctrc_num], [Ctrc_name], [docu], [adj], [new], [some_date])
VALUES
(12345, 'John R', null, null, 1, '2023-12-11 09:05:13.003'),
(12345, 'John R', 1, null, 0, '2023-12-11 09:05:12.987'),
(12345, 'John R', null, null, 1, '2023-12-11 09:05:12.947'),
(56789, 'Sam S', null, null, 1, '2023-12-11 09:05:13.003'),
(56789, 'Sam S', null, null, 1, '2023-12-11 09:05:12.987'),
(56789, 'Sam S', 1, null, 0, '2023-12-11 09:05:12.947'),
(78945, 'Pat P', null, null, 1, '2023-12-11 09:05:13.003'),
(78945, 'Pat P', null, null, 1, '2023-12-11 09:05:12.987'),
(78945, 'Pat P', null, null, 1, '2023-12-11 09:05:12.947');
This gives me:
[ctrc_num] [Ctrc_name] [docu] [adj] [new] [some_date]
-----------------------------------------------------------------------
12345 John R NULL NULL 1 2023-12-11 09:05:13.003
12345 John R 1 NULL 0 2023-12-11 09:05:12.987
12345 John R NULL NULL 1 2023-12-11 09:05:12.947
56789 Sam S NULL NULL 1 2023-12-11 09:05:13.003
56789 Sam S NULL NULL 1 2023-12-11 09:05:12.987
56789 Sam S 1 NULL 0 2023-12-11 09:05:12.947
78945 Pat P NULL NULL 1 2023-12-11 09:05:13.003
78945 Pat P NULL NULL 1 2023-12-11 09:05:12.987
78945 Pat P NULL NULL 1 2023-12-11 09:05:12.947
What I need to do is delete from the table duplicates. If new is 0, delete the records where new is 1. If all records have new = 1 keep the newest record and delete the older ones.
The result should look like this:
[ctrc_num] [Ctrc_name] [docu] [adj] [new] [some_date]
-----------------------------------------------------------------------
12345 John R 1 NULL 0 2023-12-11 09:05:12.987
56789 Sam S 1 NULL 0 2023-12-11 09:05:12.947
78945 Pat P NULL NULL 1 2023-12-11 09:05:13.003
I've tried ROW_NUMBER:
;WITH RankedByDate AS
(
SELECT
ctrc_num, Ctrc_name,
docu, adj, new, some_date,
ROW_NUMBER() OVER (PARTITION BY Ctrc_num, Ctrc_name, [docu],[adj], [new]
ORDER BY some_date DESC) AS rNum
FROM
#StackOverFlow
)
SELECT *
FROM RankedByDate
This separates the ones with new = 0, but I still have the ones with new = 1 that are ordered.
Grouping gives me the records that are duplicated but no way to delete the ones needed to be deleted:
SELECT [ctrc_num]
,[Ctrc_name]
,[docu]
,[adj]
,[new]
FROM
#StackOverFlow
GROUP BY
[ctrc_num]
,[Ctrc_name]
,[docu]
,[adj]
,[new]
HAVING
COUNT(*) > 1
Break the problem down into it's parts
"If new is 0, delete the records where new is 1"
delete from #StackOverFlow
where [new] = 1
and [ctrc_num] in (select [ctrc_num]
from #StackOverFlow
where [new] = 0);
"If all records have new = 1 keep the newest record and delete the older ones" Use a CTE to add a row number based on the date and partitioned by the [ctrc_num] such that the "first" record in each group is the one you want to keep - if there is only 1 row in a group that's the one you want to keep anyway. Then delete everything else
;with cte as
(
select
[ctrc_num]
,ROW_NUMBER() OVER (PARTITION BY [ctrc_num] ORDER BY [ctrc_num], [some_date] DESC) as rw
from #StackOverFlow
)
DELETE FROM cte where rw <> 1;