sqlsql-servert-sql

Get duplicate rows from one table where an additional key does not appear in a related table


Single sentence to explain the scenario. I work for a pharmacy chain, our patient management software supplier allows the creation of duplicate patients without check, I need to find them and pass to each branch to remove.

I need to get a list of all patients in a branch that is a duplicate and doesn't appear in a separate table which shows they have been given a script. i.e. if they exist more than once and never got a script they are an empty patient record and need deleted.

Simplified tables

Patient

PatientId CardId Surname Forenames DoB Postcode
123 xyz Doe John 1970-01-01 AB312EX
234 abc Smith Alan 1950-02-01 AB311AB
456 fgh Doe John NULL NULL
789 uvw Doe John NULL NULL
098 qwe Jones Tom NULL NULL

Script

ScriptId PatientId DateSupplied
1 234 2025-03-20
2 123 2025-03-21
3 098 2025-03-24

As you see the branch can provide as much patient info as they choose, the patientId and CardId is created by the system, NULLs in columns such as DoB or Postcode, won't prevent a script from being generated.

In that example, I would need only John Doe to appear for PatientId 456 and 789 as they are identical across the last 4 columns and their patientId doesn't appear in the script table.

This is what I thought would be the first part, and it seems to show correctly all the patients appearing more than once

WITH cte AS (
    SELECT p.Surname, p.ForeNames, p.DateOfBirth, p.PostCode, COUNT(*) AS [Records]
    FROM pmr.Patient p
    WHERE p.Branch = 9
    GROUP BY p.Surname, p.ForeNames, p.DateOfBirth, p.PostCode
    HAVING COUNT(*) > 1
)
SELECT * FROM cte ORDER BY Surname, Forenames

After this I add the details I need in the select and join back to the cte. I need the CardId to allow the branch to find the correct record and I need the patientId to link to the Script table but can't use them in a group by count as it is unique for each patient record.

WITH cte AS (
    SELECT p.Surname, p.ForeNames, p.DateOfBirth, p.PostCode, COUNT(*) AS [Records]
    FROM pmr.Patient p
    WHERE p.Branch = 9
    GROUP BY p.Surname, p.ForeNames, p.DateOfBirth, p.PostCode
    HAVING COUNT(*) > 1
)
SELECT 
p.CardId, p.Surname, p.ForeNames, p.DateOfBirth, p.PostCode
FROM pmr.Patient p
    INNER JOIN cte ON 
        cte.Surname = p.Surname AND
        cte.Forenames = p.ForeNames AND 
        cte.DateOfBirth = p.DateOfBirth AND
        cte.Postcode = p.PostCode
    ORDER BY p.Surname, p.Forenames

This however returns some rows from the first query but not all of them.

Can someone help point me in the right direction and advise how I would then only return the duplicate if it's patientId isn't in the script table.


Solution

  • Apologies, I wasn't clear so I created a debate around data validation, as I say my hands are tied to the branch system we are supplied, and being NHS, the options are limited.

    Thanks to the suggestion from @ThomA, I used a windowed function Count to get what I was looking for, as mentioned from Alan, the second part was a simple join with a null check so here is the final query to obtain the list.

    WITH cte AS (
    SELECT PatientId, CardId, Surname, Forenames, DateOfBirth, PostCode, Branch,
    COUNT(CardId) OVER(PARTITION BY Surname, ForeNames, DateOfBirth, PostCode) as [records]
    FROM pmr.Patient
    WHERE Branch = 9
    
    )
    SELECT cte.* FROM cte
    LEFT JOIN pmr.[Session] s ON cte.PatientId = s.Patient AND cte.Branch = s.Branch
    WHERE records > 1 AND s.Patient IS NULL
    ORDER BY Surname, Forenames