sqlsql-servert-sql

Сomplex delete duplicates


I have two tables: Players and Games (below). I need to delete the duplicate player names using a query with a Using query row_number() function, but I need to leave the records that are in the Game table. If there are multiple records, then any of them can be left. If there are no records in Games table, then any of them can be left.

For example:

Players

ID Name
1 Oliver --> should be deleted (because ID=2 exists in Games table)
2 Oliver
3 Jack
4 Jack --> should be deleted (there is no Jack records in Games table, may leave any of them)
5 Jack --> should be deleted (there is no Jack records in Games table, may leave any of them)
6 Harry
7 Harry --> should be deleted (both records ID=6,7 are in Games table, may can delete either one)

Games

ID PlayerID
1 2
2 6
3 7

Result should be:

Players

ID Name
2 Oliver
3 Jack
6 Harry

Solution

  • The usual pattern for deleting duplicates is to number the rows having the same name and delete all rows having row number > 1.

    However, you have the additional criteria that you wish to retain all player rows having games. This can be done by:

    1. Identify all Player rows having any associated Games rows.
    2. Number the Player rows having the same name, but place those having games at the top of the number sequence.
    3. Delete all Player rows that both have row number > 1 and do not have an associated Games row.
    WITH PlayersHasGames AS (
        SELECT
            *,
            CASE WHEN EXISTS (SELECT * FROM Games G WHERE G.PlayerID = P.ID)
                THEN 1 ELSE 0 END
                AS HasGames
        FROM Players P
    ),
    NumberedPlayers AS (
        SELECT
            *,
            ROW_NUMBER() OVER(
                PARTITION BY Name
                ORDER BY HasGames DESC, ID  -- Players with games first
                ) AS RowNum
        FROM PlayersHasGames
    )
    DELETE
    FROM NumberedPlayers
    WHERE RowNum > 1
    AND HasGames = 0
    

    Results:

    ID Name
    2 Oliver
    3 Jack
    6 Harry
    7 Harry

    See this db<>fiddle for a demo.

    If you wish to delete one of the "Harry" rows, even though both have associated Games rows, you can exclude the AND HasGames = 0 condition. Note that this would violate referential integrity, as one of the Games rows would be an orphan. See this db<>fiddle.

    Renumbering the orphan rows would require an extra step to reassign PlayerId values after numbering the Players, but before deleting the duplicates.

    DECLARE @PlayerMapping TABLE (
        FromPlayerID INT,
        ToPlayerID INT
    )
    
    ;WITH PlayersHasGames AS (
        SELECT
            *,
            CASE WHEN EXISTS (SELECT * FROM Games G WHERE G.PlayerID = P.ID)
                THEN 1 ELSE 0 END
                AS HasGames
        FROM Players P
    ),
    NumberedPlayers AS (
        SELECT
            *,
            ROW_NUMBER() OVER(
                PARTITION BY Name
                ORDER BY HasGames DESC, ID  -- Players with games first
                ) AS RowNum
        FROM PlayersHasGames
    )
    INSERT INTO @PlayerMapping
    SELECT FromPlayer.ID AS FromPlayerID, ToPlayer.ID AS ToPlayerID
    FROM NumberedPlayers FromPlayer
    JOIN NumberedPlayers ToPlayer
        ON ToPlayer.Name = FromPlayer.Name
        AND ToPlayer.RowNum = 1  -- Keeper player
    WHERE FromPlayer.RowNum > 1  -- Any player to be deleted (and mapped)
    
    UPDATE G
    SET PlayerId = PM.ToPlayerID
    FROM Games G
    JOIN @PlayerMapping PM ON PM.FromPlayerID = G.PlayerID
    
    DELETE P
    FROM Players P
    JOIN @PlayerMapping PM ON PM.FromPlayerID = P.ID
    

    Resulting Player Mapping:

    FromPlayerID ToPlayerID
    1 2
    4 3
    5 3
    7 6

    Updated Players:

    ID Name
    2 Oliver
    3 Jack
    6 Harry

    Updated Games (showing player name lookup):

    ID PlayerID PlayerName
    1 2 Oliver
    2 6 Harry
    3 6 Harry

    See this db<>fiddle