Tags: python, pandas, dataframe, anti-join

SQL query equivalent in Python: delete only one identical record in a DataFrame


Problem statement: delete records from one table [MDF] based on a condition [OP_DIRECTIVE = 'D'] in a different table [INC]. Both tables have identical columns.

I am looking for the Python equivalent of a SQL query that deletes only one identical record, even if multiple records match the DELETE condition.

I've already written a working version in SQL. The approach: copy unmatched rows to a work table, then truncate the original table and replace it with the contents of the work table. One way to identify unmatched rows is to tag each input row within a set of duplicates with a unique number, something like this:

INSERT work_table SELECT MI.col1, MI.col2, ...
FROM 
  (SELECT M.*,
   ROW_NUMBER() OVER (PARTITION BY <join cols> ORDER BY <some col(s)>) AS ROWNUM
   FROM MORTALITY M) MI
LEFT JOIN 
  (SELECT I.*, 
   ROW_NUMBER() OVER (PARTITION BY <join cols> ORDER BY <some col(s)>) AS ROWNUM
   FROM INC_MORTALITY I
   WHERE OP_DIRECTIVE='D') INC
ON MI.join_col1 = INC.join_col1
AND MI.join_col2 = INC.join_col2
/* ... join conditions on all the columns except OP_DIRECTIVE */
AND MI.ROWNUM = INC.ROWNUM
WHERE INC.ROWNUM IS NULL /* "anti-join" keeps only unmatched rows */
;
DELETE FROM MORTALITY;
INSERT MORTALITY SELECT * FROM work_table;

What I've tried:

import os
import time
import pandas as pd

filePath = '/Users/test_files'
timestr = time.strftime("%Y-%m-%d-%H%M%S")
fileName = 'left_join' + timestr + '.txt'

if os.path.exists(filePath):
    MDF = pd.read_csv(os.path.join(filePath, 'mdf.txt'), sep='|')
    INC = pd.read_csv(os.path.join(filePath, 'inc.txt'), sep='|')
    INC_D = INC[INC['op_directive'] == 'D']  # keep only the delete directives

    # Left join on every shared column except op_directive, then keep
    # only the rows that found no match (anti-join)
    result = MDF.merge(
        INC_D,
        on=['data_source', 'dd_imp_flag', 'dob', 'dod', 'death_verification',
            'gender_probability', 'gender', 'token_1', 'token_2', 'token_4',
            'token_5', 'token_7', 'token_16', 'token_key'],
        how='left',
        suffixes=('', '_delme'))
    cols = result.columns.difference(MDF.columns)
    result = result.loc[result[cols].isnull().all(axis=1), MDF.columns.tolist()]

    result.to_csv(os.path.join(filePath, fileName), sep='|', index=False)  # remove header=None if header is needed
    print("Data export successful.")
else:
    print("File path does not exist.")

But this deletes all the records matching the 'D' indicator. Clearly I am missing the ROW_NUMBER logic here, so I am wondering how to achieve it in Python.
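A minimal reproduction of the over-deletion (toy data with an illustrative column name, not the real files): when two identical MDF rows both match a single 'D' row in INC, the plain anti-join drops both instead of leaving one behind.

```python
import pandas as pd

# Two identical "keep" rows in MDF, one matching delete row in INC
mdf = pd.DataFrame({'k': ['a', 'a'], 'op_directive': ['A', 'A']})
inc_d = pd.DataFrame({'k': ['a'], 'op_directive': ['D']})

# Same anti-join pattern as above: left merge, then keep unmatched rows
result = mdf.merge(inc_d, on=['k'], how='left', suffixes=('', '_delme'))
cols = result.columns.difference(mdf.columns)
result = result.loc[result[cols].isnull().all(axis=1), mdf.columns.tolist()]

# Both 'a' rows matched the single delete row, so result is empty --
# one copy should have survived
```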

MDF (before):

data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-09-01|3|1.0|F|Vznnb7W7VcSvM6bdKbDLyKXcv/UK9FYxfQEWSf7WU1s=|2ye4lajQ4v2lzl5P0sJnUExn8uMMjjWw3vInwUFjx50=|geZFT7Ea5O8rwGwJi17dL9EggYY+ahpfEv5hqz8f/K4=|cT8lopT3v+qvNykrv5N0/hNQdVzEBWt0wz8V01L197Q=|fSkPNkTewOiC+o7ahtH/6YvOx6MJ2Tr36gHyZYBFiNU=|cyusBFir8H19NvWjBYSriCIivL2KVqzFtJkSWSciYFM=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY

INC

data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|D|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|D|0|1940-12-01|1859-09-01|3|1.0|F|Vznnb7W7VcSvM6bdKbDLyKXcv/UK9FYxfQEWSf7WU1s=|2ye4lajQ4v2lzl5P0sJnUExn8uMMjjWw3vInwUFjx50=|geZFT7Ea5O8rwGwJi17dL9EggYY+ahpfEv5hqz8f/K4=|cT8lopT3v+qvNykrv5N0/hNQdVzEBWt0wz8V01L197Q=|fSkPNkTewOiC+o7ahtH/6YvOx6MJ2Tr36gHyZYBFiNU=|cyusBFir8H19NvWjBYSriCIivL2KVqzFtJkSWSciYFM=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|D|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY

MDF after updates (expected output):

data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY

Solution

  • With your explanation in the comments and the sentence "Warning: please ensure you only delete, or mark as deleted, one record, even if more than one historical record fully matches this new delete record", I believe a simple way to achieve your result is to use duplicated(), which marks as True all duplicated rows starting from the second occurrence:

    result = result.loc[result[cols].isnull().all(axis=1)
                        | result.duplicated(subset=MDF.columns, keep='first'),  # add this condition
                        MDF.columns.tolist()]
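Note that duplicated(keep='first') always spares all copies beyond the first, so it removes exactly one row per duplicate group no matter how many 'D' rows INC contains. A closer analogue of the SQL ROW_NUMBER approach, which removes one MDF row per matching delete row, is to number the rows within each duplicate group with groupby(...).cumcount() and anti-join on the key columns plus that row number. A minimal sketch with toy data and an illustrative column name:

```python
import pandas as pd

# Toy frames standing in for MDF and INC
mdf = pd.DataFrame({'k': ['a', 'a', 'b', 'c', 'c', 'c'],
                    'op_directive': ['A'] * 6})
inc = pd.DataFrame({'k': ['a', 'b', 'c'],
                    'op_directive': ['D', 'D', 'D']})

keys = ['k']  # in the real data: every shared column except op_directive

# ROW_NUMBER() OVER (PARTITION BY keys) equivalent: number rows per group
mdf['rn'] = mdf.groupby(keys).cumcount()
inc_d = inc[inc['op_directive'] == 'D'].copy()
inc_d['rn'] = inc_d.groupby(keys).cumcount()

# Anti-join on keys + rn: each delete row removes exactly one matching row
merged = mdf.merge(inc_d[keys + ['rn']], on=keys + ['rn'],
                   how='left', indicator=True)
result = (merged[merged['_merge'] == 'left_only']
          .drop(columns=['rn', '_merge']))
# One 'a' and one of the three 'c' rows are deleted; 'b' disappears entirely
```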