Problem-stmt: To Delete a column from a table[MDF] based on a condition[OP_DIRECTIVE = 'D'] from a different table[INC]. Both the tables have identical columns.
I am looking for a SQL Equivalent Query in Python to Delete one identical record even if there are multiple records match the DELETE condition
I've written the same in SQL [working]:- Approach to my solution: Copy unmatched rows to a work table, then truncate the original table and replace with contents of the work table. One way to identify unmatched rows would be to tag each of the input rows in a set of duplicates with a unique number, something like this:
INSERT work_table SELECT MI.col1, MI.col2, ...
FROM
(SELECT M.*,
ROW_NUMBER() OVER (PARTITION BY <join cols> ORDER BY <some col(s)>) AS ROWNUM
FROM MORTALITY M) MI
LEFT JOIN
(SELECT I.*,
ROW_NUMBER() OVER (PARTITION BY <join cols> ORDER BY <some col(s)>) AS ROWNUM
FROM INC_MORTALITY I
WHERE OP_DIRECTIVE='D') INC
ON MI.join_col1 = INC.join_col1
AND MI.join_col2 = INC.join_col2
... all the columns except for 'OP_DIRECTIVE'
AND MI.ROWNUM = INC.ROWNUM
WHERE INC.ROWNUM IS NULL /* "anti-join" keeps only unmatched rows */
;
DELETE FROM MORTALITY;
INSERT MORTALITY SELECT * FROM work_table;
What i've tried:
import os
import time
import pandas as pd
filePath = '/Users/test_files'
timestr = time.strftime("%Y-%m-%d-%-H%M%S")
fileName = ‘left_join' + timestr + '.txt'
if os.path.exists(filePath):
MDF = pd.read_csv(“mdf.txt", sep='|')
INC = pd.read_csv(“inc.txt”, sep='|')
result = MDF.merge(
INC_D,
on=['data_source','dd_imp_flag','dob','dod','death_verification','gender_probability','gender','token_1','token_2','token_4','token_5','token_7','token_16','token_key'],
how = 'left',
suffixes=('', '_delme'))
cols = result.columns.difference(MDF.columns)
result = result.loc[result[cols].isnull().all(axis=1), MDF.columns.tolist()]
result.to_csv(os.path.join(filePath, fileName), sep="|", index=False) # remove header=None if header is needed
print("Data export successful.")
else:
print("File path does not exist.")
But this deletes all the records matching 'D' as an indicator, clearly i am missing ROW_NUMBER
here, so i am wondering how to achieve it in python
MDF-previously
data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-09-01|3|1.0|F|Vznnb7W7VcSvM6bdKbDLyKXcv/UK9FYxfQEWSf7WU1s=|2ye4lajQ4v2lzl5P0sJnUExn8uMMjjWw3vInwUFjx50=|geZFT7Ea5O8rwGwJi17dL9EggYY+ahpfEv5hqz8f/K4=|cT8lopT3v+qvNykrv5N0/hNQdVzEBWt0wz8V01L197Q=|fSkPNkTewOiC+o7ahtH/6YvOx6MJ2Tr36gHyZYBFiNU=|cyusBFir8H19NvWjBYSriCIivL2KVqzFtJkSWSciYFM=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
INC
data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|D|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|D|0|1940-12-01|1859-09-01|3|1.0|F|Vznnb7W7VcSvM6bdKbDLyKXcv/UK9FYxfQEWSf7WU1s=|2ye4lajQ4v2lzl5P0sJnUExn8uMMjjWw3vInwUFjx50=|geZFT7Ea5O8rwGwJi17dL9EggYY+ahpfEv5hqz8f/K4=|cT8lopT3v+qvNykrv5N0/hNQdVzEBWt0wz8V01L197Q=|fSkPNkTewOiC+o7ahtH/6YvOx6MJ2Tr36gHyZYBFiNU=|cyusBFir8H19NvWjBYSriCIivL2KVqzFtJkSWSciYFM=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|D|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
MDF-after-updates-expected output
data_source|op_directive|dd_imp_flag|dod|dob|death_verification|gender_probability|gender|token_1|token_2|token_4|token_5|token_7|token_16|token_key
OBIT^SSA|A|1|1931-12-06|1978-03-31|5|0.6735|M|3i5HbesGaxZKeHTAzQkeDskr3YTEyMhcm2zpOQUexog=|UqskLHepjFVSIYGTlpsezOi30eTDh4VrX9H87ynifX6=|6E8hQBwm9Ylwszv6LJwyGN1TF18y8hRubFHe4pLwE03=|SoU4pSpEFZhtUROME0rFwlqnRDb5gfHlcCnlTZLuPQv=|yc499QG3ItyqRtqr8bKFtZ4WRaOBwAZzP5Pmd1ChTUF=|zJBxzxwqZVY66finpsmtRfuzBqeQ2N0FhMGyWmoxB07=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
SSA|A|0|1940-12-01|1859-10-01|3|0.0|F|4pxtVDIKcDdiSZqgMNlI5rILQCmm0RhgScJ2E84+BwI=|KyNwahEN6lCvxGBxAOXjYO/QM0Z0QcfI7kPtcEITS4s=|wzyHav4A370qgBk8wPn2AaJyMHMtdFJDCTFhLog9wkI=|hohND7ZFlO9ug14Vei2ESXNy9eqYT47DbiI9J2v+ljQ=|8Plp87L0cC6gdlVbaE0YYzSoe46oIbR/YccdfFGtgd8=|Sb6pUg1X7R7nJONwRrMbWYZ8rMi2TRSkriYHawx2vNE=|cigna-datavant_TOKEN_ENCRYPTION_KEY
With your explanation in the comments and the sentence "Warning: please ensure you only delete, or mark as deleted, one record, even if more than one historical record fully matches this new delete record", I believe a simple way to achieve your result is to use duplicated
that would mark as True
all duplicated rows, starting the second duplicates
result.loc[result[cols].isnull().all(axis=1)
|result.duplicated(subset=MDF.columns, keep='first'), # add this condition
MDF.columns.tolist()]