I've been trying to find duplicates in a column called "text": any cells that match each other by at least 90% should be treated as duplicates, only the first row of each duplicate group kept, and the remaining duplicate rows removed. The result should then be written to a new CSV file.
I have tried to do so with the MWE below; however, it seems to create two new columns called "Matches" and "Combined" that I don't need, since the eventual goal is just a new CSV without the duplicates, keeping only the first occurrence of each.
import pandas as pd
from dedupe_FuzzyWuzzy import deduplication
df = pd.read_csv('/path/input.csv')
# normal duplication drop
df = df.drop_duplicates(subset='text', keep='first')
# threshold drop
df_final = deduplication.deduplication(df, ['text'], threshold=90)
# send output to csv
df_final.to_csv('/path/deduplicated.csv',index=False)
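One workaround might be to drop those two columns again before the to_csv call (assuming they really are named "Matches" and "Combined"):
# hypothetical cleanup based on the column names described above
df_final = df_final.drop(columns=['Matches', 'Combined'])
but I'd prefer an approach that doesn't create them in the first place.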
The following code, with a basic example, uses rapidfuzz to mark fuzzy-matched duplicates in a text column of a pandas DataFrame. Note: a higher threshold means stricter matching. The code walks through the list of text values from the column, checks each entry against all later ones for fuzzy duplicates, and marks the duplicates for removal. The resulting boolean list is then used as a mask to keep only the rows not marked for removal.
import pandas as pd
import rapidfuzz

txt = ['abc', 'abcdx', 'xyx', 'abcdef', 'xyxg', 'abcde', 'abxdx', 'xyxk', 'abcdex', '1234', 'abxdx2', '12345']
vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
df = pd.DataFrame({'col1': txt, 'col2': vals})

def mark_dupes(txt_col, threshold=90):
    txt_list = txt_col.to_list()
    marked = [True] * len(txt_list)  # True = keep, False = remove
    for i, txt in enumerate(txt_list):
        if not marked[i]:  # don't check duplicates of rows already marked for removal
            continue
        for j, other in enumerate(txt_list[i+1:]):
            if marked[i+j+1]:  # only compare against rows not already marked for removal
                # fuzz.ratio returns 0 when the score is below score_cutoff
                if rapidfuzz.fuzz.ratio(txt, other, score_cutoff=threshold):
                    marked[i+j+1] = False  # mark for removal
    return marked

chk = mark_dupes(df['col1'], threshold=80)
dfx = df[chk]
print(dfx)
which prints
      col1  col2
0      abc     1
1    abcdx     2
2      xyx     3
3   abcdef     4
9     1234    10
10  abxdx2    11
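Applied to the setup in the question (assuming the input file really has a "text" column), the same keep-mask can filter the real DataFrame and write the new CSV, e.g.:
df = pd.read_csv('/path/input.csv')
keep = mark_dupes(df['text'], threshold=90)
df[keep].to_csv('/path/deduplicated.csv', index=False)
Be aware that this compares every row against every later row, so the runtime grows quadratically with the number of rows.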