python pandas fuzzy-search locality-sensitive-hash record-linkage

Pandas fuzzy detect duplicates


How can I use fuzzy matching in pandas to detect duplicate rows efficiently?


How do I find duplicates of one row vs. all the other ones without a gigantic for loop that converts row_i with toString() and then compares it to every other row?


Solution

  • Not pandas-specific, but within the Python ecosystem the dedupe library would seem to do what you want. In particular, it lets you compare each column of a row separately and then combine that information into a single probability score of a match.
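To make the idea of "compare each column separately, then combine into one score" concrete, here is a minimal standard-library sketch. It is not dedupe's API: the sample rows, the `WEIGHTS` values, the 0.85 threshold, and the helper names (`field_similarity`, `row_score`, `find_duplicates`) are all assumptions made up for illustration. As I understand it, dedupe learns how to weigh the columns from examples you label, whereas the weights here are fixed by hand. Grouping rows by a cheap key first (here the zip code, also an assumption) is one simple way to avoid comparing every row against every other row.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample rows; with pandas they could come from df.to_dict("records").
rows = [
    {"name": "John Smith", "city": "New York", "zip": "10001"},
    {"name": "Jon Smith", "city": "New York", "zip": "10001"},
    {"name": "Alice Jones", "city": "Boston", "zip": "02108"},
    {"name": "Jhon Smith", "city": "Newyork", "zip": "10001"},
]

# Assumed hand-picked column weights (they sum to 1); purely illustrative.
WEIGHTS = {"name": 0.6, "city": 0.3, "zip": 0.1}


def field_similarity(a, b):
    """Similarity of two strings in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def row_score(r1, r2):
    """Weighted average of the per-column similarities."""
    return sum(w * field_similarity(r1[col], r2[col]) for col, w in WEIGHTS.items())


def find_duplicates(rows, threshold=0.85, block_key=lambda r: r["zip"]):
    """Return (i, j, score) for row pairs scoring at or above the threshold.

    Rows are only compared within the same block, so the all-pairs loop
    runs per block instead of over the whole data set.
    """
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[block_key(row)].append(i)

    pairs = []
    for indices in blocks.values():
        for i, j in combinations(indices, 2):
            score = row_score(rows[i], rows[j])
            if score >= threshold:
                pairs.append((i, j, round(score, 3)))
    return pairs


duplicate_pairs = find_duplicates(rows)
for i, j, score in duplicate_pairs:
    print(rows[i]["name"], "<->", rows[j]["name"], score)
```

The trade-off with blocking is that two rows that differ in the blocking column are never compared, so the key should be something that true duplicates are very likely to share.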