nlp, string-matching, fuzzy-comparison

Fuzzy string matching in Python for structured strings?


I have a Python implementation of fuzzy matching using the Levenshtein similarity. I'm pretty happy with it but I feel I'm leaving a lot on the table by not considering the structure of the strings.

Here are some examples of matches that are clearly good, but not captured well by Levenshtein:

I think some normalization ahead of using Levenshtein would be good, e.g. replace all & with "and", remove punctuation, etc. I'm not sure I want to jump straight to stop-word removal and lemmatization, but something along those lines.

To avoid re-inventing the wheel, is there an easy way to do this? Or is there an alternative to Levenshtein that addresses these issues (short of something like BERT embeddings)?
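As a rough sketch of the kind of normalization meant here, using only the standard library (the helper name `normalize` and the sample string are made up for illustration, not taken from any real data):

```python
import string

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize(text: str) -> str:
    """Hypothetical helper: lowercase, expand '&', drop punctuation, collapse spaces."""
    text = text.lower()
    text = text.replace("&", " and ")        # must happen before punctuation is stripped
    text = text.translate(_PUNCT_TO_SPACE)   # remaining punctuation becomes whitespace
    return " ".join(text.split())            # collapse runs of whitespace

print(normalize("Johnson & Johnson, Inc."))  # johnson and johnson inc
```

The normalized strings can then be passed to whatever Levenshtein implementation is already in use.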


Solution

  • rapidfuzz.utils.default_process might be an option to consider for preprocessing.

    rapidfuzz.utils.default_process(sentence: str) → str

    This function preprocesses a string by:

    • removing all non alphanumeric characters
    • trimming whitespaces
    • converting all characters to lower case

    PARAMETERS: sentence (str) – String to preprocess

    RETURNS: processed_string – processed string

    RETURN TYPE: str

    https://maxbachmann.github.io/RapidFuzz/Usage/utils.html