python performance fuzzy-search fuzzy-comparison

Fuzzy comparison of strings in lists of huge length (taking into account performance)

I have two lists:

The first list I get from the database is the names of various companies (can be written in uppercase, lowercase or a combination)

list_from_DB = ["Reebok", "MAZDA", "PATROL", "AsbEngland-bank", "Mazda INCC", "HIGHWAY lcc", "mazda", "015 Amazon", ......]

There are about 400,000 items in this list.

The second list I get the by parsing text from the user (there can be absolutely any words, numbers, signs, etc.)

list_from_user = ['ts', '$243', 'mazda', 'OFFICERS', 'SIGNATURE', 'Date:07/20/2022', 'Wilson', 'Bank', .......]

There are about 1000 items in this list.

What I need to do is find which items from list_from_user are in list_from_DB and display them in the order of the greatest similarity. As you can see below, the items in the two lists may be identical, or they may differ in spelling.

Output

["mazda", "MAZDA", "Mazda INCC", "AsbEngland-bank"]

What I do: yes, I know about fuzzy character matching libraries, I use rapidfuzz.

res = []
for e in list_from_user:
    r = rapidfuzz.process.extract_iter(e, list_from_DB, processor=str.lower, scorer=rapidfuzz.fuzz.ratio, score_cutoff=95)
    res += r

Yes, the result is working, but very long, about 30 seconds, since the loop must perform 1000 * 400.000 = 400.000.000 operations.

Therefore, the question is the following: is it possible to solve this problem without enumeration of all options, but in some other way? (I'm not against the method with enumeration of all options, but if it fits in time)

My time target is 3 seconds max.

Solution

If the result of what you are currently doing meets your needs, then cdist is a better choice.

The code below is how to save everything we can get from cdist.

import numpy as np
import rapidfuzz


def find_fuzzy(list_from_user, list_from_DB, score_cutoff: int):
    score_matrix = rapidfuzz.process.cdist(
        list_from_user,
        list_from_DB,
        processor=str.lower,
        scorer=rapidfuzz.fuzz.ratio,
        dtype=np.uint8,  # Output the score as uint8, which is faster.
        workers=-1,  # Use multithreading. -1 means use all cores.
        score_cutoff=score_cutoff,
    )

    results = []
    user_indices, db_indices = np.nonzero(score_matrix)
    for user_index_of_match, db_index_of_match in zip(user_indices, db_indices):
        results.append(
            {
                "user_index_of_match": user_index_of_match,
                "db_index_of_match": db_index_of_match,
                "user_item_of_match": list_from_user[user_index_of_match],
                "db_item_of_match": list_from_DB[db_index_of_match],
                "score_of_match": score_matrix[user_index_of_match, db_index_of_match],
            }
        )
    return results