pythonfuzzy-searchrapidfuzz

How to make fuzzy search between lists showing matches and not found elements?


I'm trying to make a fuzzy match for the values in list to_search. Search each value in to_search within choices list and show the corresponding item from result list. Like a MS Excel VLookUp, but with fuzzy search.

This is my current code that almost print the correct output for different values, but for the values of to_search that don't have any similarity within choices, I´d like to show in output Not Found, but currently I'm getting some other output.

The search I´m looking for is by prefix, this is in the same order. For example the value 38050 appears with this match ('358', 72.0, 8) because 3, 5 and 8 are present in 38050, but for me is not of interest since 3, 5 and 8 are in different order. Would be a match for me if the choice found would be 380XX, that at least has similarity in prefix compared with 38050. I hope you understand my explanation.

from rapidfuzz import process, fuzz

choices = [
        '237','1721','124622','334','124624','124','1246','1876','358',
        '33751','33679','599','61','230','31','65','1721','1','124623'
    ]

result = [
            'NAD','ATE','STA','SSI','GYP','RIC','EEC','AND','GIU','ANC',
            'PAI','GAR','TAL','ANI','LAN','TRI','GDO','MAR','EDE'
        ]

to_search = ['18763044','187635','23092','3162','38050','33','49185','51078','1246','1721']

for element in to_search:
    match =  process.extractOne(element, choices, scorer=fuzz.WRatio)
    print(element,result[match[2]],'         ## ',match)

Current output

>>>
18763044    AND         ##  ('1876', 90.0, 7)
187635      AND         ##  ('1876', 90.0, 7)
23092       ANI         ##  ('230', 90.0, 13)
3162        LAN         ##  ('31', 90.0, 14)
38050       GIU         ##  ('358', 72.0, 8) // This should be marked as NOT FOUND
33          SSI         ##  ('334', 90.0, 3)
49185       MAR         ##  ('1', 90.0, 17)  // This should be marked as NOT FOUND
51078       MAR         ##  ('1', 90.0, 17)  // This should be marked as NOT FOUND
1246        EEC         ##  ('1246', 100.0, 6)
1721        ATE         ##  ('1721', 100.0, 1)

The output I'm trying to get:

18763044    AND
187635      AND
23092       ANI
3162        LAN
38050       NOT FOUND
33          SSI
49185       NOT FOUND
51078       NOT FOUND
1246        EEC
1721        ATE

In table format for easy understanding of inputs and output. Thanks in advance

enter image description here


Solution

  • You can create you own scorer:

    def my_scorer(query,choice,**kwargs):
        # default score
        score=fuzz.WRatio(query,choice)
    
        is_prefix=False
    
        # if choice is a prefix of query
        if choice in query and query.index(choice)==0:
            is_prefix=True
    
        # if query is a prefix of choice
        if query in choice and choice.index(query)==0:
            is_prefix=True
    
        if not is_prefix:
            score=-1
    
        return score
    

    And pass it to process.extractOne along with score_cutoff=0 to ignore results with score lower than 0:

    match = process.extractOne(element, choices, scorer=my_scorer, score_cutoff=0) 
    if match:
      print(element, result[match[2]])
    else:
      print(element, 'NOT FOUND')