python information-retrieval fuzzy-search whoosh

Fuzzy String Searching with Whoosh in Python


I've built up a large database of banks in MongoDB, and I can easily take this information and build Whoosh indexes from it. I'd like to be able to match bank names such as 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy searches, but cannot achieve a match on the above:

from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))
ix = create_in("indexdir", schema)
writer = ix.writer()

test_items = [u"Eagle Bank and Trust Company of Missouri"]

# Add each bank name to the index as its own document
for item in test_items:
    writer.add_document(name=item)
writer.commit()

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)

gives me:

<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>

Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?


Solution

  • You could match Co with Company using fuzzy search in Whoosh, but you shouldn't, because the difference between the two is large. Co is about as similar to Company as Be is to Beast, or ny is to Company, so you can imagine how large and how noisy the search results would be.

    However, if you want to match Compan, compani, or Companee to Company, you can do so with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:

    maxdist – The maximum edit distance from the given text.

    from whoosh.query import FuzzyTerm

    class MyFuzzyTerm(FuzzyTerm):
        # Same as FuzzyTerm, but with maxdist defaulting to 2
        def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
            super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
    

    Then:

    qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)
    

    You could match Co with Company by setting maxdist to 5, but as I said, this gives bad search results. I suggest keeping maxdist between 1 and 3.

    If you want to match linguistic variations of a word, you are better off using whoosh.query.Variations.

    Note: older Whoosh versions have minsimilarity instead of maxdist.
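
To make the maxdist numbers above concrete, here is a minimal sketch that computes the plain Levenshtein edit distance between the example words and Company. The edit_distance helper is hypothetical and written only for illustration; it is not part of Whoosh, and Whoosh's own fuzzy matching may differ in details (for example, prefixlength requires the leading characters to match exactly).

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# Compare the misspellings and the abbreviation against the indexed term
for word in ["compan", "compani", "companee", "co"]:
    print(word, edit_distance(word, "company"))
```

The misspellings sit within one or two edits of Company, while Co is five edits away, the same distance as ny. A maxdist large enough to accept Co would therefore accept a great many unrelated short terms as well.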