I've built up a large database of banks in MongoDB. I can easily take this information and create indexes from it in Whoosh. For example, I'd like to be able to match the bank names 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy searches, but cannot achieve a match on the above:
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
schema = Schema(name=TEXT(stored=True))
# The "indexdir" directory must already exist
ix = create_in("indexdir", schema)
writer = ix.writer()
test_items = [u"Eagle Bank and Trust Company of Missouri"]
# Index each bank name as its own document
for item in test_items:
    writer.add_document(name=item)
writer.commit()
from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm
with ix.searcher() as s:
    # Parse every term of the query as a FuzzyTerm
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)
gives me:
<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>
Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?
You could match Co with Company using fuzzy search in Whoosh, but you shouldn't, because the difference between Co and Company is large. Co is as similar to Company as Be is to Beast or ny is to Company, so you can imagine how large and how poor the search results would be.
However, if you want to match Compan, compani, or Companee to Company, you can do so with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:
maxdist – The maximum edit distance from the given text.
class MyFuzzyTerm(FuzzyTerm):
    # Same as FuzzyTerm, but with a default maxdist of 2
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
Then:
qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)
You could match Co with Company by setting maxdist to 5, but as I said, this gives bad search results. I suggest keeping maxdist between 1 and 3.
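To make the distances concrete, here is a minimal sketch that computes the edit distance for the word pairs mentioned in this answer. It assumes plain Levenshtein distance (insert, delete, and substitute each cost 1) on lower-cased words, which may differ in detail from what Whoosh computes internally. The names edit_distance, pairs, and distances are illustrative and not part of Whoosh.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]


# (query term, indexed term) pairs taken from the answer, lower-cased
pairs = [
    ("co", "company"),
    ("be", "beast"),
    ("ny", "company"),
    ("compan", "company"),
    ("compani", "company"),
    ("companee", "company"),
]

distances = {pair: edit_distance(*pair) for pair in pairs}

for query, indexed in pairs:
    print("%-9s -> %-8s distance %d" % (query, indexed, distances[(query, indexed)]))
```

Under these assumptions, the misspellings Compan, compani, and Companee are within distance 2 of Company, while Co needs 5 edits, the same as ny. That is why a maxdist large enough to cover Co would also let through many unrelated short words.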
If you are looking to match linguistic variations of a word, you are better off using whoosh.query.Variations.
Note: older Whoosh versions have minsimilarity instead of maxdist.