Is it possible to use Whoosh to find documents that do not match the query exactly but are very close to it? For example, the query and a document differ by only a single word.
I wrote a simple script that works when every word of the query appears in all of the documents:
import os.path

from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

# Create the index directory on first run.
if not os.path.exists("index"):
    os.mkdir("index")

schema = Schema(title=TEXT(stored=True))
ix = create_in("index", schema)
ix = open_dir("index")

# Index three sample titles.
writer = ix.writer()
writer.add_document(title=u'TV Ultra HD')
writer.add_document(title=u'TV HD')
writer.add_document(title=u'TV 4K Ultra HD')
writer.commit()

# Search the title field and print every hit.
with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    myquery = parser.parse(u'TV HD')
    results = searcher.search(myquery)
    for result in results:
        print(result)
Unfortunately, if I change the query to any of the ones below, I no longer get all 3 documents (or I get none at all):
myquery = parser.parse(u'TV Ultra HD')     # 2 hits
myquery = parser.parse(u'TV 4K Ultra HD')  # 1 hit
myquery = parser.parse(u'TV HD 2022')      # 0 hits
Is it possible to create a parser so that any of these queries still returns all 3 documents, even though the title field differs slightly from the query?
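My understanding of the hit counts above is that QueryParser joins the words of a query with AND by default, so every word must occur in the title. Here is a minimal sketch of that behavior in plain Python. It assumes lowercase whitespace tokenization in place of Whoosh's analyzer, and the names tokens, match_all, and match_any are made up for illustration; they are not Whoosh functions.

```python
# Simplified stand-in for the index: this is NOT Whoosh's analyzer,
# only a model of "all terms" (AND) versus "any term" (OR) matching.
TITLES = ["TV Ultra HD", "TV HD", "TV 4K Ultra HD"]


def tokens(text: str) -> set:
    """Lowercase the text and split it on whitespace."""
    return set(text.lower().split())


def match_all(query: str, titles=TITLES) -> list:
    """AND semantics: every query term must occur in the title."""
    wanted = tokens(query)
    return [t for t in titles if wanted <= tokens(t)]


def match_any(query: str, titles=TITLES) -> list:
    """OR semantics: at least one query term must occur in the title."""
    wanted = tokens(query)
    return [t for t in titles if wanted & tokens(t)]


for q in ["TV HD", "TV Ultra HD", "TV 4K Ultra HD", "TV HD 2022"]:
    print(f"{q!r}: all terms -> {len(match_all(q))}, any term -> {len(match_any(q))}")
```

Under "all terms" the four queries match 3, 2, 1, and 0 titles, which agrees with the hit counts I see from Whoosh; under "any term" each of them matches all 3 titles.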
After some thought, I settled on plain enumeration of all word combinations.
I added a variable tolerance: the maximum number of words that can be cut from the original query. I also added a separate function getResults(words, tolerance).
The final code is:
import os.path
from itertools import combinations
from typing import Optional

from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser
from whoosh.searching import Results


def getResults(words: list, tolerance: int) -> Optional[Results]:
    # Relies on the module-level `parser` and `searcher` created below.
    # Each variant keeps len(words) - tolerance words, so exactly
    # `tolerance` words are dropped per attempt.
    keep = len(words) - tolerance
    if keep <= 0:
        return None
    for variant in combinations(words, keep):
        myquery = parser.parse(' '.join(variant))
        results = searcher.search(myquery)
        if results:
            # Return the first variant that produces any hits.
            return results
    return None


if not os.path.exists("index"):
    os.mkdir("index")

schema = Schema(title=TEXT(stored=True, spelling=True))
ix = create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
writer.add_document(title=u'TV Ultra HD')
writer.add_document(title=u'TV 4K Ultra HD')
writer.add_document(title=u'TV HD 2022')
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    words = u'TV HD 2022'.split(' ')
    tolerance = 1  # New variable
    results = getResults(words, tolerance)
    # getResults returns None when no variant matches.
    for result in results or []:
        print(result)
The result is 3 Hits:
<Hit {'title': 'TV Ultra HD'}>
<Hit {'title': 'TV HD 2022'}>
<Hit {'title': 'TV 4K Ultra HD'}>
But I consider this a poor solution, because it seems to me that Whoosh should allow this to be implemented much more concisely.
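To make the cost of the enumeration concrete, here is a Whoosh-free sketch of it. It follows the description of tolerance as a maximum, so it tries 0, 1, ..., max_dropped removed words in turn, whereas getResults above appears to drop exactly tolerance words on every attempt. The names query_variants, variant_count, and max_dropped are hypothetical and exist only for this illustration.

```python
from itertools import combinations
from math import comb


def query_variants(words: list, max_dropped: int):
    """Yield query strings with 0, 1, ..., max_dropped words removed.

    Word order is preserved, and variants with fewer removed words come first.
    """
    count = len(words)
    for dropped in range(max_dropped + 1):
        keep = count - dropped
        if keep <= 0:
            break  # never yield an empty query
        for variant in combinations(words, keep):
            yield " ".join(variant)


def variant_count(count: int, max_dropped: int) -> int:
    """Number of strings query_variants yields for a query of `count` words."""
    return sum(comb(count, d) for d in range(min(max_dropped, count - 1) + 1))


print(list(query_variants(["TV", "HD", "2022"], 1)))
print(variant_count(10, 3))
```

For the three-word query with one removable word this yields 4 variants ('TV HD 2022', 'TV HD', 'TV 2022', 'HD 2022'), and for a ten-word query with up to three removable words it is already 176. In the worst case each variant is a separate search, which is part of why I am looking for something built into Whoosh.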