Goal: execute fuzzy search, then wildcard search with those similar terms
I have a boolean query in place at the moment, shown below:
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$pattern = new Zend_Search_Lucene_Index_Term("*$string*");
$subquery1 = new Zend_Search_Lucene_Search_Query_Wildcard($pattern);
$term = new Zend_Search_Lucene_Index_Term("$string");
$subquery2 = new Zend_Search_Lucene_Search_Query_Fuzzy($term);
$query->addSubquery($subquery1, null /* optional */);
$query->addSubquery($subquery2, null /* optional */);
$hits = $index->find($query);
This seems to be executing an either/or search. For example, if I search for the term "berry", I hit everything with "berry" anywhere in the title: berry, wild berry, strawberry, blueberry.
But if I search for "bery", I only hit results like berry.
I'm not exactly sure how the fuzzy search is powered. Is there a way to modify my query so that I can wildcard search after the fuzzy search returns the similar terms?
I suspect that field is not analyzed when indexed.
So, with the first query, you are getting hits from the wildcard query. *berry* matches all of the examples you've given. *bery* doesn't match any of the documents, though, since it's not actually a substring of any of them.
For the fuzzy query, terms are compared by edit distance (Damerau–Levenshtein distance). An edit distance of two is the default maximum for a match.
bery to berry - edit distance: 1
bery to wild berry - edit distance: 6
bery to strawberry - edit distance: 6
bery to blueberry - edit distance: 5
This could be handled in part by using an analyzer, instead of indexing the entire string as a single token. Standard analyzer would split wild berry up into the tokens wild and berry, and you could expect a fuzzy match on that.
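The edit distances listed above can be checked with a short sketch. It uses PHP's built-in levenshtein(), which computes plain Levenshtein distance rather than Damerau-Levenshtein; the two agree for these pairs because no transpositions are involved. The $titles array is only a stand-in for the indexed values, not something read from the index.

```php
<?php
// Compare the misspelled query against each title as a single, unanalyzed token.
$query = 'bery';
$titles = ['berry', 'wild berry', 'strawberry', 'blueberry']; // stand-in data

$distances = [];
foreach ($titles as $title) {
    // levenshtein() counts single-character insertions, deletions and substitutions.
    $distances[$title] = levenshtein($query, $title);
    printf("%s to %s - edit distance: %d\n", $query, $title, $distances[$title]);
}
```

Only berry falls within a maximum distance of two, which is consistent with the single hit returned for "bery".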
As for strawberry and blueberry, a fuzzy match won't find them unless your analyzer splits apart straw and berry somehow. You could manually specify terms to split apart by incorporating a SynonymFilter into your analyzer.
Another option would be to attempt to correct the query spelling before searching, using Lucene's SpellChecker.
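As a rough illustration of the "correct first, then wildcard" idea, here is a minimal sketch in plain PHP. It does not use Lucene's SpellChecker. The closestTerm() helper and the $vocabulary list are hypothetical and only for illustration; in practice the vocabulary would have to come from your own data.

```php
<?php
// Hypothetical helper: return the vocabulary term closest to $input,
// or null if nothing is within $maxDistance edits.
function closestTerm(string $input, array $vocabulary, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($vocabulary as $candidate) {
        $distance = levenshtein($input, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

$vocabulary = ['berry', 'wild', 'straw', 'blue']; // assumed list of known terms
$string = 'bery';

// Fall back to the raw input when no close term is found.
$corrected = closestTerm($string, $vocabulary) ?? $string;

// This pattern would then be passed to the wildcard subquery from the question.
$wildcardPattern = "*{$corrected}*";
echo $wildcardPattern, "\n";
```

With the corrected term in hand, the wildcard subquery from the question can be built from $wildcardPattern instead of the raw input, so a search for "bery" would be run as *berry*.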