I'm completely new to Hibernate Search and I'm facing a bug in which a search matches 007a7358924e4a60923c6a57f58333bf when the query term is 0001. The field in question is the following:
@FullTextField(analyzer = "edgeNgram")
@Column(name = "serial")
private String serial;
The edgeNgram analyzer is declared as:
@Override
public void configure(final LuceneAnalysisConfigurationContext context) {
    context.analyzer("edgeNgram").custom()
            .tokenizer(WhitespaceTokenizerFactory.class)
            .charFilter(HTMLStripCharFilterFactory.class)
            .tokenFilter(ASCIIFoldingFilterFactory.class)
            .tokenFilter(LowerCaseFilterFactory.class)
            .tokenFilter(SnowballPorterFilterFactory.class)
            .tokenFilter(EdgeNGramFilterFactory.class)
                .param("minGramSize", "2")
                .param("maxGramSize", "32");
}
And the matching is done with:
private SearchPredicate matchField(SearchPredicateFactory f, String field, String search) {
    return f.match().field(field).matching(search).toPredicate();
}
I'm not sure this is really a bug, since I suppose this is how the engine works, and the point of full-text search is to return results that are not exact matches. But it was raised as a bug, and I'm looking for some way to keep 0001 or 000 from matching the string above.
I'm happy to include any other code you may find useful. I don't really know how to phrase this question more clearly.
You should try defining a different analyzer, without the ngram filter, to be applied to your search terms:
@Override
public void configure(final LuceneAnalysisConfigurationContext context) {
    context.analyzer("edgeNgram").custom()
            .tokenizer(WhitespaceTokenizerFactory.class)
            .charFilter(HTMLStripCharFilterFactory.class)
            .tokenFilter(ASCIIFoldingFilterFactory.class)
            .tokenFilter(LowerCaseFilterFactory.class)
            .tokenFilter(SnowballPorterFilterFactory.class)
            .tokenFilter(EdgeNGramFilterFactory.class)
                .param("minGramSize", "2")
                .param("maxGramSize", "32");
    context.analyzer("searchAnalyzer").custom()
            .tokenizer(WhitespaceTokenizerFactory.class)
            // this one probably also doesn't make sense (unless your search query includes HTML...):
            //.charFilter(HTMLStripCharFilterFactory.class)
            .tokenFilter(ASCIIFoldingFilterFactory.class)
            .tokenFilter(LowerCaseFilterFactory.class)
            .tokenFilter(SnowballPorterFilterFactory.class);
}
And then in your entity:
@FullTextField(analyzer = "edgeNgram", searchAnalyzer = "searchAnalyzer")
@Column(name = "serial")
private String serial;
What happens is that, by default, the same analysis is applied to your search string "0001", so it is tokenized as [00, 000, 0001]. Since your document value 007a7358924e4a60923c6a57f58333bf starts with 00, its indexed tokens include 00 as well, and you get a match.
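To make the effect concrete, here is a rough plain-Java sketch of the token overlap. It does not call Lucene or Hibernate Search: the class EdgeNgramSketch and its methods edgeNgrams and matches are hypothetical helpers written only for this illustration, and it assumes the other filters (lowercasing, ASCII folding, stemming) leave these particular values unchanged and that a match succeeds as soon as one query token equals one indexed token.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a simplified model of the analysis, not the real Lucene filter.
final class EdgeNgramSketch {

    // Prefixes of `token` with lengths minGram..maxGram (capped at the token length),
    // which is roughly what an edge n-gram filter emits for a single token.
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Simplified matching rule (an assumption): one shared token is enough.
    static boolean matches(List<String> indexedTokens, List<String> queryTokens) {
        for (String queryToken : queryTokens) {
            if (indexedTokens.contains(queryToken)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> indexed = edgeNgrams("007a7358924e4a60923c6a57f58333bf", 2, 32);

        // Query analyzed WITH the ngram filter: [00, 000, 0001] shares "00" with the index.
        System.out.println(edgeNgrams("0001", 2, 32));                   // [00, 000, 0001]
        System.out.println(matches(indexed, edgeNgrams("0001", 2, 32))); // true

        // Query analyzed WITHOUT the ngram filter: the single token must be a real prefix.
        System.out.println(matches(indexed, List.of("0001")));           // false
        System.out.println(matches(indexed, List.of("000")));            // false
        System.out.println(matches(indexed, List.of("007a")));           // true
    }
}
```

With the separate search analyzer, the query stays a single token, so 0001 and 000 no longer match, while a genuine prefix such as 007a still does.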