java hibernate lucene hibernate-search

Hibernate Search with Lucene matches configuration


I'm completely new to Hibernate Search and I'm facing a bug in which a search matches 007a7358924e4a60923c6a57f58333bf when the query term is 0001. The field in question is the following:

@FullTextField(analyzer = "edgeNgram")
@Column(name = "serial")
private String serial;

The edgeNgram is declared as:

@Override
public void configure(final LuceneAnalysisConfigurationContext context) {
  context.analyzer("edgeNgram").custom()
      .tokenizer(WhitespaceTokenizerFactory.class)
      .charFilter(HTMLStripCharFilterFactory.class)
      .tokenFilter(ASCIIFoldingFilterFactory.class)
      .tokenFilter(LowerCaseFilterFactory.class)
      .tokenFilter(SnowballPorterFilterFactory.class)
      .tokenFilter(EdgeNGramFilterFactory.class)
      .param("minGramSize", "2")
      .param("maxGramSize", "32");
}

And the matching is done with:

private SearchPredicate matchField(SearchPredicateFactory f, String field, String search) {
  return f.match().field(field).matching(search).toPredicate();
}

I don't know if this is really a bug, since I suppose this is how the engine works, and the point of full-text search is to return results that are not exact matches. But it was raised as a bug, and I'm looking for some way to make 0001 or 000 not match the string above.

I'm happy to include any other code that you may find useful. I don't really know how to phrase this question more clearly.


Solution

  • Try defining a separate analyzer for your search terms, one that does not include the edge n-gram filter:

    @Override
    public void configure(final LuceneAnalysisConfigurationContext context) {
      context.analyzer("edgeNgram").custom()
          .tokenizer(WhitespaceTokenizerFactory.class)
          .charFilter(HTMLStripCharFilterFactory.class)
          .tokenFilter(ASCIIFoldingFilterFactory.class)
          .tokenFilter(LowerCaseFilterFactory.class)
          .tokenFilter(SnowballPorterFilterFactory.class)
          .tokenFilter(EdgeNGramFilterFactory.class)
          .param("minGramSize", "2")
          .param("maxGramSize", "32");
      context.analyzer("searchAnalyzer").custom()
          .tokenizer(WhitespaceTokenizerFactory.class)
          // this one probably also doesn't make sense (unless your search query includes HTML...):
          //.charFilter(HTMLStripCharFilterFactory.class)
          .tokenFilter(ASCIIFoldingFilterFactory.class)
          .tokenFilter(LowerCaseFilterFactory.class)
          .tokenFilter(SnowballPorterFilterFactory.class);
    }
    

    and then in your entity:

    @FullTextField(analyzer = "edgeNgram", searchAnalyzer = "searchAnalyzer")
    @Column(name = "serial")
    private String serial;
    

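    What happens with your current configuration is that the same analysis is applied to your search string "0001" as to the indexed value, so the query is tokenized as [00, 000, 0001]. Since your document value 007a7358924e4a60923c6a57f58333bf starts with 00, the query token 00 equals one of the indexed tokens and you get a match. With a search analyzer that has no edge n-gram filter, the query stays a single token, 0001, which is not a prefix of that value.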
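
To make the explanation above concrete, here is a minimal sketch in plain Java that imitates only the edge n-gram step. It is not Lucene's filter and it ignores the other filters in the chain; `EdgeNgramDemo`, `edgeNGrams` and `matches` are hypothetical names introduced for illustration, and `matches` assumes the match predicate succeeds as soon as any query token equals an indexed token.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the edge n-gram step of the "edgeNgram" analyzer.
// Assumption: only the prefix expansion is modelled, none of the other filters.
class EdgeNgramDemo {

    // Returns the prefixes of `token` whose length lies between minGram and maxGram.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Assumption: a match needs at least one query token equal to an indexed token.
    static boolean matches(List<String> indexedTokens, List<String> queryTokens) {
        for (String q : queryTokens) {
            if (indexedTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Index-time tokens: every prefix of the serial from 2 to 32 characters.
        List<String> indexed = edgeNGrams("007a7358924e4a60923c6a57f58333bf", 2, 32);

        // Query analyzed with the edge n-gram filter (the original setup).
        List<String> queryWithNgram = edgeNGrams("0001", 2, 32);
        // Query analyzed without it (the "searchAnalyzer" setup).
        List<String> queryWithoutNgram = List.of("0001");

        System.out.println(queryWithNgram);                      // [00, 000, 0001]
        System.out.println(matches(indexed, queryWithNgram));    // true, because of "00"
        System.out.println(matches(indexed, queryWithoutNgram)); // false
    }
}
```

A query such as 007a still matches in the second setup, because 007a is one of the indexed prefixes, so prefix search keeps working while unrelated terms that merely share the first two characters no longer match.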