lucene.net lucene-highlighter

Lucene.NET highlighter plugin highlighting strangely


I'm trying to add the Lucene.NET Highlighter to my search, but it highlights far more than it should. What am I doing wrong?

Here's the highlighting code:

// stuff here to get scoreDocs

var content = doc.GetField("content").StringValue();
// content = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been"

  
var highlighter = new Highlighter(new StrongFormatter(), new HtmlEncoder(), new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader())));
highlighter.SetTextFragmenter(new SimpleFragmenter(100));
var tokenStream = analyzer.TokenStream("content", new StringReader(content));

var bestFragment = highlighter.GetBestFragment(tokenStream, content);

Searching for "lorem" gives me this bestFragment value:

<strong>Lorem</strong> <strong>Ipsum</strong> is <strong>simply</strong> <strong>dummy</strong> <strong>text</strong> of the <strong>printing</strong> and <strong>typesetting</strong> <strong>industry</strong>. <strong>Lorem</strong> <strong>Ipsum</strong> <strong>has</strong> <strong>been</strong>

As you can see, it has highlighted much more than just "Lorem". Why?

How do I make this behave sensibly?

I'm using a StandardAnalyzer, and my query looks like "content:lorem".

Edit: I'm using Lucene.NET 2.9.2


Solution

  • You haven't posted your implementation of StrongFormatter or HtmlEncoder, but I suspect the error is in the former. The highlighter passes every token to the formatter, not only the matching ones, so HighlightTerm needs to check the score of the passed TokenGroup and apply formatting only when that score is greater than zero.

    public class StrongFormatter : Formatter {
        public String HighlightTerm(String originalText, TokenGroup tokenGroup) {
            // Tokens that do not match the query have a total score of 0.
            var score = tokenGroup.GetTotalScore();
            if (score <= 0)
                return originalText;
    
            // Wrap only matching tokens; note the closing tag.
            return String.Concat("<strong>", originalText, "</strong>");
        }
    }
    

    However, you're not the first to want to wrap matches in an HTML element. You could just use the SimpleHTMLFormatter that ships with Highlighter.Net. While you're at it, there's also a SimpleHTMLEncoder, which probably does what your HtmlEncoder does.
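
    As a rough sketch, the highlighting code from the question could be switched to the built-in classes like this. It assumes Lucene.NET 2.9.x with the contrib Highlighter assembly referenced, and it reuses the query, indexSearcher, analyzer and content variables from the question. The Lucene.Net.Highlight namespace is an assumption for that version and may differ in other releases.

    ```csharp
    // Sketch only: query, indexSearcher, analyzer and content are the
    // variables from the question, not defined here.
    using System.IO;
    using Lucene.Net.Highlight; // assumed namespace for the 2.9.x contrib Highlighter

    // Built-in formatter: wraps scoring tokens in the given tags
    // and returns non-matching tokens unchanged.
    var formatter = new SimpleHTMLFormatter("<strong>", "</strong>");

    // Built-in encoder: HTML-encodes the fragment text.
    var encoder = new SimpleHTMLEncoder();

    var scorer = new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader()));

    var highlighter = new Highlighter(formatter, encoder, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(100));

    var tokenStream = analyzer.TokenStream("content", new StringReader(content));
    var bestFragment = highlighter.GetBestFragment(tokenStream, content);
    ```

    Because SimpleHTMLFormatter already does the score check itself, no custom Formatter is needed with this setup.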