I'm trying to add the Lucene.NET Highlighter to my search, but it's highlighting far more than the search term. What am I doing wrong?
Here's the highlighting code:
// stuff here to get scoreDocs
var content = doc.GetField("content").StringValue();
// content = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been"
var highlighter = new Highlighter(new StrongFormatter(), new HtmlEncoder(), new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader())));
highlighter.SetTextFragmenter(new SimpleFragmenter(100));
var tokenStream = analyzer.TokenStream("content", new StringReader(content));
var bestFragment = highlighter.GetBestFragment(tokenStream, content);
Searching for "lorem" gives me this bestFragment value:
<strong>Lorem</strong> <strong>Ipsum</strong> is <strong>simply</strong> <strong>dummy</strong> <strong>text</strong> of the <strong>printing</strong> and <strong>typesetting</strong> <strong>industry</strong>. <strong>Lorem</strong> <strong>Ipsum</strong> <strong>has</strong> <strong>been</strong>
As you can see, it has highlighted much more than just "Lorem". Why?
How do I make this behave sensibly?
I'm using a StandardAnalyzer and my query looks like "content:lorem".
Edit: I'm using Lucene.NET 2.9.2
You haven't posted your implementation of StrongFormatter or HtmlEncoder, but I would guess that the error is in the first one. The highlighter calls the formatter for every token group, not only for the ones that match the query, so the formatter needs to check the score of the passed TokenGroup to decide whether any formatting is needed.
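// Namespace assumed for the 2.9.x contrib Highlighter; adjust if your build differs.
using Lucene.Net.Highlight;

public class StrongFormatter : Formatter {
    public String HighlightTerm(String originalText, TokenGroup tokenGroup) {
        // A total score of 0 means no token in this group matched the query.
        var score = tokenGroup.GetTotalScore();
        if (score <= 0)
            return originalText;
        // Only matching tokens are wrapped; note the closing tag.
        return String.Concat("<strong>", originalText, "</strong>");
    }
}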
However, you're not the first one who wants to wrap matches in an HTML element. You could just use the SimpleHTMLFormatter that comes with Highlighter.Net. And while you're at it, there's also a SimpleHTMLEncoder, which probably does what your HtmlEncoder does.
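As a rough sketch of that swap, the highlighting code from the question could look something like this with the built-in classes. The wrapper class HighlightSketch and its method BestFragment are hypothetical names used only for illustration, and the Lucene.Net.Highlight namespace is an assumption based on the 2.9.x contrib Highlighter, so check both against your own build.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Highlight;   // assumed namespace for the 2.9.x contrib Highlighter
using Lucene.Net.Search;

// Hypothetical helper, not part of Lucene.NET.
public static class HighlightSketch
{
    public static string BestFragment(Query query, IndexSearcher indexSearcher,
                                      Analyzer analyzer, string content)
    {
        // Same scorer as in the question: the query is rewritten against the index reader.
        var scorer = new QueryScorer(query.Rewrite(indexSearcher.GetIndexReader()));

        // SimpleHTMLFormatter wraps only token groups with a positive score
        // in the given tags; SimpleHTMLEncoder HTML-encodes the surrounding text.
        var highlighter = new Highlighter(
            new SimpleHTMLFormatter("<strong>", "</strong>"),
            new SimpleHTMLEncoder(),
            scorer);
        highlighter.SetTextFragmenter(new SimpleFragmenter(100));

        var tokenStream = analyzer.TokenStream("content", new StringReader(content));
        return highlighter.GetBestFragment(tokenStream, content);
    }
}
```

With the query "content:lorem" and the sample content from the question, this should wrap only the two occurrences of "Lorem", since the other tokens have a score of 0.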