SOLR (4.3) - reducing score of "poor" quality (very short) documents

We're running SOLR 4.3.1, and I have a question about controlling how SOLR scores certain documents.

In some cases, we have indexed documents that are of "poor" quality - the main body (a description field, in this case) may only have 3 or 4 words. Other documents may have much better descriptions. The problem arises when a search is performed and the searched term is found in both "good" (longer) and "poor" (shorter) documents.

SOLR seems to score the matches in the shorter documents higher, which makes sense, as the term searched for may be 1 of only 3 or 4 words, so it's a higher percentage than on a document with a longer description, where only 1 or 2 matches are found in 100 words (for example).

Is it possible to somehow penalize or reduce the score on really short documents? I know it's possible that some very short documents are ok, but as a general rule, really short documents in our case are usually "poor quality".

Suggestions?

We're using edismax searching.

Thanks,

Bill

Solution

BM25 Similarity allows you to tune the impact of length-normalization in document scoring. By default, as you've observed, shorter field content outranks longer field content with the same number of term-matches.

It sounds like you want to neutralize, or potentially reverse, this length normalization: either field contents of all lengths score the same for the same number of term-matches, or longer field contents score higher.

The two tuning parameters are:

k1 which controls the saturation point for term-frequency (for when you want repeated terms to have greater/lesser influence in score), and

b (the one you want) which controls the influence of content length on match scoring.

If you want to dive deeper, this is a good read on BM25: http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

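To get this working, you need to add BM25Similarity to your SOLR schema.xml, either globally, or nested within the definition of the field-type for your description field (recommended, since you may not want this treatment for all of your fields). Note that in Solr 4.x a similarity nested in a field-type is only honored when the global similarity is solr.SchemaSimilarityFactory.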

<similarity class="solr.BM25SimilarityFactory">
  <str name="k1">1.2</str>
  <str name="b">0.75</str>
</similarity>

(default values shown)

If you take b down to 0.0 you will effectively negate the impact of length-normalization, meaning two documents each matching the same single query term in the same field will always score equally (regardless of field length) when this field is the only factor considered for scoring.
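To make the effect of b concrete, here is a minimal sketch of the textbook BM25 per-term formula in Python. It is not Lucene's actual code (Lucene stores field lengths in a lossy encoded norm, so real scores will differ), and the function name, the average length of 50 terms, and the document lengths of 4 and 100 terms are assumptions chosen only for illustration.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, k1=1.2, b=0.75, idf=1.0):
    """Textbook BM25 score for one term in one field (illustrative only)."""
    # Length normalization factor: 1.0 for an average-length field,
    # below 1.0 for shorter fields, above 1.0 for longer fields.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


AVG_LEN = 50.0   # hypothetical average description length, in terms
SHORT_LEN = 4    # a "poor" document: 4-word description
LONG_LEN = 100   # a "good" document: 100-word description

for b in (0.75, 0.0):
    short_score = bm25_term_score(tf=1, doc_len=SHORT_LEN, avg_doc_len=AVG_LEN, b=b)
    long_score = bm25_term_score(tf=1, doc_len=LONG_LEN, avg_doc_len=AVG_LEN, b=b)
    print(f"b={b}: short={short_score:.3f} long={long_score:.3f}")
```

With the default b the short document wins on a single match; with b set to 0.0 the length factor is always 1.0, so both documents get the same score for the same term frequency.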

You'll need to reload your config and reindex your documents for this change to take effect.

You can also try experimenting with a negative b (-0.75 maybe?), as this hypothetically should work to reward longer documents, but I haven't verified this in the current implementation, so please post back if you do get negative b working the way you need.
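Before trying a negative b, it may help to see what the textbook BM25 formula does with one. The sketch below is plain Python, not Solr: it says nothing about whether Solr 4.3.1 accepts a negative value, and the function names, the average length of 50 terms, and the sample document lengths are assumptions for illustration.

```python
def bm25_length_norm(doc_len, avg_doc_len, b):
    """Length normalization factor from the textbook BM25 formula."""
    return 1.0 - b + b * (doc_len / avg_doc_len)


def bm25_term_score(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term score with idf fixed at 1.0 (illustrative only)."""
    return (tf * (k1 + 1.0)) / (tf + k1 * bm25_length_norm(doc_len, avg_doc_len, b))


AVG_LEN = 50.0  # hypothetical average description length, in terms

for doc_len in (4, 50, 100, 200):
    norm = bm25_length_norm(doc_len, AVG_LEN, b=-0.75)
    score = bm25_term_score(tf=1, doc_len=doc_len, avg_doc_len=AVG_LEN, b=-0.75)
    print(f"len={doc_len}: norm={norm:.3f} score={score:.3f}")
```

In this sketch the score does rise with length up to a point, but for fields several times longer than average the normalization factor goes negative and the denominator can cross zero, which flips the sign of the score. If negative b is accepted at all, checking the longest documents in the index first seems prudent.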