I'm trying to find out how MarkLogic calculates the relevance score. MarkLogic support pointed me to a knowledge base article (link in reference) where I saw the formula below (natural log).
log(1/term frequency) * log(1/document frequency)
When I apply this formula to my use case, it always returns a negative value. Could anyone provide the final score calculated using the above formula for the use case below?
DB has 350k documents
Document (first result) has 500 words/terms
Document has 5 term matches
DB has 513 documents that match the search term
The formulas for relevance scores are documented in the MarkLogic Search Guide:
The logtfidf method (the default scoring method) uses the following formula to calculate relevance:
log(term frequency) * (inverse document frequency)
The inverse document frequency is defined as:
log(1/df)
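As a rough sanity check, the documented formula can be applied to the numbers in the question. This is only a sketch under assumptions the quoted text does not confirm: term frequency is taken as the raw match count (5), df is taken as the fraction of documents that match (513 out of 350,000), and natural log is used throughout. The function name logtfidf_raw is made up for illustration. MarkLogic's real score also involves term frequency normalization and internal scaling, so this value should not be expected to equal what the server reports.

```python
import math


def logtfidf_raw(term_count, matching_docs, total_docs):
    """Raw log(tf) * log(1/df), before any normalization or scaling.

    Assumptions (not confirmed by the documentation quoted above):
    - tf is the raw number of matches in the document
    - df is the fraction of documents in the database that match
    """
    log_tf = math.log(term_count)          # log(term frequency)
    df = matching_docs / total_docs        # document frequency as a fraction
    idf = math.log(1 / df)                 # inverse document frequency
    return log_tf * idf


score = logtfidf_raw(term_count=5, matching_docs=513, total_docs=350_000)
print(score)
```

Under these assumptions both factors are positive, so the product is positive (roughly 10.5). One way a negative value can arise with the log(1/tf) * log(1/df) form is mixing a raw count for one factor with a fraction for the other, since log(1/x) is negative for x > 1 and positive for x < 1.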
It seems that the Knowledgebase Article shows the formula for inverse document frequency when discussing logtfidf, which might be a little confusing. The intent was to introduce and explain term frequency normalization and the options that are available to customize the score calculation beyond just the logtfidf or inverse document frequency calculation.
You can influence the relevance score with the term frequency normalization setting, which takes into account the size of the document and the "density" of the terms relative to other documents in the database:
The scoring methods that take into account term frequency (score-logtfidf and score-logtf) will, by default, normalize the term frequency (how many search term matches there are for a document) based on the size of the document. The idea of this normalization is to take into account how frequently a term occurs in the document, relative to the other documents in the database. You can think of this as the density of terms in a document, as opposed to simply the frequency of the terms. The term frequency normalization makes a document that has, for example, 10 occurrences of the word "dog" in a 10,000,000-word document have a lower relevance than a document that has 10 occurrences of the word "dog" in a 100-word document. With the default term frequency normalization of scaled-log, the smaller document would have a higher score (and therefore be more relevant to the search), because it has a greater term density of the word "dog". For most search applications, this behavior is desirable.
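To make the "density" idea concrete, here is a small sketch comparing the two documents from the example. It uses plain occurrences per word as a stand-in for density. That is an assumption for illustration only and is not the scaled-log normalization MarkLogic applies; the helper term_density is hypothetical.

```python
def term_density(occurrences, document_words):
    """Occurrences per word: a simple illustrative proxy for term density."""
    return occurrences / document_words


# Same raw term frequency (10 matches for "dog"), very different document sizes.
large_doc = term_density(10, 10_000_000)
small_doc = term_density(10, 100)

# The smaller document is far denser in the term, which is why a
# size-aware normalization ranks it higher despite equal raw counts.
print(small_doc > large_doc)
```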