python-3.x machine-learning nlp information-retrieval

Problems understanding nDCG format in pytrec_eval?


I am using pytrec_eval to calculate nDCG scores. For example, given the qrel:

qrel = {
    'q1': {
        'd1': 0,
        'd2': 1,
        'd3': 0,
    }
}

And the run:

run = {
    'q1': {
        'd1': 1.0,
        'd2': 0.0,
        'd3': 1.5,
    }
}

The nDCG score can be calculated like this:

import pytrec_eval
import json    
evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {'ndcg'})

print(json.dumps(evaluator.evaluate(run), indent=1))


 "q1": {
  "ndcg": 0.5
 }

My understanding is that nDCG takes into account the order of the retrieved documents; however, if you change the ordering of the documents in the run, you still get the same nDCG score. For example:

run2 = {
    'q1': {
        'd1': 1.0,
        'd3': 1.5,
        'd2': 0.0,
    }
}

evaluator = pytrec_eval.RelevanceEvaluator(qrel, {'ndcg'})
print(json.dumps(evaluator.evaluate(run2), indent=1))
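
This prints exactly the same result as before:

{
 "q1": {
  "ndcg": 0.5
 }
}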

Is this the expected behavior when calculating nDCG? What is the purpose of the qrel? My understanding is that the qrel tells you how relevant each retrieved document is, while the run is the ranking produced by my IR system for the query. So why does the nDCG score stay the same when I change the order of the run?


Solution

  • In order to compute nDCG, you need to know the relevance of each document in the ranked list of results for the query. This information is contained in the qrels. Ranking means that you first sort the retrieved documents in descending order of their score. You then go rank by rank, starting at rank 1 (the top-scored document). For each rank i you look up the document's ground-truth relevance rel_i in the qrels and divide it by log_2(i+1) to get the term for that rank. Summing these terms over all ranks gives the Discounted Cumulative Gain (DCG) for the query.

    Therefore, pytrec_eval internally has to build a sorted list from the dictionary mapping doc IDs to scores in order to obtain the ranks. This is why the order of the document-score pairs in the dictionary you pass as input doesn't matter. One additional detail: to get nDCG (i.e. normalized DCG), you divide the DCG by the Ideal DCG (IDCG), which is the maximum DCG achievable by any model. To compute the IDCG, you sort the ground-truth (as opposed to retrieved) documents in descending order of their relevance and compute the DCG of that ideal ranking. Again, the ground-truth relevance scores come from the qrels; the sketch below walks through this computation for the example in the question.
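
    Here is a minimal, hand-rolled version of the formula described above (not pytrec_eval's actual implementation), assuming the qrel and run from the question:

import math

qrel = {'q1': {'d1': 0, 'd2': 1, 'd3': 0}}
run = {'q1': {'d1': 1.0, 'd2': 0.0, 'd3': 1.5}}

def dcg(gains):
    # DCG = sum over ranks i (starting at 1) of rel_i / log2(i + 1)
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(gains, start=1))

def ndcg(qrel, run, query):
    # Rank retrieved documents by descending score; the dict order is irrelevant.
    # (trec_eval additionally breaks score ties by document ID; there are no ties here.)
    ranked = sorted(run[query], key=run[query].get, reverse=True)
    gains = [qrel[query].get(doc, 0) for doc in ranked]
    # Ideal ranking: ground-truth relevances from the qrels, sorted in descending order.
    ideal_gains = sorted(qrel[query].values(), reverse=True)
    return dcg(gains) / dcg(ideal_gains)

print(ndcg(qrel, run, 'q1'))  # 0.5, the same value pytrec_eval reports

    Passing run2 instead of run gives the same 0.5, because sorting by score produces the same ranking regardless of the dictionary's insertion order.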