rcumulative-sum

How to compute Normalized Discounted Gain in R


I am having a hard time understanding how to compute nDCG for my data. I am relatively new to R and I am struggling to translate the formulas to real practice in R. I have a record of 50 columns and I am only interested in 5 columns that have my relevancy score for search results. I would like to compute nDCG from these 5 columns which represent different ranks. I managed to compute the DCG with the equation below:

data1$DCG <- data1$rank1/log2 (1+1) + data1$rank2/log2(2+1) + data1$rank3/log2(3+1) + data1$rank4/log2(4+1) + data1$rank5/log2(5+1)

My question is how do I continue to compute the ideal DCG which is an ordered DCG and compute it in a loop for all rows?

nDCG is relatively simple and I will manage, I just need assistance with the ideal DCG, here is the formula:

enter image description here

Can someone please assist me?


Solution

  • This answer is intended to show how the formulas can be implemented in R.

    From the Wikipedia's article "Discounted cumulative gain", I scrapped the example as toy data:

    library(rvest)
    table <- "https://en.wikipedia.org/wiki/Discounted_cumulative_gain" |>
      read_html() |>
      html_table() |>
      { \(x) x[[2L]] }() |>
      as.data.frame() |>
      `colnames<-`(c("i", "rel_i", "log_2(i+1)", "bracket"))
    

    gives

    > table
      i rel_i log_2(i+1) bracket
    1 1     3      1.000   3.000
    2 2     2      1.585   1.262
    3 3     3      2.000   1.500
    4 4     0      2.322   0.000
    5 5     1      2.585   0.387
    6 6     2      2.807   0.712
    

    Reshaping the data to meet my assumptions about the structure of the OP's data.

    toy_data <- as.data.frame(t(table[, c("rel_i")])) |>
      `colnames<-`(paste0("rank", 1L:6L))
    

    Note, my assumptions can be wrong. OP does not provide data. This gives

    > toy_data
      rank1 rank2 rank3 rank4 rank5 rank6
    1     3     2     3     0     1     2
    
    

    Let's wrap the steps to compute "DCG", "IDCG", "nDCG" (definitions taken from Wikipedia) in a small function called calc:

    calc <- \(df, k) {
      d <- log2(1L:k + 1L) 
      DCG <- sum(df[1L, 1L:k] / d)
      IDCG <- sum(sort(as.numeric(df[1L, ]), decreasing = TRUE)[1L:k] / d)
      nDCG <- DCG / IDCG
      list("DCG" = DCG, "IDCG" = IDCG, "nDCG" = nDCG)
    }
    

    Let's rerun the extended example stated in the Wikipedia article to check if the implementation is correct.

    1. Modfiy data
    # add rank7, rank8 to meet modifications stated @Wikipedia: 
    toy_data$rank7 <- 3L
    toy_data$rank8 <- 2L
    
    1. Calculate results for k = 6:
    > calc(df = toy_data, k = 6L)
    $DCG
    [1] 6.861127
    
    $IDCG
    [1] 8.740262
    
    $nDCG
    [1] 0.7850024
    

    Identical results. Just round to two digits.


    "Thanks. But what is 1L and in my case where my columns are only a subset of the whole data frame, how do I do it?"

    (1) Read this. (2) As already mentioned, you do not provide data. We do not know the data structure.

    Some minor modifications which allows us to apply the function to a data frame (data and structure given below),

    calc2 <- \(df, k) {
      d <- log2(1L:k + 1L)
      DCG <- apply(df, 1L, \(x) sum(x[1L:k] / d))
      # lets put t() in / remove as.numeric()
      IDCG <- apply(df, 1L, \(x) sum(t(sort(x, decreasing = TRUE)[1L:k] / d)))
      cbind(df, "DCG" = DCG, "IDCG" = IDCG, "nDCG" = DCG / IDCG)
    }
    

    and apply:

    > calc2(df = data[, grepl("rank", colnames(data))], 
            # use all cols containing "rank" in name
            k = 6L)
      rank1 rank2 rank3 rank4 rank5 rank6 rank7 rank8      DCG     IDCG      nDCG
    1     3     2     3     0     1     2     3     2 6.861127 8.740262 0.7850024
    2     2     2     2     0     0     3     3     2 5.330481 8.240262 0.6468824
    3     3     2     1     2     2     3     3     3 7.465540 9.170939 0.8140431
    4     1     3     0     2     3     2     1     2 5.627115 7.884055 0.7137336
    5     3     2     3     2     3     2     2     3 8.496185 9.170939 0.9264248
    

    Data:

    # the data is generated by resampling (replace = T) the Wiki example 
    data <- structure(list(`rank1` = c(3, 2, 3, 1, 3), 
                           `rank2` = c(2, 2, 2, 3, 2), 
                           `rank3` = c(3, 2, 1, 0, 3), 
                           `rank4` = c(0, 0, 2, 2, 2), 
                           `rank5` = c(1, 0, 2, 3, 3), 
                           `rank6` = c(2, 3, 3, 2, 2), 
                           `rank7` = c(3, 3, 3, 1, 2), 
                           `rank8` = c(2, 2, 3, 2, 3),
                           `blabla` = c(9, 5, 6, 7, 3)), 
                      row.names = c(NA, -5L), class = "data.frame")