How to compute Normalized Discounted Gain in R

I am having a hard time understanding how to compute nDCG for my data. I am relatively new to R and I am struggling to translate the formulas to real practice in R. I have a record of 50 columns and I am only interested in 5 columns that have my relevancy score for search results. I would like to compute nDCG from these 5 columns which represent different ranks. I managed to compute the DCG with the equation below:

data1$DCG <- data1$rank1/log2 (1+1) + data1$rank2/log2(2+1) + data1$rank3/log2(3+1) + data1$rank4/log2(4+1) + data1$rank5/log2(5+1)

My question is how do I continue to compute the ideal DCG which is an ordered DCG and compute it in a loop for all rows?

nDCG is relatively simple and I will manage, I just need assistance with the ideal DCG, here is the formula:

Can someone please assist me?

Solution

This answer is intended to show how the formulas can be implemented in R.

From the Wikipedia's article "Discounted cumulative gain", I scrapped the example as toy data:

library(rvest)
table <- "https://en.wikipedia.org/wiki/Discounted_cumulative_gain" |>
  read_html() |>
  html_table() |>
  { \(x) x[[2L]] }() |>
  as.data.frame() |>
  `colnames<-`(c("i", "rel_i", "log_2(i+1)", "bracket"))

gives

> table
  i rel_i log_2(i+1) bracket
1 1     3      1.000   3.000
2 2     2      1.585   1.262
3 3     3      2.000   1.500
4 4     0      2.322   0.000
5 5     1      2.585   0.387
6 6     2      2.807   0.712

Reshaping the data to meet my assumptions about the structure of the OP's data.

toy_data <- as.data.frame(t(table[, c("rel_i")])) |>
  `colnames<-`(paste0("rank", 1L:6L))

Note, my assumptions can be wrong. OP does not provide data. This gives

> toy_data
  rank1 rank2 rank3 rank4 rank5 rank6
1     3     2     3     0     1     2

Let's wrap the steps to compute "DCG", "IDCG", "nDCG" (definitions taken from Wikipedia) in a small function called calc:

calc <- \(df, k) {
  d <- log2(1L:k + 1L) 
  DCG <- sum(df[1L, 1L:k] / d)
  IDCG <- sum(sort(as.numeric(df[1L, ]), decreasing = TRUE)[1L:k] / d)
  nDCG <- DCG / IDCG
  list("DCG" = DCG, "IDCG" = IDCG, "nDCG" = nDCG)
}

Let's rerun the extended example stated in the Wikipedia article to check if the implementation is correct.

Modfiy data

# add rank7, rank8 to meet modifications stated @Wikipedia: 
toy_data$rank7 <- 3L
toy_data$rank8 <- 2L

Calculate results for k = 6:

> calc(df = toy_data, k = 6L)
$DCG
[1] 6.861127

$IDCG
[1] 8.740262

$nDCG
[1] 0.7850024

Identical results. Just round to two digits.

"Thanks. But what is 1L and in my case where my columns are only a subset of the whole data frame, how do I do it?"

(1) Read this. (2) As already mentioned, you do not provide data. We do not know the data structure.

Some minor modifications which allows us to apply the function to a data frame (data and structure given below),

calc2 <- \(df, k) {
  d <- log2(1L:k + 1L)
  DCG <- apply(df, 1L, \(x) sum(x[1L:k] / d))
  # lets put t() in / remove as.numeric()
  IDCG <- apply(df, 1L, \(x) sum(t(sort(x, decreasing = TRUE)[1L:k] / d)))
  cbind(df, "DCG" = DCG, "IDCG" = IDCG, "nDCG" = DCG / IDCG)
}

and apply:

> calc2(df = data[, grepl("rank", colnames(data))], 
        # use all cols containing "rank" in name
        k = 6L)
  rank1 rank2 rank3 rank4 rank5 rank6 rank7 rank8      DCG     IDCG      nDCG
1     3     2     3     0     1     2     3     2 6.861127 8.740262 0.7850024
2     2     2     2     0     0     3     3     2 5.330481 8.240262 0.6468824
3     3     2     1     2     2     3     3     3 7.465540 9.170939 0.8140431
4     1     3     0     2     3     2     1     2 5.627115 7.884055 0.7137336
5     3     2     3     2     3     2     2     3 8.496185 9.170939 0.9264248

Data:

# the data is generated by resampling (replace = T) the Wiki example 
data <- structure(list(`rank1` = c(3, 2, 3, 1, 3), 
                       `rank2` = c(2, 2, 2, 3, 2), 
                       `rank3` = c(3, 2, 1, 0, 3), 
                       `rank4` = c(0, 0, 2, 2, 2), 
                       `rank5` = c(1, 0, 2, 3, 3), 
                       `rank6` = c(2, 3, 3, 2, 2), 
                       `rank7` = c(3, 3, 3, 1, 2), 
                       `rank8` = c(2, 2, 3, 2, 3),
                       `blabla` = c(9, 5, 6, 7, 3)), 
                  row.names = c(NA, -5L), class = "data.frame")