I am having a hard time understanding how to compute nDCG for my data. I am relatively new to R and I am struggling to translate the formulas to real practice in R. I have a record of 50 columns and I am only interested in 5 columns that have my relevancy score for search results. I would like to compute nDCG from these 5 columns which represent different ranks. I managed to compute the DCG with the equation below:
data1$DCG <- data1$rank1/log2 (1+1) + data1$rank2/log2(2+1) + data1$rank3/log2(3+1) + data1$rank4/log2(4+1) + data1$rank5/log2(5+1)
My question is how do I continue to compute the ideal DCG which is an ordered DCG and compute it in a loop for all rows?
nDCG is relatively simple and I will manage, I just need assistance with the ideal DCG, here is the formula:
Can someone please assist me?
This answer is intended to show how the formulas can be implemented in R
.
From the Wikipedia's article "Discounted cumulative gain", I scrapped the example as toy data:
library(rvest)
table <- "https://en.wikipedia.org/wiki/Discounted_cumulative_gain" |>
read_html() |>
html_table() |>
{ \(x) x[[2L]] }() |>
as.data.frame() |>
`colnames<-`(c("i", "rel_i", "log_2(i+1)", "bracket"))
gives
> table
i rel_i log_2(i+1) bracket
1 1 3 1.000 3.000
2 2 2 1.585 1.262
3 3 3 2.000 1.500
4 4 0 2.322 0.000
5 5 1 2.585 0.387
6 6 2 2.807 0.712
Reshaping the data to meet my assumptions about the structure of the OP's data.
toy_data <- as.data.frame(t(table[, c("rel_i")])) |>
`colnames<-`(paste0("rank", 1L:6L))
Note, my assumptions can be wrong. OP does not provide data. This gives
> toy_data
rank1 rank2 rank3 rank4 rank5 rank6
1 3 2 3 0 1 2
Let's wrap the steps to compute "DCG", "IDCG", "nDCG" (definitions taken from Wikipedia) in a small function called calc
:
calc <- \(df, k) {
d <- log2(1L:k + 1L)
DCG <- sum(df[1L, 1L:k] / d)
IDCG <- sum(sort(as.numeric(df[1L, ]), decreasing = TRUE)[1L:k] / d)
nDCG <- DCG / IDCG
list("DCG" = DCG, "IDCG" = IDCG, "nDCG" = nDCG)
}
Let's rerun the extended example stated in the Wikipedia article to check if the implementation is correct.
# add rank7, rank8 to meet modifications stated @Wikipedia:
toy_data$rank7 <- 3L
toy_data$rank8 <- 2L
k = 6
:> calc(df = toy_data, k = 6L)
$DCG
[1] 6.861127
$IDCG
[1] 8.740262
$nDCG
[1] 0.7850024
Identical results. Just round to two digits.
"Thanks. But what is 1L and in my case where my columns are only a subset of the whole data frame, how do I do it?"
(1) Read this. (2) As already mentioned, you do not provide data. We do not know the data structure.
Some minor modifications which allows us to apply the function to a data frame (data
and structure given below),
calc2 <- \(df, k) {
d <- log2(1L:k + 1L)
DCG <- apply(df, 1L, \(x) sum(x[1L:k] / d))
# lets put t() in / remove as.numeric()
IDCG <- apply(df, 1L, \(x) sum(t(sort(x, decreasing = TRUE)[1L:k] / d)))
cbind(df, "DCG" = DCG, "IDCG" = IDCG, "nDCG" = DCG / IDCG)
}
and apply:
> calc2(df = data[, grepl("rank", colnames(data))],
# use all cols containing "rank" in name
k = 6L)
rank1 rank2 rank3 rank4 rank5 rank6 rank7 rank8 DCG IDCG nDCG
1 3 2 3 0 1 2 3 2 6.861127 8.740262 0.7850024
2 2 2 2 0 0 3 3 2 5.330481 8.240262 0.6468824
3 3 2 1 2 2 3 3 3 7.465540 9.170939 0.8140431
4 1 3 0 2 3 2 1 2 5.627115 7.884055 0.7137336
5 3 2 3 2 3 2 2 3 8.496185 9.170939 0.9264248
Data:
# the data is generated by resampling (replace = T) the Wiki example
data <- structure(list(`rank1` = c(3, 2, 3, 1, 3),
`rank2` = c(2, 2, 2, 3, 2),
`rank3` = c(3, 2, 1, 0, 3),
`rank4` = c(0, 0, 2, 2, 2),
`rank5` = c(1, 0, 2, 3, 3),
`rank6` = c(2, 3, 3, 2, 2),
`rank7` = c(3, 3, 3, 1, 2),
`rank8` = c(2, 2, 3, 2, 3),
`blabla` = c(9, 5, 6, 7, 3)),
row.names = c(NA, -5L), class = "data.frame")