r data.table dtplyr

data.table: fill NA with a custom function that uses other cells


Assume we have a data.table like:

library(data.table)

set.seed(123666)


dt <- data.table(
  id = seq(1, 5), 
  sample1 = c(sample(c(NA, runif(2))), NA),      
  sample2 = c(NA, sample(c(NA, runif(3)))),    
  sample3 = c(sample(c(NA, runif(4))))      
)
dt
   id   sample1   sample2   sample3
1:  1        NA        NA 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115        NA 0.7849779
4:  4        NA 0.6875024 0.3684756
5:  5        NA 0.4859773        NA

It has many NA values that we want to fill. Typically, we can use the following syntax:

dt[is.na(dt)] <- 0
dt
   id   sample1   sample2   sample3
1:  1 0.0000000 0.0000000 0.6387276
2:  2 0.9293370 0.1875354 0.2087892
3:  3 0.1528115 0.0000000 0.7849779
4:  4 0.0000000 0.6875024 0.3684756
5:  5 0.0000000 0.4859773 0.0000000
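As an aside, the same constant fill can be done by reference, without copying the table. The sketch below assumes data.table 1.12.4 or later (where `setnafill()` is available) and numeric columns. `dt_demo` and its values are hypothetical, written by hand so the result does not depend on the random data above.

```r
library(data.table)

# Small hand-written table (hypothetical values, not the random data above)
dt_demo <- data.table(
  id      = 1:3,
  sample1 = c(NA, 0.5, 0.25),
  sample2 = c(0.1, NA, NA)
)

# Replace NA with 0 in the sample columns only, modifying dt_demo in place
setnafill(dt_demo, type = "const", fill = 0, cols = c("sample1", "sample2"))
dt_demo
```

Restricting `cols` to the sample columns leaves `id` untouched.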

However, suppose we want to fill the NA values with a more complex rule, for example a custom function calc_data(). As an illustration only, this function takes two inputs: the first is the column name (or column index) of the NA cell, and the second is the id value of its row.

# example, not the real function
sample_value <- c(1, 3, 3)
names(sample_value) <- c('sample1', 'sample2', 'sample3')
calc_data <- function(sample, id) {
    # sample: column name or index into sample_value; id: id of the row
    id * 3 + sample_value[sample]
}
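To make the rule concrete, here is a small worked call. It restates the example function with an explicit return value. The `unname()` call is an addition for tidier printing and is not part of the original function.

```r
# Lookup values per column, as in the example above
sample_value <- c(sample1 = 1, sample2 = 3, sample3 = 3)

calc_data <- function(sample, id) {
  # unname() is an assumption added here so results print without a name
  unname(id * 3 + sample_value[sample])
}

# NA in column sample2 on the row with id 1: 1 * 3 + 3
by_name <- calc_data("sample2", 1)

# Column index 1 is sample1; ids 3 and 4 are handled in one vectorised call
by_index <- calc_data(1, c(3, 4))

by_name
by_index
```

Because the arithmetic is vectorised over `id`, all NA cells of one column can be computed in a single call.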

Is it possible to fill the NA values with this custom function using data.table syntax? How can the required values be passed to calc_data()?


Solution

  • Perhaps something like this could work:

    for (j in 2:4) {
      na_rows <- which(is.na(dt[[j]]))
      # j - 1 maps column positions 2:4 to indices 1:3 of sample_value
      set(dt, i = na_rows, j = j, value = calc_data(j - 1, dt$id[na_rows]))
    }
    

    Output (the non-NA values differ from the table in the question):

       id    sample1    sample2    sample3
    1:  1  0.8055845 6.00000000  0.2030456
    2:  2  0.5705721 9.00000000  0.7954992
    3:  3 10.0000000 0.09605308 12.0000000
    4:  4 13.0000000 0.25545666  0.6506906
    5:  5  0.8055845 0.51889032  0.8931946
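A self-contained variant of the loop, indexing by column name instead of position, might look like the sketch below. `dt_fixed` is a hypothetical hand-written table that only reproduces the NA pattern of the output above. The non-NA numbers are made up, and `unname()` is an addition to the question's function.

```r
library(data.table)

# Lookup values and helper, following the question's example
sample_value <- c(sample1 = 1, sample2 = 3, sample3 = 3)
calc_data <- function(sample, id) {
  unname(id * 3 + sample_value[sample])
}

# Hypothetical data with NA in the same cells as the output above
dt_fixed <- data.table(
  id      = 1:5,
  sample1 = c(0.8, 0.6, NA, NA, 0.8),
  sample2 = c(NA, NA, 0.1, 0.3, 0.5),
  sample3 = c(0.2, 0.8, NA, 0.7, 0.9)
)

for (col in names(sample_value)) {
  na_rows <- which(is.na(dt_fixed[[col]]))
  # set() updates by reference; the ids are passed as a plain vector
  set(dt_fixed, i = na_rows, j = col,
      value = calc_data(col, dt_fixed$id[na_rows]))
}
dt_fixed
```

Looping over `names(sample_value)` avoids the `j - 1` offset, so the fill keeps working if the columns are reordered.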