
Median imputation for data frames in a list using mutate() in dplyr


I want to replace missing values with the column median in each data frame of a list. I can do this when I type the column name directly. How can I do it when the column is selected at random, as in a simulation study?

For example:

mylist <- list(structure(list(V1 = c(3L, 16L, 8L, 2L, 17L, 6L, 10L, 15L, 
7L, 11L), V2 = c(9L, NA, 14L, 18L, NA, 20L, 15L, 17L, 3L, NA), 
    V3 = c(4L, 1L, 10L, 9L, 7L, 13L, 16L, 8L, 17L, 18L)), row.names = c(NA, 
-10L), class = "data.frame"), structure(list(V1 = c(6L, 12L, 
14L, 10L, 5L, 20L, 26L, 2L, 23L, 1L), V2 = c(6L, 15L, NA, 30L, 
NA, 14L, 2L, 11L, NA, 3L), V3 = c(18L, 12L, 3L, 2L, 8L, 23L, 
13L, 16L, 17L, 7L)), row.names = c(NA, -10L), class = "data.frame"), 
    structure(list(V1 = c(18L, 26L, 9L, 28L, 8L, 4L, 29L, 24L, 
    37L, 3L), V2 = c(NA, 36L, 13L, 19L, NA, 31L, 20L, 7L, NA, 
    16L), V3 = c(NA, 25L, NA, NA, NA, 21L, 17L, 4L, 32L, 6L)), row.names = c(NA, 
    -10L), class = "data.frame"))

library(dplyr)
library(tidyr)  # provides replace_na()

newlist <- list()
for (k in seq_along(mylist)) {
  newlist[[k]] <- mylist[[k]] %>%
    mutate(V2 = replace_na(V2, median(V2, na.rm = TRUE)))
}

newlist

This works for the column named V2, as shown above.

ch_column <- sample(1:3, 1)
ch_column

How can I do the same when the column is chosen with sample()? I need to replace V2 in the code above with the column indicated by ch_column.


Solution

  • You can build the column name as a character string and inject it on the left-hand side of := with !!. On the right-hand side, .data[[var]] looks the column up by that name.

    imp_fun <- function(df, col) {
      var <- paste0('V', col)
      df %>%
        mutate(!!var := replace_na(.data[[var]], median(.data[[var]], na.rm = TRUE)))
    }
    
    newlist <- lapply(mylist, imp_fun, col = ch_column)
    
    ch_column
    # [1] 2
    
    newlist
    # [[1]]
    #    V1 V2 V3
    # 1   3  9  4
    # 2  16 15  1
    # 3   8 14 10
    # 4   2 18  9
    # 5  17 15  7
    # 6   6 20 13
    # 7  10 15 16
    # 8  15 17  8
    # 9   7  3 17
    # 10 11 15 18
    # 
    # [[2]]
    # ...
    # 
    # [[3]]
    # ...
    

    If you are not familiar with how lapply works, the code above is equivalent to the following for loop.

    newlist <- list()
    ch_column <- sample(1:3, 1)
    var <- paste0('V', ch_column)
    for (k in 1:3) {
      newlist[[k]] <- mylist[[k]] %>%
        mutate(!!var := replace_na(.data[[var]], median(.data[[var]], na.rm = TRUE)))
    }
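
    As a possible alternative to the !! and := injection above, the same imputation can be sketched with across() and all_of(), which select the column from a character string. This is only a minimal sketch under stated assumptions: toy_list, toy_col, toy_result and impute_median() are hypothetical names made up for illustration, and the toy data is not the data from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not from the question); columns are doubles
toy_list <- list(
  data.frame(V1 = c(1, 2, 3), V2 = c(4, NA, 8)),
  data.frame(V1 = c(5, NA, 7), V2 = c(NA, 2, 6))
)

# impute_median() is an illustrative helper, not a dplyr or tidyr function
impute_median <- function(df, col) {
  var <- paste0("V", col)  # e.g. col = 2 gives "V2"
  df %>%
    # all_of(var) selects the column named in the string;
    # .x is that column inside the lambda
    mutate(across(all_of(var), ~ replace_na(.x, median(.x, na.rm = TRUE))))
}

# Fixed here so the result is reproducible; in a simulation this
# would come from sample(), as in the question
toy_col <- 2
toy_result <- lapply(toy_list, impute_median, col = toy_col)
```

    Only the selected column is imputed, so the NA in V1 of the second toy data frame is left as it is. One caveat, as far as I know: recent tidyr versions cast the replacement value to the column's type, so a non-integer median on an integer column may raise an error. Converting the column to double first should avoid that.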