I want to replace missing values with the column median in data frames stored in a list. I can do this when I type the column name explicitly. However, how can I do it when the column has to be selected at random, as in a simulation study?
For example:
mylist <- list(structure(list(V1 = c(3L, 16L, 8L, 2L, 17L, 6L, 10L, 15L,
7L, 11L), V2 = c(9L, NA, 14L, 18L, NA, 20L, 15L, 17L, 3L, NA),
V3 = c(4L, 1L, 10L, 9L, 7L, 13L, 16L, 8L, 17L, 18L)), row.names = c(NA,
-10L), class = "data.frame"), structure(list(V1 = c(6L, 12L,
14L, 10L, 5L, 20L, 26L, 2L, 23L, 1L), V2 = c(6L, 15L, NA, 30L,
NA, 14L, 2L, 11L, NA, 3L), V3 = c(18L, 12L, 3L, 2L, 8L, 23L,
13L, 16L, 17L, 7L)), row.names = c(NA, -10L), class = "data.frame"),
structure(list(V1 = c(18L, 26L, 9L, 28L, 8L, 4L, 29L, 24L,
37L, 3L), V2 = c(NA, 36L, 13L, 19L, NA, 31L, 20L, 7L, NA,
16L), V3 = c(NA, 25L, NA, NA, NA, 21L, 17L, 4L, 32L, 6L)), row.names = c(NA,
-10L), class = "data.frame"))
library(dplyr)  # %>% and mutate()
library(tidyr)  # replace_na()

newlist <- list()
for (k in 1:3) {
  newlist[[k]] <- mylist[[k]] %>%
    mutate(V2 = replace_na(V2, median(V2, na.rm = TRUE)))
}
newlist
This works for the column named V2, as shown above.
ch_column <- sample(1:3, 1)
ch_column
How can I do the same when the column is chosen with sample()? I need to replace the hard-coded V2 in my first code block with the column selected by ch_column.
You can build the column name as a character string and inject it on the left-hand side of := with !!. On the right-hand side, .data[[var]] looks up the column by that same string.
imp_fun <- function(df, col) {
  var <- paste0('V', col)
  df %>%
    mutate(!!var := replace_na(.data[[var]], median(.data[[var]], na.rm = TRUE)))
}
newlist <- lapply(mylist, imp_fun, col = ch_column)
ch_column
# [1] 2
newlist
# [[1]]
# V1 V2 V3
# 1 3 9 4
# 2 16 15 1
# 3 8 14 10
# 4 2 18 9
# 5 17 15 7
# 6 6 20 13
# 7 10 15 16
# 8 15 17 8
# 9 7 3 17
# 10 11 15 18
#
# [[2]]
# ...
#
# [[3]]
# ...
If you are not familiar with how lapply works, the code above is equivalent to the following for loop.
newlist <- list()
ch_column <- sample(1:3, 1)
var <- paste0('V', ch_column)
for (k in seq_along(mylist)) {
  newlist[[k]] <- mylist[[k]] %>%
    mutate(!!var := replace_na(.data[[var]], median(.data[[var]], na.rm = TRUE)))
}
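As a possible alternative, the same imputation can be sketched with across() and all_of(), which avoids both !! and .data. This is a sketch rather than part of the answer above: the helper name impute_median and the toy list are made up for illustration, and the column name is looked up by position with names(df)[col] instead of being pasted together from 'V'.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the answer above): impute one column,
# chosen by position, with that column's own median.
impute_median <- function(df, col) {
  var <- names(df)[col]  # column name looked up by position
  df %>%
    mutate(across(all_of(var), ~ replace_na(.x, median(.x, na.rm = TRUE))))
}

# Toy data, made up for illustration only
toy <- list(
  data.frame(V1 = c(1, 2, 3), V2 = c(4, NA, 8)),
  data.frame(V1 = c(NA, 5, 7), V2 = c(1, 2, 3))
)

# Impute the second column of every data frame in the list
imputed <- lapply(toy, impute_median, col = 2)
```

Only the selected column is touched, so the NA in V1 of the second toy data frame is left as it is. In the simulation, col would be the value returned by sample().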