I have a very large (long and wide) survey response dataset (survey
) that I need to clean. I have created a cleaning dataset (fixes
) that contains the cleaning instructions I want to apply. I'm looking for a way to allow me to specify quite flexibly how widely the cleaning instructions are applied to the survey dataset.
The fixes
dataset looks like this:
library(dplyr)
fixes <- tribble(
~variable, ~file, ~return_type, ~original, ~fix,
"obj_n", NA, NA, "One objective", "1",
"rdate", "f1", NA, "31/3/202", "31/3/2021",
"funding", NA, "Close", "one thousand", "1000",
"funding", "f3", "Close", "10000", "199",
# if this last line is excluded, my existing syntax works. Note that 'original' and 'fix' are text.
NA, NA, NA, "N/A", "NA"
)
A survey dataset might look like this:
survey <- tribble(
~id, ~file, ~return_type, ~obj_n, ~rdate, ~funding, ~test,
"1", "f1", "Start", "2", "31/3/202", "10000", "N/A",
"2", "f1", "Close", "One objective", "1/12/2023", "one thousand", "test text",
"3", "f2", "Start", "One objective", "1/3/2021", "N/A", NA,
"4", "f2", "Close", "One objective", "N/A", "one thousand", "one thousand",
"5", "f3", "Start", NA, NA, "one thousand", NA,
"6", "f3", "Close", "N/A", "31/3/2020", "10000", "hello"
)
Applying the functions should result in:
desired <- tribble(
~id, ~file, ~return_type, ~obj_n, ~rdate, ~funding, ~test,
"1", "f1", "Start", "2", "31/3/2021", "10000", "NA",
"2", "f1", "Close", "1", "1/12/2023", "1000", "test text",
"3", "f2", "Start", "1", "1/3/2021", "NA", NA,
"4", "f2", "Close", "1", "NA", "1000", "one thousand",
"5", "f3", "Start", NA, NA, "one thousand", NA,
"6", "f3", "Close", "NA", "31/3/2020", "199", "hello"
)
I have managed to get this working if fixes$variable
is defined. Fixes are then applied to a particular file
or return_type
if they are defined, and generically if they are not. Where this doesn't work is if fixes$variable
is NA
(i.e. the last line in fixes
):
# run through the `fixes` tibble, and, for the variable defined in column "variable", replaces any occurrence of the
# string defined in "original" with the string defined in "fix". Note that all variables are still strings at this
# stage.
result <- survey
invisible(
apply(fixes, MARGIN = 1, FUN = function(x) {
result <<- result |> mutate(
!!x["variable"] := if_else(
(!!sym(x["variable"]) == x["original"]) &
(
((is.na(x["file"])) & (is.na(x["return_type"]))) |
((is.na(x["file"])) & (!is.na(!!sym("return_type"))) &
(!!sym("return_type") == x["return_type"])) |
((!is.na(!!sym("file"))) & (!!sym("file") == x["file"]) & (is.na(x["return_type"]))) |
((!is.na(!!sym("file"))) & (!!sym("file") == x["file"]) &
(!is.na(!!sym("return_type"))) & (!!sym("return_type") == x["return_type"]))
),
x["fix"],
!!sym(x["variable"])
)
)
})
)
(as aside, all variables are text at this stage, and assigned the correct type in the next step)
Any help would be appreciated.
Alright, I found one way of doing it. I will vote as correct any better answer.
result <- survey
for (i in 1:nrow(fixes)){
result[
(is.na(fixes[[i, "file"]]) | (result[["file"]] == fixes[[i, "file"]])) &
(is.na(fixes[[i, "return_type"]]) | (result[["return_type"]] == fixes[[i, "return_type"]])),
] <-
result |>
filter(
(is.na(fixes[[i, "file"]]) | (file == fixes[[i, "file"]])) &
(is.na(fixes[[i, "return_type"]]) | (return_type == fixes[[i, "return_type"]]))
) |>
mutate(
across(
if(is.na(fixes[[i, "variable"]])) {everything()} else {fixes[[i, "variable"]]},
~ if_else(.x == fixes[[i, "original"]], fixes[[i, "fix"]], .x)
)
)
}
identical(result, desired) # TRUE