rdplyr

Correct a dataset using a lookup dataset


I have a very large (long and wide) survey response dataset (survey) that I need to clean. I have created a cleaning dataset (fixes) that contains the cleaning instructions I want to apply. I'm looking for a way to allow me to specify quite flexibly how widely the cleaning instructions are applied to the survey dataset.

The fixes dataset looks like this:

library(dplyr)
fixes <- tribble(
  ~variable, ~file,  ~return_type, ~original,         ~fix,
  "obj_n",   NA,     NA,           "One objective",   "1",
  "rdate",   "f1",   NA,           "31/3/202",        "31/3/2021",
  "funding", NA,     "Close",      "one thousand",    "1000",
  "funding", "f3",   "Close",      "10000",           "199",
  # if this last line is excluded, my existing syntax works. Note that 'original' and 'fix' are text.
  NA,        NA,     NA,           "N/A",             "NA" 
)

A survey dataset might look like this:

survey <- tribble(
  ~id, ~file, ~return_type, ~obj_n,          ~rdate,      ~funding,     ~test,
  "1", "f1",  "Start",      "2",             "31/3/202",  "10000",        "N/A",
  "2", "f1",  "Close",      "One objective", "1/12/2023", "one thousand", "test text",
  "3", "f2",  "Start",      "One objective", "1/3/2021",  "N/A",          NA,
  "4", "f2",  "Close",      "One objective", "N/A",       "one thousand", "one thousand",
  "5", "f3",  "Start",      NA,              NA,          "one thousand", NA,
  "6", "f3",  "Close",      "N/A",           "31/3/2020", "10000",        "hello"
)

Applying the functions should result in:

desired <- tribble(
  ~id, ~file, ~return_type, ~obj_n,          ~rdate,      ~funding,     ~test,
  "1", "f1",  "Start",      "2",             "31/3/2021", "10000",        "NA",
  "2", "f1",  "Close",      "1",             "1/12/2023", "1000",         "test text",
  "3", "f2",  "Start",      "1",             "1/3/2021",  "NA",           NA,
  "4", "f2",  "Close",      "1",             "NA",        "1000",         "one thousand",
  "5", "f3",  "Start",      NA,              NA,          "one thousand", NA,
  "6", "f3",  "Close",      "NA",            "31/3/2020", "199",          "hello"
)

I have managed to get this working if fixes$variable is defined. Fixes are then applied to a particular file or return_type if they are defined, and generically if they are not. Where this doesn't work is if fixes$variable is NA (i.e. the last line in fixes):

# run through the `fixes` tibble, and, for the variable defined in column "variable", replaces any occurrence of the
# string defined in "original" with the string defined in "fix". Note that all variables are still strings at this
# stage.
result <- survey
invisible(
  apply(fixes, MARGIN = 1, FUN = function(x) {
    result <<- result |> mutate(
      !!x["variable"] := if_else(
        (!!sym(x["variable"]) == x["original"]) &
          (
            ((is.na(x["file"])) & (is.na(x["return_type"]))) |
              
              ((is.na(x["file"])) &  (!is.na(!!sym("return_type"))) & 
                 (!!sym("return_type") == x["return_type"])) |
              
              ((!is.na(!!sym("file"))) & (!!sym("file") == x["file"]) & (is.na(x["return_type"]))) |
              
              ((!is.na(!!sym("file"))) & (!!sym("file") == x["file"]) & 
                 (!is.na(!!sym("return_type"))) & (!!sym("return_type") == x["return_type"]))
          ),
        x["fix"],
        !!sym(x["variable"])
      )
    )
  })
)

(as aside, all variables are text at this stage, and assigned the correct type in the next step)

Any help would be appreciated.


Solution

  • Alright, I found one way of doing it. I will vote as correct any better answer.

    result <- survey
    for (i in 1:nrow(fixes)){
      result[
        (is.na(fixes[[i, "file"]]) | (result[["file"]] ==  fixes[[i, "file"]])) &
          (is.na(fixes[[i, "return_type"]]) | (result[["return_type"]] == fixes[[i, "return_type"]])), 
      ] <-
        result |> 
        filter(
          (is.na(fixes[[i, "file"]]) | (file ==  fixes[[i, "file"]])) & 
            (is.na(fixes[[i, "return_type"]]) | (return_type == fixes[[i, "return_type"]]))
        ) |> 
        mutate(
          across(
            if(is.na(fixes[[i, "variable"]])) {everything()} else {fixes[[i, "variable"]]},
            ~ if_else(.x == fixes[[i, "original"]], fixes[[i, "fix"]], .x)
          )
        )
    }
    identical(result, desired) # TRUE