rdplyrdata-analysisquasiquotes

R: Select the first non-NA data for each row, within an arbitrarily long subset of variables


I'm trying to write a function in R; let's call it favor(). It needs to accept an arbitrary number of columns of a tibble, which may not contain complete data. For each row in the dataset, favor() will examine the contents of each column provided (in the order provided) and "favor" the first one which contains a non-missing value, copying that value to a new column and returning the resulting dataset. So, for an input dataset that looks like this:

data <- tibble(
          a = c(1, 2, 3, 4),
          b = c(NA, NA, 7, NA),
          c = c(9, NA, 11, NA)
         )

# The output should look something like this:

result <- favor(data, b, c, a)

> result
# A tibble: 4 × 4
      a     b     c favored
  <dbl> <dbl> <dbl>   <dbl>
1     1    NA     9       9
2     2    NA    NA       2
3     3     7    11       7
4     4    NA    NA       4

This is easy enough to do when you have a set number of variables to look through; the difficulty is that I want this to work for an arbitrarily large subset of columns from the input dataset. I seems that dot-dot-dot notation will need to be used, so I think the function header will look something like:

favor <- function(data, ...) {}

But beyond this I'm a little stumped on what the best way to implement this would be. I'm working with dplyr, so I've tried thinking about something like:

favor <- function(data, ...) {

  result <- data %>%
    mutate(favor = first(na.omit({{ ... }})))

}

But this doesn't seem to work, I think because this isn't the right way to do embrasure around .... So, what's the right approach for this mapping function? Do I need to loop over all rows in data and all values in ...? Is there something more clever I can do, using proper quasi-quotation?

A few requirements about my use case I want to note:

And, if it helps to conceptualize this as a more specific use case: one use of this will be to choose a date from multiple date columns that may contain missing data. In a dataset of events, date column A might be the most accurate date to use to represent that event, but if it's missing, then date column B is the next best proxy, date column C is the next best proxy after that, and so on.

I think this question comes the closest to what I'm asking (and the use of first() is probably a good lead), but my data won't be grouped and again, I need this to work for an arbitrary number of columns.


Solution

  • library(dplyr)
    
    favor <- function(data, ...) {
      data |>
        mutate(favored = coalesce(!!!ensyms(...)))
    }
    
    favor(data, b, c, a)
    # a     b     c favored
    # <dbl> <dbl> <dbl>   <dbl>
    # 1     1    NA     9       9
    # 2     2    NA    NA       2
    # 3     3     7    11       7
    # 4     4    NA    NA       4
    

    ?coalesce does exactly what you are looking for:

    Given a set of vectors, coalesce() finds the first non-missing value at each position.

    You simply need to unpack your ellipsis into the function call:

    favor <- function(data, ...) {
      data |>
        mutate(favored = coalesce(!!!ensyms(...))) |>
        rlang::qq_show()
    }
    
    favor(data, b, c, a)
    # mutate(data, favored = coalesce(b, c, a))
    
    favor(data, c, b, a)
    # mutate(data, favored = coalesce(c, b, a))