r dplyr purrr

How to write custom functions that you can apply to columns in purrr


I need to apply a custom function to a set of columns in a dataset and return a list. I can do this with lapply(), but I am trying to work with purrr.

Toy data: three factor variables.

df <- data.frame(variableA = factor(sample(x = c("lessThan", "moreThan"),
                                           size = 20,
                                           replace = TRUE,
                                           prob = c(0.5, 0.5))),
                 variableB = factor(sample(x = c("lessThan", "moreThan"),
                                           size = 20,
                                           replace = TRUE,
                                           prob = c(0.2, 0.8))),
                 variableC = factor(sample(x = c("lessThan", "moreThan"),
                                           size = 20,
                                           replace = TRUE,
                                           prob = c(0.4, 0.6))))

Now we create the function. It takes the name of the outcome variable as a string and returns a dataframe with the count and percentage of each level of that variable.

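library(dplyr)

# data: a data frame; var: the name of a column in data, as a string
countMoreLessFunct <- function(data, var) {
  data %>%
    group_by(.data[[var]]) %>%         # look the column up by its name
    summarise(count = n()) %>%         # rows per level
    ungroup() %>%
    mutate(tot = sum(count),           # total number of rows
           perc = round(x = count / tot * 100,
                        digits = 2))   # percentage per level
}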

The function works fine with a single variable.

countMoreLessFunct(data = df,
                   var = "variableA")

# output
#   variableA count   tot  perc
#   <fct>     <int> <int> <dbl>
# 1 lessThan     11    20    55
# 2 moreThan      9    20    45

It also works with lapply():

lapply(names(df), function(i) countMoreLessFunct(df, i))

But when I try it with purrr, I get all sorts of errors:

df %>%
  map(.f = ~countMoreLessFunct(df, .x))

The above, for example, returns this error:

# Error in `map()`:
#   ℹ In index: 1.
# ℹ With name: variableA.
# Caused by error in `group_by()`:
#   ℹ In argument: `.data[[structure(c(1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, `.
# Caused by error in `.data[[<fct>]]`:
#   ! Must subset the data pronoun with a string, not a <factor> object. 

I am lost. The problem presumably lies in the original function, perhaps in the fact that it requires a string? Any help appreciated.


Solution

  • In your example, map() iterates over the column contents (factors) rather than the column names your function expects. You can use {purrr}'s imap() instead, which passes each element's name (or its index, if the input is unnamed) as .y:

    df %>%
      imap(.f = ~countMoreLessFunct(df, .y))
    

    Since you're into using {purrr}, you could also create your dataframe by mapping a list of "less-than" probabilities:

    df <- list(.5, .2, .4) |> 
      map_dfc( ~ sample(x = c("lessThan", "moreThan"),
                        size = 20, replace = TRUE,
                        prob = c(.x, 1 - .x)
                        ) |> factor()
      ) |> 
      setNames(nm = paste0('variable', LETTERS[1:3]))
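
To make the difference between map() and imap() concrete, here is a minimal sketch. It assumes a small hand-written data frame (`toy`, a hypothetical stand-in for the question's `df`) so that the result is deterministic, and it also shows a third option that mirrors the lapply() call from the question: iterating over the column names directly.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data (not the question's df), fixed so the result is deterministic
toy <- data.frame(
  variableA = factor(c("lessThan", "moreThan", "lessThan", "moreThan")),
  variableB = factor(c("moreThan", "moreThan", "moreThan", "lessThan"))
)

# Same function as in the question
countMoreLessFunct <- function(data, var) {
  data %>%
    group_by(.data[[var]]) %>%
    summarise(count = n()) %>%
    ungroup() %>%
    mutate(tot = sum(count),
           perc = round(x = count / tot * 100, digits = 2))
}

# map() over a data frame passes each column's contents as .x (here, factors)
map_chr(toy, ~ class(.x))

# imap() additionally passes each column's name as .y
imap_chr(toy, ~ .y)

# Alternative: iterate over the names themselves, as the lapply() call does.
# set_names() names the character vector by its own values, so the
# resulting list is named by column.
res <- names(toy) %>%
  set_names() %>%
  map(~ countMoreLessFunct(toy, .x))
```

Here `res$variableA` and `res$variableB` hold one summary each. Whether to use imap() or to map over names() is mostly a matter of taste; the second form keeps `.x` as the thing the function actually consumes.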