
Alternatives to if_all and if_any for Data Manipulation in R


I have been using the if_all and if_any functions from the dplyr package for data manipulation in R. However, I have noticed that these functions can be quite slow when handling large datasets. Therefore, I am looking for alternatives available in the collapse or kit R packages.

While I attempted to use the collapse package, I am not obtaining the expected results. Below is the output from my R code, which illustrates the discrepancies I encountered:

library(tidyverse)
library(collapse)

set.seed(12345)
df1 <-
  data.frame(
    A1 = runif(n = 10, min = 1, max = 3) %>% round
  , A2 = runif(n = 10, min = 1, max = 4) %>% round
  , A3 = runif(n = 10, min = 1, max = 5) %>% round
  )

df1
#>    A1 A2 A3
#> 1   2  1  3
#> 2   3  1  2
#> 3   3  3  5
#> 4   3  1  4
#> 5   2  2  4
#> 6   1  2  3
#> 7   2  2  4
#> 8   2  2  3
#> 9   2  2  2
#> 10  3  4  3

df1 %>% 
  mutate(
    A4 = if_all(c(A2, A3), \(x) x %in% c(1, 2))
  , A5 = if_any(c(A2, A3), \(x) x %in% c(1, 2))
    )
#>    A1 A2 A3    A4    A5
#> 1   2  1  3 FALSE  TRUE
#> 2   3  1  2  TRUE  TRUE
#> 3   3  3  5 FALSE FALSE
#> 4   3  1  4 FALSE  TRUE
#> 5   2  2  4 FALSE  TRUE
#> 6   1  2  3 FALSE  TRUE
#> 7   2  2  4 FALSE  TRUE
#> 8   2  2  3 FALSE  TRUE
#> 9   2  2  2  TRUE  TRUE
#> 10  3  4  3 FALSE FALSE
  
df1 %>% 
  fmutate(
    A4 = fsum(fselect(., A2, A3) %in% c(1, 2)) == 2
  , A5 = fsum(fselect(., A2, A3) %in% c(1, 2)) >= 1
    )
#>    A1 A2 A3    A4    A5
#> 1   2  1  3 FALSE FALSE
#> 2   3  1  2 FALSE FALSE
#> 3   3  3  5 FALSE FALSE
#> 4   3  1  4 FALSE FALSE
#> 5   2  2  4 FALSE FALSE
#> 6   1  2  3 FALSE FALSE
#> 7   2  2  4 FALSE FALSE
#> 8   2  2  3 FALSE FALSE
#> 9   2  2  2 FALSE FALSE
#> 10  3  4  3 FALSE FALSE

How can I properly implement alternatives to if_all and if_any using the collapse or kit packages?


Solution

  • You don't say if you're also using the collapse version of %in%, but if not, I'd start there.

    See the help page for fmatch, which suggests making %in% use a faster version based on fmatch by running

    set_collapse(mask = "%in%")
    

    I'd then rerun your original if_all/if_any code. When I do this on a data frame with 1e7 rows, the runtime drops from 1.6s to 0.6s.

    I thought boolean operators might be faster too, but when I timed the code below it was actually slower than your original code.

    df1 %>% 
      mutate(
        A4 = (A2==1 | A2==2) & (A3==1 | A3==2),
        A5 = (A2==1 | A2==2) | (A3==1 | A3==2)
      )