I have been using the if_all and if_any functions from the dplyr package for data manipulation in R. However, I have noticed that these functions can be quite slow when handling large datasets. Therefore, I am looking for alternatives available in the collapse or kit R packages.
I attempted to use the collapse package, but I am not obtaining the expected results. Below is my R code with its output, which illustrates the discrepancies I encountered:
library(tidyverse)
library(collapse)
set.seed(12345)
df1 <-
  data.frame(
    A1 = runif(n = 10, min = 1, max = 3) %>% round
  , A2 = runif(n = 10, min = 1, max = 4) %>% round
  , A3 = runif(n = 10, min = 1, max = 5) %>% round
  )
df1
#> A1 A2 A3
#> 1 2 1 3
#> 2 3 1 2
#> 3 3 3 5
#> 4 3 1 4
#> 5 2 2 4
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 2 3
#> 9 2 2 2
#> 10 3 4 3
df1 %>%
  mutate(
    A4 = if_all(c(A2, A3), \(x) x %in% c(1, 2))
  , A5 = if_any(c(A2, A3), \(x) x %in% c(1, 2))
  )
#> A1 A2 A3 A4 A5
#> 1 2 1 3 FALSE TRUE
#> 2 3 1 2 TRUE TRUE
#> 3 3 3 5 FALSE FALSE
#> 4 3 1 4 FALSE TRUE
#> 5 2 2 4 FALSE TRUE
#> 6 1 2 3 FALSE TRUE
#> 7 2 2 4 FALSE TRUE
#> 8 2 2 3 FALSE TRUE
#> 9 2 2 2 TRUE TRUE
#> 10 3 4 3 FALSE FALSE
df1 %>%
  fmutate(
    A4 = fsum(fselect(., A2, A3) %in% c(1, 2)) == 2
  , A5 = fsum(fselect(., A2, A3) %in% c(1, 2)) >= 1
  )
#> A1 A2 A3 A4 A5
#> 1 2 1 3 FALSE FALSE
#> 2 3 1 2 FALSE FALSE
#> 3 3 3 5 FALSE FALSE
#> 4 3 1 4 FALSE FALSE
#> 5 2 2 4 FALSE FALSE
#> 6 1 2 3 FALSE FALSE
#> 7 2 2 4 FALSE FALSE
#> 8 2 2 3 FALSE FALSE
#> 9 2 2 2 FALSE FALSE
#> 10 3 4 3 FALSE FALSE
How can I properly implement alternatives to if_all and if_any using the collapse or kit packages?
You don't say whether you're also using the collapse version of %in%, but if not, I'd start there. See the help page for fmatch, which suggests replacing %in% with a faster version based on fmatch by running:
set_collapse(mask = "%in%")
I'd then try running your initial code again. When I do this with an initial data size of 1e7 rows, the runtime drops from 1.6s to 0.6s.
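To check that an fmatch-based membership test gives the same answer as base %in% before relying on it, something like the following sketch may help. The vector x, the lookup values, and the reduced size n are illustrative choices, not taken from the timing above, and the fallback branch exists only so the script still runs when collapse is not installed.

```r
# Sketch: compare base %in% with an fmatch-based membership test.
set.seed(1)
n <- 1e5  # far smaller than the 1e7 rows mentioned above, to keep the run short
x <- round(runif(n, min = 1, max = 4))
lookup <- c(1, 2)

# Reference result from base R
base_res <- x %in% lookup

if (requireNamespace("collapse", quietly = TRUE)) {
  # fmatch() returns match positions; with nomatch = 0L, "> 0L" becomes a membership test
  fast_res <- collapse::fmatch(x, lookup, nomatch = 0L) > 0L
} else {
  # Fallback (assumption): reuse the base result when collapse is unavailable
  fast_res <- base_res
}

# Should be TRUE if both approaches agree element by element
identical(base_res, fast_res)
```

Wrapping each of the two computations in system.time() on your own data would show whether the speed-up carries over to your case.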
I expected plain boolean operators to be faster as well, but when I benchmarked the following version it was actually slower than the initial code:
df1 %>%
  mutate(
    A4 = (A2==1 | A2==2) & (A3==1 | A3==2),
    A5 = (A2==1 | A2==2) | (A3==1 | A3==2)
  )
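As for why the fsum attempt in the question returns FALSE everywhere, a likely explanation is that %in% applied to a whole data frame compares one element per column rather than one per row, so the sum over it can never reach the row-wise counts you want. A minimal base R sketch of a column-wise alternative is shown below. It rebuilds df1 from the values printed in the question, and the names cols, whole_df, and hits are illustrative only. It is not a collapse-specific solution, but it should pick up a faster %in% if one has been masked in.

```r
# df1 rebuilt from the values printed in the question
df1 <- data.frame(
  A1 = c(2, 3, 3, 3, 2, 1, 2, 2, 2, 3),
  A2 = c(1, 1, 3, 1, 2, 2, 2, 2, 2, 4),
  A3 = c(3, 2, 5, 4, 4, 3, 4, 3, 2, 3)
)
cols <- c("A2", "A3")  # columns to test (illustrative name)

# Diagnosis: %in% on a data frame yields one value per column, not per row
whole_df <- df1[cols] %in% c(1, 2)
length(whole_df)

# One logical vector per column, each as long as the data frame
hits <- lapply(df1[cols], function(x) x %in% c(1, 2))

# if_all equivalent: TRUE where every column matches
df1$A4 <- Reduce(`&`, hits)
# if_any equivalent: TRUE where at least one column matches
df1$A5 <- Reduce(`|`, hits)

df1
```

On the data above, A4 and A5 agree with the if_all and if_any output shown in the question. Because the work is done one full column at a time, this avoids any per-row function calls.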