
Intersect a vector with a dataframe column that has multiple values in tidyverse


Suppose I have these variables in R:

vals <- c("b", "c")
foo <- data.frame(x=c("a|b", "b|c", "c|d", "e|f|g"))

I'd like to add a column to foo that counts how many items from vals appear in each value of x, e.g.

> foo2
      x y
1   a|b 1
2   b|c 2
3   c|d 1
4 e|f|g 0

Row 1 gets 1 because "a|b" contains "b", row 2 gets 2 because "b|c" contains both "b" and "c", etc.

How do I do that with tidyverse functions?

I can split x, but the intersection isn't working. A couple of failed attempts:

library(dplyr)
library(magrittr)
library(stringr)  # provides str_split()

> foo2 <- foo %>% mutate(x1=str_split(x, "\\|"), y=intersect(vals, x1))
Error in `mutate()`:
ℹ In argument: `y = intersect(vals, x1)`.
Caused by error:
! `y` must be size 4 or 1, not 0.
> foo2 <- foo %>% mutate(x1=str_split(x, "\\|"), y=intersect(vals, x1[[1]]))
> foo2
      x      x1 y
1   a|b    a, b b
2   b|c    b, c b
3   c|d    c, d b
4 e|f|g e, f, g b

Solution

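  • Inside mutate(), intersect(vals, x1) compares vals against the list column as a whole, and intersect(vals, x1[[1]]) only uses the first row's values, so its single result "b" is recycled to every row. You need to map (or lapply) intersect over the split column so that it is applied separately to each row: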

    library(dplyr)
    library(purrr)
    
    foo |>
      mutate(
        xsplit = strsplit(x, split = "|", fixed = TRUE),
        intersect = map(xsplit, intersect, vals),
        y = lengths(intersect)
      )
    #       x  xsplit intersect y
    # 1   a|b    a, b         b 1
    # 2   b|c    b, c      b, c 2
    # 3   c|d    c, d         c 1
    # 4 e|f|g e, f, g           0
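
If the intermediate columns are not needed, the same row-wise idea can be condensed into a single step. The following is a minimal sketch rather than the answerer's own code: it assumes R 4.1 or later (for `|>` and the `\(parts)` lambda syntax), and the names `foo2`, `foo3`, and `parts` are illustrative. The second variant uses base R's `vapply()` in place of purrr, in the spirit of the "(or lapply)" remark above.

```r
library(dplyr)
library(purrr)

vals <- c("b", "c")
foo <- data.frame(x = c("a|b", "b|c", "c|d", "e|f|g"))

# purrr variant: split each string on a literal "|", then count, per row,
# how many distinct pieces also occur in vals.
foo2 <- foo |>
  mutate(
    y = map_int(
      strsplit(x, split = "|", fixed = TRUE),
      \(parts) length(intersect(parts, vals))
    )
  )

# Base R variant of the same computation, without purrr.
foo3 <- foo
foo3$y <- vapply(
  strsplit(foo$x, split = "|", fixed = TRUE),
  function(parts) length(intersect(parts, vals)),
  integer(1)
)

foo2
#       x y
# 1   a|b 1
# 2   b|c 2
# 3   c|d 1
# 4 e|f|g 0
```

Both `map_int()` and `vapply(..., integer(1))` return a plain integer vector, so `y` is an ordinary column rather than a list column.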