Tags: r, dplyr, tidyverse, stringr, tidytable

tidyverse: filter with str_detect


I want to use the filter() function from dplyr together with str_detect() from stringr.

library(tidyverse)

dt1 <- 
  tibble(
      No   = c(1, 2, 3, 4)
    , Text = c("I have a pen.", "I have a book.", "I have a pencile.", "I have a pen and a book.")
    )

dt1
# A tibble: 4 x 2
     No Text                    
  <dbl> <chr>                   
1     1 I have a pen.           
2     2 I have a book.          
3     3 I have a pencile.       
4     4 I have a pen and a book.


MatchText <- c("Pen", "Book")

dt1 %>% 
  filter(str_detect(Text,  regex(paste0(MatchText, collapse = '|'), ignore_case = TRUE)))

# A tibble: 4 x 2
     No Text                    
  <dbl> <chr>                   
1     1 I have a pen.           
2     2 I have a book.          
3     3 I have a pencile.       
4     4 I have a pen and a book.

Required Output

I want to produce the following output in a more efficient way, since in my real problem MatchText will contain many elements that are not known in advance.

dt1 %>% 
  filter(str_detect(Text,  regex("Pen", ignore_case = TRUE))) %>% 
  select(-Text) %>% 
  mutate(MatchText = "Pen") %>% 
  bind_rows(
    dt1 %>% 
      filter(str_detect(Text,  regex("Book", ignore_case = TRUE))) %>% 
      select(-Text) %>% 
      mutate(MatchText = "Book")
  )

# A tibble: 5 x 2
     No MatchText
  <dbl> <chr>    
1     1 Pen      
2     3 Pen      
3     4 Pen      
4     2 Book     
5     4 Book 

Any hints on how to accomplish the above task more efficiently?


Solution

    library(tidyverse)
    dt1 %>%
      mutate(
        result = str_extract_all(
          Text,
          regex(paste0("\\b", MatchText, "\\b", collapse = '|'), ignore_case = TRUE)
        )
      ) %>%
      unnest(result) %>%
      select(-Text)
    # # A tibble: 4 x 2
    #      No result
    #   <dbl> <chr> 
    # 1     1 pen   
    # 2     2 book  
    # 3     4 pen   
    # 4     4 book 
    

    I'm not sure what happened to the "whole words" part of your question after the edits. I left in the word boundaries to match whole words, but since "pen" isn't a whole-word match for "pencile", my result doesn't match yours. Remove the \\b on both sides of each pattern if you want partial-word matches.
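
If partial-word matches are what you want, and the output should carry the original spelling of each MatchText element ("Pen", "Book") rather than the extracted text ("pen", "book"), one possible approach is to filter once per pattern and stack the pieces. The following is only a minimal sketch under those assumptions: it rebuilds dt1 and MatchText from the question so that it runs on its own, and the name result is just a placeholder.

```r
library(tidyverse)

# Data and patterns copied from the question
dt1 <- tibble(
  No   = c(1, 2, 3, 4),
  Text = c("I have a pen.", "I have a book.",
           "I have a pencile.", "I have a pen and a book.")
)
MatchText <- c("Pen", "Book")

# For each pattern, keep the rows whose Text contains it (case-insensitive,
# partial words allowed) and label those rows with the pattern itself.
result <- map(
  MatchText,
  function(pattern) {
    dt1 %>%
      filter(str_detect(Text, regex(pattern, ignore_case = TRUE))) %>%
      transmute(No, MatchText = pattern)
  }
) %>%
  bind_rows()

result
```

This should give the five rows of the required output (No 1, 3, 4 for "Pen", then No 2, 4 for "Book"), because "pencile" counts as a match for "Pen" once the word boundaries are dropped. It scans Text once per pattern, so with a very long MatchText the single-regex str_extract_all() version above may be faster.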