Tags: r, regex, stringr, grepl, stringi

R - fast way to find all vector elements that contain all search terms


I have the same question as the one answered in "R - Find all vector elements that contain all strings / patterns - str_detect grep", but the suggested solution takes too long on my data.

I have 73,360 observations, each a sentence. I want TRUE returned for every sentence that contains ALL of the search strings.

sentences <- c("blue green red",
               "blue green yellow",
               "green red  yellow ")
search_terms <- c("blue","red")

pattern <- paste0("(?=.*", search_terms,")", collapse="") 
grepl(pattern, sentences, perl = TRUE)

Output:

[1]  TRUE FALSE FALSE

This gives the right result, but it takes a very long time on the full data. Is there a faster way? I tried str_detect and it was just as slow.

BTW, the sentences contain punctuation such as [],.- but no non-ASCII characters such as ñ.
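A minimal sketch of why punctuation can matter here. It assumes, hypothetically, that a search term itself contains a regex metacharacter (the question only says the sentences do); the data below is made up for illustration. In the lookahead pattern every term is interpreted as a regex, whereas `fixed = TRUE` matches each term literally.

```r
# Hypothetical data: one search term contains regex metacharacters.
sentences <- c("blue [green] red", "blue green yellow", "red. blue-ish")
search_terms <- c("blue", "[green]")

# Lookahead pattern built the same way as in the question.
pattern <- paste0("(?=.*", search_terms, ")", collapse = "")
pattern
# [1] "(?=.*blue)(?=.*[green])"

# As a regex, "[green]" is a character class (any one of g, r, e, n),
# so it matches far more than the literal text "[green]".
regex_hits <- grepl(pattern, sentences, perl = TRUE)
regex_hits
# [1] TRUE TRUE TRUE

# With fixed = TRUE each term is matched as a literal string.
fixed_hits <- Reduce(`&`, lapply(search_terms, grepl, x = sentences, fixed = TRUE))
fixed_hits
# [1]  TRUE FALSE FALSE
```

If the search terms are plain words, as in the example above, both approaches should agree.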

UPDATED: below are my benchmark results using the suggested methods, thanks to @onyambu's input.

Unit: milliseconds
                  expr       min        lq      mean    median        uq      max neval
         OP_solution() 7033.7550 7152.0689 7277.8248 7251.8419 7391.8664 7690.964   100
      map_str_detect() 2239.8715 2292.1271 2357.7432 2348.9975 2397.1758 2774.349   100
 unlist_lapply_fixed()  308.1492  331.9948  345.6262  339.9935  348.9907  586.169   100

Reduce_lapply wins! Thanks, @onyambu.

Unit: milliseconds
                  expr       min        lq      mean    median        uq       max neval
       Reduce_lapply()  49.02941  53.61291  55.96418  55.31494  56.76109  80.64735   100
 unlist_lapply_fixed() 318.25518 335.58883 362.03831 346.71509 357.97142 566.95738   100

Solution

  • EDIT: Another option is to loop over the search terms instead of over the sentences:

    Use:

    Reduce("&", lapply(search_terms, grepl, sentences, fixed = TRUE))
    [1]  TRUE FALSE FALSE
    

    Benchmark:

    Unit: milliseconds
                      expr      min        lq      mean    median        uq       max neval
             OP_solution()  80.6365  81.61575  85.76427  83.20265  87.32975  163.0302   100
          map_str_detect() 546.4681 563.08570 596.26190 571.52185 603.03980 1383.7969   100
     unlist_lapply_fixed()  61.8119  67.49450  71.41485  69.56290  73.77240  104.8399   100
           Reduce_lapply()   3.0604   3.11205   3.406012   3.14535   3.43130   6.3526   100
    

    Note that this is amazingly fast!

    OLD POST:

    Make use of the all function as shown below:

    unlist(lapply(strsplit(sentences, " ", fixed = TRUE), \(x)all(search_terms %in% x)))
    

    The benchmark:

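    # Regex lookahead approach from the question.
    OP_solution <- function() {
      pattern <- paste0("(?=.*", search_terms, ")", collapse = "")
      grepl(pattern, sentences, perl = TRUE)
    }

    # Loop over the sentences; str_detect is vectorised over the search terms.
    map_str_detect <- function() {
      purrr::map_lgl(
        .x = sentences,
        .f = ~ all(stringr::str_detect(.x, search_terms))
      )
    }

    # Split on spaces and test whole-word membership (\(x) needs R >= 4.1).
    unlist_lapply_fixed <- function() {
      unlist(lapply(strsplit(sentences, " ", fixed = TRUE),
                    \(x) all(search_terms %in% x)))
    }

    # Scale the 3 example sentences up to 30,000 elements.
    sentences <- rep(sentences, 10000)
    microbenchmark::microbenchmark(OP_solution(), map_str_detect(),
                                   unlist_lapply_fixed(), check = 'equal')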
    Unit: milliseconds
                      expr      min        lq      mean    median        uq      max neval
             OP_solution()  80.5368  81.40265  85.14451  82.73985  86.41345 118.7052   100
          map_str_detect() 542.3555 553.84080 587.15748 566.66570 607.77130 782.5189   100
     unlist_lapply_fixed()  60.4955  66.94420  71.94195  69.30135  72.16735 113.6567   100
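
To make the `Reduce` approach easier to follow, here is a small sketch that splits it into its two steps, using the example data from the question. The names `per_term`, `all_terms`, `extra`, `substring_hit` and `word_hit` are made up for illustration. The last part shows a difference that may or may not matter for your data: `grepl(..., fixed = TRUE)` matches substrings, while the `strsplit` + `%in%` version only matches whole space-separated words.

```r
sentences <- c("blue green red",
               "blue green yellow",
               "green red  yellow ")
search_terms <- c("blue", "red")

# Step 1: one logical vector per search term; grepl is vectorised over x.
per_term <- lapply(search_terms, grepl, x = sentences, fixed = TRUE)
# per_term[[1]] ("blue"): TRUE  TRUE FALSE
# per_term[[2]] ("red"):  TRUE FALSE  TRUE

# Step 2: element-wise AND across the per-term vectors.
all_terms <- Reduce(`&`, per_term)
all_terms
# [1]  TRUE FALSE FALSE

# Substring vs whole-word matching, on a hypothetical extra sentence.
extra <- "bluered sky"
substring_hit <- all(vapply(search_terms, grepl, logical(1), x = extra, fixed = TRUE))
word_hit <- all(search_terms %in% strsplit(extra, " ", fixed = TRUE)[[1]])
substring_hit  # TRUE: "blue" and "red" both occur inside "bluered"
word_hit       # FALSE: neither term is a whole word here
```

The speed most likely comes from the loop running once per search term rather than once per sentence, so there are only `length(search_terms)` vectorised `grepl` calls instead of tens of thousands of R-level function calls.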