r stringr

Search multiple keywords over a column and create columns for each


I have the following data.

stringstosearch <- c("to", "and", "at", "from", "is", "of")

set.seed(199)
datatxt <- data.frame(id = c(rnorm(5)), 
                       x = c("Contrary to popular belief, Lorem Ipsum is not simply random text.",
       "A Latin professor at Hampden-Sydney College in Virginia",
       "It has roots in a piece of classical Latin ", 
       "literature from 45 BC, making it over 2000 years old.", 
       "The standard chunk of Lorem Ipsum used since"))

I want to search for each of the keywords listed in stringstosearch and create a column for each one holding the result.

I tried

library(stringr)
datatxt$result <- str_detect(datatxt$x, paste0(stringstosearch, collapse = '|'))

which returns

> datatxt$result
[1] TRUE TRUE TRUE TRUE TRUE

However, I am looking for an approach which creates a boolean vector for each word in stringstosearch, i.e.

          id                                                                  x    to   and    at  from    is    of
1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text.  TRUE FALSE FALSE FALSE  TRUE FALSE
2  0.5551667            A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE  TRUE FALSE FALSE FALSE
3 -2.2163365                        It has roots in a piece of classical Latin  FALSE FALSE FALSE FALSE FALSE  TRUE
4  0.4941455              literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE  TRUE FALSE FALSE
5 -0.5805710                       The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE  TRUE

Any idea how to achieve this?


Solution

  • Here is a base R one-liner. Use sprintf() to wrap each keyword in \\b word-boundary anchors, so that, for example, "and" does not match inside "random". Then iterate over these patterns with lapply(), using grepl() to test each pattern against datatxt$x. This returns a list of logical vectors, one per keyword, which we can assign back to the data frame as columns named after the keywords. (The \(x) lambda shorthand requires R 4.1 or later.)

    datatxt[stringstosearch] <- lapply(
        sprintf("\\b%s\\b", stringstosearch), \(x) grepl(x, datatxt$x)
    )
    

    Now datatxt is as desired:

              id                                                                  x    to   and    at  from    is    of
    1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text.  TRUE FALSE FALSE FALSE  TRUE FALSE
    2  0.5551667            A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE  TRUE FALSE FALSE FALSE
    3 -2.2163365                        It has roots in a piece of classical Latin  FALSE FALSE FALSE FALSE FALSE  TRUE
    4  0.4941455              literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE  TRUE FALSE FALSE
    5 -0.5805710                       The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE  TRUE
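
    To see what the \\b anchors change, here is a minimal sketch on a single string. The variable names (txt, unanchored, anchored, patterns) are made up for illustration and are not part of the code above.

    ```r
    # Hypothetical one-element input, used only to illustrate the anchors.
    txt <- "Lorem Ipsum is not simply random text."

    # Without anchors, "and" is found inside "random".
    unanchored <- grepl("and", txt)

    # With \\b on both sides, "and" must stand alone as a word.
    anchored <- grepl("\\band\\b", txt)

    # sprintf() is vectorised, so one call wraps every keyword.
    patterns <- sprintf("\\b%s\\b", c("and", "is"))

    print(unanchored)
    print(anchored)
    print(patterns)
    ```

    The first grepl() call reports a match and the second does not, which is why the anchored patterns give FALSE in the "and" column for rows containing "random" or "standard".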
    

    tidyverse approach

    As you tagged tidyverse, here is an alternative method. purrr::map() returns the same list of logical vectors as the base R approach, except that purrr::set_names() names each element after its keyword. We can then use the splice operator (!!!) to pass the list elements to dplyr::mutate() as new columns:

    datatxt |>
        dplyr::mutate(
            !!!purrr::map(
                purrr::set_names(
                    stringr::str_glue("\\b{stringstosearch}\\b"),
                    stringstosearch
                ),
                # Use datatxt$x: !!! evaluates this list before mutate() sets up its data mask
                \(str) stringr::str_detect(datatxt$x, str)
            )
        )
    # ^^ same output
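
    If the splice operator is unfamiliar, the sketch below shows it in isolation. It assumes dplyr is installed; toy, new_cols and spliced are hypothetical names, not objects from the question.

    ```r
    library(dplyr)

    # Hypothetical toy data, not the question's data frame.
    toy <- data.frame(id = 1:2)

    # A named list of equal-length vectors, built outside mutate().
    new_cols <- list(flag_a = c(TRUE, FALSE), flag_b = c(FALSE, TRUE))

    # !!! unpacks the list so each element arrives as its own named argument,
    # as if we had written mutate(toy, flag_a = ..., flag_b = ...).
    spliced <- mutate(toy, !!!new_cols)

    print(names(spliced))
    ```

    The list names become the column names, which is why the keyword patterns are passed through purrr::set_names() before being spliced.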
    

    I think the base R approach is much cleaner.