I have the following data.
stringstosearch <- c("to", "and", "at", "from", "is", "of")
set.seed(199)
datatxt <- data.frame(id = c(rnorm(5)),
                      x = c("Contrary to popular belief, Lorem Ipsum is not simply random text.",
                            "A Latin professor at Hampden-Sydney College in Virginia",
                            "It has roots in a piece of classical Latin ",
                            "literature from 45 BC, making it over 2000 years old.",
                            "The standard chunk of Lorem Ipsum used since"))
I want to search for each keyword listed in stringstosearch and create one column per keyword holding the result.
I tried
library(stringr)
datatxt$result <- str_detect(datatxt$x, paste0(stringstosearch, collapse = '|'))
which returns
> datatxt$result
[1] TRUE TRUE TRUE TRUE TRUE
However, I am looking for an approach that creates a boolean vector for each word in stringstosearch, i.e.
id x to and at from is of
1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text. TRUE FALSE FALSE FALSE TRUE FALSE
2 0.5551667 A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE TRUE FALSE FALSE FALSE
3 -2.2163365 It has roots in a piece of classical Latin FALSE FALSE FALSE FALSE FALSE TRUE
4 0.4941455 literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE TRUE FALSE FALSE
5 -0.5805710 The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE TRUE
Any idea how to achieve this?
Here is a base R one-liner. Use sprintf() to add the \\b word boundary anchors to each pattern. This means that, for example, "and" will not match "random". Then iterate over these patterns with lapply(), using grepl() to match each pattern to datatxt$x. This returns a list of logical vectors, which we can assign back to the data frame.
# Wrap each keyword in \\b anchors, then test it against every row of datatxt$x
datatxt[stringstosearch] <- lapply(
    sprintf("\\b%s\\b", stringstosearch), \(x) grepl(x, datatxt$x)
)
Now datatxt is as desired:
id x to and at from is of
1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text. TRUE FALSE FALSE FALSE TRUE FALSE
2 0.5551667 A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE TRUE FALSE FALSE FALSE
3 -2.2163365 It has roots in a piece of classical Latin FALSE FALSE FALSE FALSE FALSE TRUE
4 0.4941455 literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE TRUE FALSE FALSE
5 -0.5805710 The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE TRUE
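To see what the word boundary anchors change, here is a small sketch. The vector text and its three strings are made up for illustration and are not part of the question's data.

```r
# Hypothetical example strings, chosen so that "and" appears once inside
# another word ("random") and once as a word of its own.
text <- c("simply random text", "salt and pepper", "a piece of cake")

# Build the anchored pattern the same way as in the one-liner above.
anchored <- sprintf("\\b%s\\b", "and")

plain_match <- grepl("and", text)     # substring match: also hits "random"
word_match  <- grepl(anchored, text)  # whole-word match only

plain_match
word_match
```

The first call matches the first two strings, while the anchored pattern matches only the second one.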
tidyverse approach
As you tagged tidyverse, here is an alternative method. Using tidyverse functions, this builds the same list as the base R approach, except that it is named. We can then use the splice operator (!!!) to pass it to dplyr::mutate() as new columns:
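datatxt |>
    dplyr::mutate(
        !!!purrr::map(
            # Named vector of patterns: the names become the new column names
            purrr::set_names(
                stringr::str_glue("\\b{stringstosearch}\\b"),
                stringstosearch
            ),
            # !!! is evaluated before mutate() applies its data mask,
            # so refer to the column as datatxt$x rather than bare x
            \(str) stringr::str_detect(datatxt$x, str)
        )
    )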
# ^^ same output
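If the splice operator is unfamiliar, the following minimal sketch shows what !!! does with a named list. The data frame df and the list new_cols are hypothetical and only serve the illustration; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, not the question's datatxt.
df <- data.frame(n = 1:3)

# A named list of vectors, one element per column to add.
new_cols <- list(double = df$n * 2, is_odd = df$n %% 2 == 1)

# !!! unpacks the list so each element becomes its own named argument,
# as if we had written mutate(df, double = ..., is_odd = ...).
out <- mutate(df, !!!new_cols)
names(out)
```

The list names become the column names, which is why the patterns above are named with purrr::set_names() before being spliced.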
I think the base R approach is much cleaner.