I have a large variable containing strings (words). I need to extract all substrings that match any of the patterns listed in a separate vector.
library(tidyverse)
df <- data.frame(Word = c("hope", "freeze", "free"))
patterns <- "hope|freeze|free|du|li|un|de|em|bi|en|im|ro|gi|ai|ag|wo|ab|di|ac|eu|ic|se|al|ob|ig|es|ef|sy|ep|ec|y|u|e|o|a|h|i"
df %>%
  mutate(simple = str_extract_all(Word, patterns))
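However, it looks like the result depends on the order of the alternatives in the pattern: at each position the function returns whichever alternative matches first, and no overlapping matches. So, for example, if patterns has the order shown above (longest alternatives first), the result will be: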
    Word simple
1   hope   hope
2 freeze freeze
3   free   free
If the order is reversed (i.e., ascending order with respect to length), the result changes:
patterns2 <- "y|u|e|o|a|h|i|du|li|un|de|em|bi|en|im|ro|gi|ai|ag|wo|ab|di|ac|eu|ic|se|al|ob|ig|es|ef|sy|ep|ec|hope|freeze|free"
df %>%
  mutate(simple = str_extract_all(Word, patterns2))
    Word  simple
1   hope h, o, e
2 freeze  freeze
3   free    free
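The same behaviour shows up with a single string. Here is a minimal sketch, assuming only stringr is loaded; the shortened patterns are just for illustration and are not the full list above. The alternatives appear to be tried left to right at each position, and matches never overlap:

```r
library(stringr)

# Longest alternative listed first: the whole word is matched in one go
long_first <- str_extract_all("hope", "hope|h|o|e")[[1]]
long_first
# [1] "hope"

# Single letters listed first: "h" matches at the first position,
# so the later alternative "hope" is never reached
short_first <- str_extract_all("hope", "h|o|e|hope")[[1]]
short_first
# [1] "h" "o" "e"
```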
Is there a way to get all potential patterns, regardless of the order of the patterns? Here's the desired output:
    Word        simple
1   hope h, o, e, hope
2 freeze        freeze
3   free          free
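You can split the pattern into a vector of sub-patterns and then test each sub-pattern against each word separately. Because every sub-pattern is checked on its own, overlapping matches such as "hope" and "h" are all kept; the order of the alternatives only affects the order of the output. (str_split_1() requires stringr 1.5.0 or later.)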
pat_vec <- str_split_1(patterns, fixed('|'))
df %>%
  mutate(simple = lapply(Word, \(x) pat_vec[str_which(x, pat_vec)]))
#     Word          simple
# 1   hope   hope, e, o, h
# 2 freeze freeze, free, e
# 3   free         free, e
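If the sub-patterns are meant as literal text rather than regular expressions, or if a plain character column is easier to work with than a list column, a variant along these lines might help. This is only a sketch: `words`, `found`, `simple_chr`, and the shortened `pat_vec` are names and values chosen for illustration, not part of the original data.

```r
library(stringr)

words <- c("hope", "freeze", "free")
# Shortened, hypothetical subset of the sub-patterns
pat_vec <- c("hope", "freeze", "free", "e", "o", "h")

# For each word, keep the sub-patterns that occur in it as literal text;
# fixed() switches off regex interpretation of characters such as "." or "+"
found <- lapply(words, \(w) {
  pat_vec[vapply(pat_vec, \(p) str_detect(w, fixed(p)), logical(1))]
})

# Collapse each result into a single comma-separated string
simple_chr <- vapply(found, toString, character(1))
simple_chr
# [1] "hope, e, o, h"   "freeze, free, e" "free, e"
```

The result can be attached with `mutate(simple = simple_chr)` if a character column is preferred over a list column.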