I need to identify which group a sentence belongs to based on the specific words it contains, for example which color is used to describe an animal. I have a data frame of sentences and a dictionary of the words I want to detect:
df <- data.frame(id = c(1:5), pets = c("brown dog", "black cat", "orange cat", "black bird", "white hamster"))
dictionary <- c("black", "orange", "white", "brown", "green", "red")
I need to match each pet against the dictionary and record which color it matches, so that my final df looks like this:
final_df <- data.frame(id = c(1:5),
                       pets = c("brown dog", "black cat", "orange cat", "black bird", "white hamster"),
                       color = c("brown", "black", "orange", "black", "white"))
Using the stringr package:
library(stringr)
regex <- str_c("\\b", dictionary, "\\b", collapse = "|")
df$color <- str_extract(df$pets, regex)
# "brown" "black" "orange" "black" "white"
In base R:
regex <- paste0(".*(", paste0("\\b", dictionary, "\\b", collapse = "|"), ").*")
df$color <- sub(regex, "\\1", df$pets)
# "brown" "black" "orange" "black" "white"
Both solutions follow the same approach. First, build a regular expression from dictionary. This regular expression matches any of the complete words: black, orange, etc.
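One place where the two versions differ is a string with no dictionary word at all: str_extract() returns NA, while sub() returns its input unchanged because nothing was substituted. The sketch below shows this and one possible base R variant that yields NA instead. The "purple fish" row and the names wrapped, pattern, m and color are illustrative assumptions, not part of the original data.

```r
dictionary <- c("black", "orange", "white", "brown", "green", "red")
pets <- c("brown dog", "purple fish")  # second row is a made-up non-matching case

# Same pattern shape as the base R solution: capture the color, swallow the rest
wrapped <- paste0(".*(", paste0("\\b", dictionary, "\\b", collapse = "|"), ").*")
sub_result <- sub(wrapped, "\\1", pets)
# "purple fish" contains no dictionary word, so sub() leaves it untouched

# Variant that gives NA for non-matches: locate the first match with regexpr()
# and fill in only the rows where a match was found
pattern <- paste0("\\b(", paste(dictionary, collapse = "|"), ")\\b")
m <- regexpr(pattern, pets)
color <- rep(NA_character_, length(pets))
color[m != -1] <- regmatches(pets, m)

print(sub_result)
print(color)
```

Which behavior is preferable depends on whether unmatched rows should be flagged as missing or passed through as they are.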
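Because each word is wrapped in word boundaries (\\b), the pattern matches these words only when they stand alone as whole words. For example, given whitetail deer, it would not extract "white". A hyphen counts as a boundary, though, so white-tailed deer would still yield "white".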
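To see what the word boundaries change, it can help to compare the bounded pattern with a plain alternation on a few edge cases. The test strings and the names pattern_bounded, pattern_plain, bounded and plain below are made up for illustration. Note that a hyphen is a non-word character, so \\b does match between "white" and "-".

```r
dictionary <- c("black", "orange", "white", "brown", "green", "red")

# Pattern with word boundaries, as in the solutions above
pattern_bounded <- paste0("\\b", dictionary, "\\b", collapse = "|")
# Same alternation without boundaries, for comparison only
pattern_plain <- paste(dictionary, collapse = "|")

pets <- c("whitetail deer", "white-tailed deer", "redhead duck", "red fox")

# TRUE where a dictionary word appears as a whole word
bounded <- grepl(pattern_bounded, pets)
# TRUE where a dictionary word appears anywhere, even inside a longer word
plain <- grepl(pattern_plain, pets)

print(data.frame(pets, bounded, plain))
```

Only the hyphenated and stand-alone cases match the bounded pattern, while the plain alternation matches all four strings.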
If a string can contain several colors (e.g. "black white bear"), I would recommend using str_extract_all() instead, which returns every match rather than only the first.
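A minimal sketch of the multi-color case, assuming the same dictionary as above. The example strings and the comma-separated collapsing step are illustrative choices, not something the question asked for.

```r
library(stringr)

dictionary <- c("black", "orange", "white", "brown", "green", "red")
regex <- str_c("\\b", dictionary, "\\b", collapse = "|")

pets <- c("black white bear", "orange cat", "purple fish")  # made-up examples

# str_extract_all() returns a list with one character vector per input string;
# a string with no match gives a zero-length vector
all_colors <- str_extract_all(pets, regex)

# One way to store the result in a single column: join each vector with commas
collapsed <- vapply(all_colors, paste, character(1), collapse = ", ")

print(all_colors)
print(collapsed)
```

With this collapsing step, a string with no match becomes an empty string rather than NA, so it may need separate handling before being stored in df.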