r string dictionary group-by match

Make groups using a dictionary in R


I need to identify the group a sentence belongs to based on specific words it contains, for example which color has been used to describe an animal. I have a data frame of sentences and a dictionary of the words I want to identify:

df <- data.frame(id = c(1:5), pets = c("brown dog", "black cat", "orange cat", "black bird", "white hamster"))

dictionary <- c("black", "orange", "white", "brown", "green", "red")

I need to match the pets against the dictionary and record which category each one falls into, so that my final df looks like this:

final_df <- data.frame(id = c(1:5), 
pets = c("brown dog", "black cat", "orange cat", "black bird", "white hamster"), 
color = c("brown", "black", "orange", "black", "white"))

Solution

  • Using the stringr package:

    library(stringr)
    
    regex <- str_c("\\b", dictionary, "\\b", collapse = "|")
    df$color <- str_extract(df$pets, regex)
    # "brown"  "black"  "orange" "black"  "white" 
    

    In base R:

    regex <- paste0(".*(", paste0("\\b", dictionary, "\\b", collapse = "|"), ").*")
    
    df$color <- sub(regex, "\\1", df$pets)
    # "brown"  "black"  "orange" "black"  "white" 
    

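    Both solutions take the same approach. First, build a regular expression from dictionary that matches any of the complete words: black, orange, etc. Then pull the matched word out of each string. They differ when no dictionary word is present: str_extract() returns NA, whereas sub() returns the input string unchanged.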
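
    To make the pattern concrete, here is a minimal base R sketch that prints the constructed regular expression and shows what sub() does when no dictionary word is present. The pets vector and the names full, unchanged and color are made up for illustration.

    ```r
    dictionary <- c("black", "orange", "white", "brown", "green", "red")

    # Wrap each word in word boundaries and join the alternatives with "|"
    regex <- paste0("\\b", dictionary, "\\b", collapse = "|")
    cat(regex, "\n")
    # \bblack\b|\borange\b|\bwhite\b|\bbrown\b|\bgreen\b|\bred\b

    # Hypothetical input: the second string contains no dictionary word
    pets <- c("brown dog", "grey parrot")
    full <- paste0(".*(", regex, ").*")

    # sub() leaves a non-matching string unchanged
    unchanged <- sub(full, "\\1", pets)
    # "brown"       "grey parrot"

    # Guard with grepl() to get NA instead, as str_extract() would return
    color <- ifelse(grepl(regex, pets), sub(full, "\\1", pets), NA_character_)
    # "brown" NA
    ```

    The grepl() guard is only one way to do this; if unmatched rows cannot occur in your data, the plain sub() call is enough.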

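    Because of the word boundaries (\\b, which is the regex \b escaped inside an R string), the pattern matches these words only as whole words, never as part of a longer word. For example, "whitetail deer" would not yield "white". Note that a hyphen is not a word character, so "white-tailed deer" would still yield "white".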
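
    A quick way to check how the word boundary behaves is to test a single dictionary word against a few strings. This is a small sketch in base R; the test strings are made up for illustration. A hyphen is not a word character, so \b also matches between "white" and "-".

    ```r
    # Single-word pattern with word boundaries on both sides
    regex <- "\\bwhite\\b"

    # Hypothetical test strings
    x <- c("white hamster", "whitetail deer", "off-white cat", "white-tailed deer")

    hits <- grepl(regex, x)
    hits
    # TRUE FALSE  TRUE  TRUE
    ```

    Only "whitetail deer" is rejected, because there "white" runs straight into another letter.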

    If there can be multiple colors in the same string (e.g. "black white bear"), I would recommend using str_extract_all(), which returns every match rather than only the first.
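
    As a rough sketch of the multi-color case, str_extract_all() returns a list with one character vector per input string, which you can then collapse into a single column if needed. The pets vector, the names matches and colors, and the ", " separator are assumptions for illustration.

    ```r
    library(stringr)

    dictionary <- c("black", "orange", "white", "brown", "green", "red")
    regex <- str_c("\\b", dictionary, "\\b", collapse = "|")

    # Hypothetical input with two, one and zero colors
    pets <- c("black white bear", "brown dog", "grey parrot")

    # One character vector of matches per input string
    matches <- str_extract_all(pets, regex)
    # [[1]] "black" "white"   [[2]] "brown"   [[3]] character(0)

    # Collapse each set of matches into one string, e.g. for a data frame column
    colors <- vapply(matches, paste, character(1), collapse = ", ")
    # "black, white" "brown"        ""
    ```

    Strings without any dictionary word end up as an empty string here; replace those with NA afterwards if you prefer that convention.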