I have a data frame with a bunch of text strings. In a second data frame I have a list of phrases that I'm using as a lookup table. I want to search the text strings for all possible phrase matches in the lookup table.
My problem is that some of the phrases have overlapping words. For example: "eggs" and "green eggs".
library(udpipe)
library(dplyr)
library(stringr)  # provides str_count(), used below
# Download and load the English model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
# Create example data
sample <- data.frame(doc_id = 1, text = "the cat in the hat ate green eggs and ham")
phrases <- data.frame(phrase = c("cat", "hat", "eggs", "green eggs", "ham", "the cat"))
# Tokenize text
x <- udpipe_annotate(ud_model, x = sample$text, doc_id = sample$doc_id)
x <- as.data.frame(x)
x$token <- tolower(x$token)
test_results <- x %>% select(doc_id, token)
test_results$term <- txt_recode_ngram(test_results$token,
                                      compound = phrases$phrase,
                                      ngram = str_count(phrases$phrase, '\\w+'),
                                      sep = " ")
# Remove any tokens that don't match a phrase in the lookup table
test_results <- filter(test_results, term %in% phrases$phrase)
In the results you can see that "the cat" is returned but not "cat", and "green eggs" is returned but not "eggs".
> test_results$term
[1] "the cat" "hat" "green eggs" "ham"
How can I find all possible phrase matches between a text string and a lookup table?
I should add that I'm not wedded to any particular package. I'm just using udpipe here because I'm most familiar with it.
I think you can simply use grepl to check whether one string occurs inside another. From there, you apply grepl to every phrase in the lookup table:
# Create example data
sample <- data.frame(doc_id = 1, text = "the cat in the hat ate green eggs and ham")
phrases <- data.frame(phrase = c("cat", "hat", "eggs", "green eggs", "ham", "the cat"))
# TRUE/FALSE for each phrase: does it occur in the text?
apply(phrases, 1, grepl, sample$text)
And if you want the matching phrases themselves, you can subset with that result:
phrases[apply(phrases, 1, grepl, sample$text), ]
But a data frame may not be the most suitable type for the phrases; a plain character vector would be simpler to work with.
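One caveat with plain grepl is that it matches substrings, so "hat" would also be found inside "that". Here is a minimal sketch of the same idea with the phrases held in a character vector and word boundaries added around each pattern. The helper name find_phrases and the second example text are made up for illustration, and the sketch assumes the phrases contain no regex metacharacters.

```r
# Example text from the question, plus a made-up second text to show
# that partial words ("that", "hamster") are not matched
texts <- c("the cat in the hat ate green eggs and ham",
           "that is a hamster")
phrases <- c("cat", "hat", "eggs", "green eggs", "ham", "the cat")

# Hypothetical helper: returns every phrase that occurs in `text` as whole words.
# Assumption: phrases are plain words and spaces, with no regex metacharacters.
find_phrases <- function(text, phrases) {
  # Wrap each phrase in word boundaries so "hat" does not match "that"
  patterns <- paste0("\\b", phrases, "\\b")
  # Test each pattern against the text independently, so overlapping
  # phrases such as "eggs" and "green eggs" are both reported
  hits <- vapply(patterns, grepl, logical(1),
                 x = text, ignore.case = TRUE, perl = TRUE)
  phrases[hits]
}

# One character vector of matched phrases per text
matches <- lapply(texts, find_phrases, phrases = phrases)
```

Because each phrase is tested on its own, a longer phrase never hides a shorter one. matches[[1]] should contain all six phrases, and matches[[2]] should be empty.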