I am trying to look for a list of keywords in a text. Some of these keywords are n-grams. However, TermDocumentMatrix only finds single words. I have already looked at several similar questions, such as this one (from which I borrowed the custom tokenizer function), this one, and many more. However, none of the proposed solutions worked for me. I tried with both R 3.6.3 and R 4.1.2, with no success. Any ideas why?
Below is a minimal reproducible example of my code:
library(RWeka)
library(tm)
# List of keywords
my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")
text <- c("Just a sample text that contains the words I am looking for.",
"Words such as cheese are detected by tm, but others like spicy salami",
"or sweet chili sauce are not.")
# Create a corpus
text_corpus <- VCorpus(VectorSource(text)) # Switched from Corpus to VCorpus as suggested in some of the solutions on stackoverflow
## Custom tokenizer function
myTokenizer <- function(x) {NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))}
matrix <- as.matrix(TermDocumentMatrix(text_corpus,
                                       list(control = list(tokenize = myTokenizer),
                                            dictionary = my_keywords,
                                            list(wordLengths = c(1, Inf)))))
words <- sort(rowSums(matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
A solution with only tm and NLP. No need for RWeka, which depends on rJava. Note that you had a mistake in the control portion of TermDocumentMatrix: you wrapped the options in a list before control, but the list should only come after control =. And wordLengths doesn't need a list of its own; it belongs in the control list like the other options.
The tokenizer I created produces tokens of length 1, 2 and 3; otherwise "cheese" would not be picked up. Adjust the lengths as needed.
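As an aside (not needed for the solution below), you can avoid hard-coding the range by deriving it from the keyword list itself with base R:
# Sketch: derive the n-gram range from the keywords.
# strsplit() splits each keyword on spaces; lengths() counts the words.
range(lengths(strsplit(my_keywords, " ")))
# [1] 1 3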
library(tm)
# List of keywords
my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")
text <- c("Just a sample text that contains the words I am looking for.",
"Words such as cheese are detected by tm, but others like spicy salami",
"or sweet chili sauce are not.")
# Create a corpus
text_corpus <- VCorpus(VectorSource(text))
## Custom tokenizer function: builds 1- to 3-grams with NLP's ngrams() and words()
## (NLP is attached along with tm)
myTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 1:3), paste, collapse = " "), use.names = FALSE)
}
mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                    control = list(tokenize = myTokenizer,
                                                   dictionary = my_keywords,
                                                   wordLengths = c(1, Inf))))
mat
                   Docs
Terms               1 2 3
  cheese            0 1 0
  spicy salami      0 1 0
  sweet chili sauce 0 0 1
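Your frequency table from the question then works unchanged on mat:
words <- sort(rowSums(mat), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
df
# Each keyword appears exactly once in the sample text, so all freq values are 1.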