Tags: r, tm, n-gram, rweka

Text analysis with dictionary of words: NGramTokenizer not working


I am trying to look for a list of keywords in a text. Some of these keywords are n-grams. However, TermDocumentMatrix only finds the single-word keywords. I already had a look at several similar questions, like this one (from which I borrowed the custom tokenizer function), this one, and many more. However, none of the proposed solutions worked for me. I tried both R 3.6.3 and R 4.1.2, with no success. Any ideas why?

Below is a minimal reproducible example of my code:

library(RWeka) 
library(tm)

# List of keywords
my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")

text <- c("Just a sample text that contains the words I am looking for.",
          "Words such as cheese are detected by tm, but others like spicy salami",
          "or sweet chili sauce are not.")
  
# Create a corpus  
text_corpus <- VCorpus(VectorSource(text)) # switched from Corpus to VCorpus, as suggested in some Stack Overflow answers
  
## Custom tokenizer function
myTokenizer <- function(x) {NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))}

matrix <- as.matrix(TermDocumentMatrix(text_corpus,
                                       list(control = list (tokenize = myTokenizer),
                                            dictionary = my_keywords,
                                            list(wordLengths=c(1, Inf))
                                       )
))
  
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words), freq=words)

Solution

  • A solution with only tm and NLP. No need for RWeka, which depends on rJava. Note that you had a mistake in the control portion of TermDocumentMatrix: you wrapped control inside a list, but control itself is the named argument that takes the list of options. Also, wordLengths doesn't need its own list; it belongs inside the control list alongside the other options.

    The tokenizer I created generates tokens of length 1, 2, and 3; otherwise "cheese" would not be picked up. Adjust the lengths as needed.

    library(tm)
    library(NLP) # provides ngrams() and words(); attached automatically with tm, loaded here for clarity
    
    # List of keywords
    my_keywords <- c("cheese", "spicy salami", "sweet chili sauce")
    
    text <- c("Just a sample text that contains the words I am looking for.",
              "Words such as cheese are detected by tm, but others like spicy salami",
              "or sweet chili sauce are not.")
    
    # Create a corpus  
    text_corpus <- VCorpus(VectorSource(text)) 
    
    ## Custom tokenizer function: generates 1-, 2-, and 3-grams
    myTokenizer <- function(x) {
      unlist(lapply(ngrams(words(x), 1:3), paste, collapse = " "), use.names = FALSE)
    }
    
    mat <- as.matrix(TermDocumentMatrix(text_corpus,
                                        control = list(tokenize = myTokenizer,
                                                       dictionary = my_keywords,
                                                       wordLengths = c(1, Inf))
                                        ))
    
    mat
                       Docs
    Terms               1 2 3
      cheese            0 1 0
      spicy salami      0 1 0
      sweet chili sauce 0 0 1
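
    With the corrected matrix, you can build the keyword frequency table from the end of your original post in the same way. (I renamed the variable from words to freqs so it doesn't shadow NLP's words(), which the tokenizer uses.)

    # Keyword frequencies summed over all documents
    freqs <- sort(rowSums(mat), decreasing = TRUE)
    df <- data.frame(word = names(freqs), freq = freqs)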
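
    For completeness, the original RWeka tokenizer may also work once the control list is fixed in the same way, provided rJava is installed and working. This is an untested sketch, not part of the solution above; note min = 1, so that single-word keywords like "cheese" are produced too:

    library(RWeka)

    # Same idea, but with RWeka's NGramTokenizer generating 1- to 3-grams
    myWekaTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

    mat_weka <- as.matrix(TermDocumentMatrix(text_corpus,
                                             control = list(tokenize = myWekaTokenizer,
                                                            dictionary = my_keywords,
                                                            wordLengths = c(1, Inf))))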