Tags: r, nlp, tokenize, tm, n-gram

Create Document Term Matrix with N-Grams in R


I am using the "tm" package to create a DocumentTermMatrix in R. It works well for unigrams, but I am trying to create a DocumentTermMatrix of n-grams (N = 3 for now) using the tm package together with the tokenize_ngrams function from the "tokenizers" package, and I am not able to create it.

I searched for possible solutions but didn't get much help. For privacy reasons I cannot share the data. Here is what I have tried:

library(tm)  
library(tokenizers)

data is a data frame with around 4.5k rows and two columns, "doc_id" and "text".
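
Since I cannot share the real data, here is a dummy frame with the same structure, just so the code below is reproducible (the real frame has ~4.5k rows):

data = data.frame(doc_id = c("doc1", "doc2", "doc3"),
                  text = c("first dummy document text",
                           "second dummy document text",
                           "third dummy document text"),
                  stringsAsFactors = FALSE)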

data_corpus = Corpus(DataframeSource(data))

Custom function for n-gram tokenization:

ngram_tokenizer = function(x){
  temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
  return(temp)
}
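
For example, calling tokenize_ngrams directly on a test sentence (just to show the output format, not my real data) returns a list with one character vector holding every 1- to 3-gram:

tokenize_ngrams("the quick brown fox", n_min = 1, n = 3, ngram_delim = "_")
# a list of length 1; its element contains all unigrams, bigrams and
# trigrams of the sentence, joined by "_"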

Control lists for DTM creation:
For 1-gram tokenization:

control_list_unigram = list(tokenize = "words",
                          removePunctuation = FALSE,
                          removeNumbers = FALSE, 
                          stopwords = stopwords("english"), 
                          tolower = T, 
                          stemming = T, 
                          weighting = function(x)
                            weightTf(x)
)

For n-gram tokenization:

control_list_ngram = list(tokenize = ngram_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE, 
                          stopwords = stopwords("english"), 
                          tolower = T, 
                          stemming = T, 
                          weighting = function(x)
                            weightTf(x)
)


dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

dim(dtm_unigram)
dim(dtm_ngram)

The dimensions of both DTMs were the same, so the n-gram tokenizer clearly wasn't being applied.
Please correct me!


Solution

  • Unfortunately tm has some quirks that are annoying and not always clear. First of all, a custom tokenizer doesn't seem to work on corpora created with Corpus(). You need to use VCorpus() for this.

    So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).

    That is one issue tackled. The corpus will now work for tokenizing, but you will run into an issue with tokenize_ngrams. You will get the following error:

    Input must be a character vector of any length or a list of character
      vectors, each of which has a length of 1. 
    

    when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

    To solve this, and to avoid a dependency on the tokenizers package, you can use the following function to tokenize the data.

    NLP_tokenizer <- function(x) {
      unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
    }
    

    This uses the ngrams and words functions from the NLP package, which is loaded when you load the tm package. The 1:3 tells it to create n-grams of 1 to 3 words. So your control_list_ngram should look like this:

    control_list_ngram = list(tokenize = NLP_tokenizer,
                              removePunctuation = FALSE,
                              removeNumbers = FALSE, 
                              stopwords = stopwords("english"), 
                              tolower = T, 
                              stemming = T, 
                              weighting = function(x)
                                weightTf(x)
                              )
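
    To see what that tokenizer actually produces, you can run the same ngrams call on a short token vector yourself (a toy example, not your data):

    library(NLP)   # attached automatically when you load tm

    toks <- c("the", "quick", "brown", "fox")
    unlist(lapply(ngrams(toks, 1:3), paste, collapse = "_"), use.names = FALSE)
    # all unigrams, bigrams and trigrams of the token sequence, joined by "_"
    # (e.g. "the", "the_quick", "quick_brown_fox", ...)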
    

    Personally I would use the quanteda package for all of this work. But for now this should help you.
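
    If you do want to go the quanteda route, a rough equivalent of the whole pipeline (just a sketch; it assumes the same data frame with "doc_id" and "text" columns, and the object names are only illustrative) would be:

    library(quanteda)

    q_corpus <- corpus(data, docid_field = "doc_id", text_field = "text")
    q_tokens <- tokens(q_corpus)                               # word tokenization
    q_tokens <- tokens_tolower(q_tokens)                       # lowercasing
    q_tokens <- tokens_remove(q_tokens, stopwords("english"))  # stopword removal
    q_tokens <- tokens_wordstem(q_tokens)                      # stemming
    q_tokens <- tokens_ngrams(q_tokens, n = 1:3, concatenator = "_")
    dfm_ngram <- dfm(q_tokens)                                 # term-frequency document-feature matrix
    dim(dfm_ngram)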