Tags: r, data.table, corpus, term-document-matrix, qdap

More efficient means of creating a corpus and DTM with 4M rows


My file has over 4M rows, and I need a more efficient way of converting my data to a corpus and document-term matrix so that I can pass it to a Bayesian classifier.

Consider the following code:

library(tm)

GetCorpus <- function(textVector)
{
  doc.corpus <- Corpus(VectorSource(textVector))
  doc.corpus <- tm_map(doc.corpus, tolower)
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
  doc.corpus <- tm_map(doc.corpus, stemDocument, "english")
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
  return(doc.corpus)
}

data <- data.frame(
  c("Let the big dogs hunt","No holds barred","My child is an honor student"), stringsAsFactors = F)

corp <- GetCorpus(data[,1])

inspect(corp)

dtm <- DocumentTermMatrix(corp)

inspect(dtm)

The output:

> inspect(corp)
<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
let big dogs hunt

[[2]]
<<PlainTextDocument (metadata: 7)>>
 holds bar

[[3]]
<<PlainTextDocument (metadata: 7)>>
 child honor stud
> inspect(dtm)
<<DocumentTermMatrix (documents: 3, terms: 9)>>
Non-/sparse entries: 9/18
Sparsity           : 67%
Maximal term length: 5
Weighting          : term frequency (tf)

              Terms
Docs           bar big child dogs holds honor hunt let stud
  character(0)   0   1     0    1     0     0    1   1    0
  character(0)   1   0     0    0     1     0    0   0    0
  character(0)   0   0     1    0     0     1    0   0    1

My question is: what can I use to create a corpus and DTM faster? This approach becomes extremely slow once I go past about 300k rows.

I have heard that I could use data.table but I am not sure how.
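My best guess is that it would only help with reading the file in quickly; a rough sketch of that part (the file and column names are made up):

library(data.table)

## hypothetical file/column names -- just a guess at where data.table fits in
dt <- fread("posts.csv")       # fast read of the 4M-row file
corp <- GetCorpus(dt$text)     # the corpus/DTM step is still the slow part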

I have also looked at the qdap package, but it gives an error when I try to load it, and I'm not sure it would help with speed anyway.

Ref. http://cran.r-project.org/web/packages/qdap/qdap.pdf


Solution

  • I think you may want to consider a more regex-focused solution. These are some of the problems I'm wrestling with as a developer myself. I'm currently looking hard at the stringi package for development, as it has consistently named functions that are wicked fast for string manipulation.

    In this response I'm using any tool I know of that is faster than the more convenient methods tm gives us (and certainly much faster than qdap). I haven't even explored parallel processing or data.table/dplyr here; instead I focus on string manipulation with stringi, keeping the data in a regular matrix, and converting it with packages built to handle that format. I take your example and multiply it 100,000x. Even with stemming, this takes about 17 seconds on my machine.

    data <- data.frame(
        text=c("Let the big dogs hunt",
            "No holds barred",
            "My child is an honor student"
        ), stringsAsFactors = F)
    
    ## replicate the 3 rows 100,000x (300k documents); skip this step for a minimal working example
    data <- data[rep(1:nrow(data), 100000), , drop=FALSE]
    
    library(stringi)
    library(SnowballC)
    ## lower-case, split into words, then stem each token
    ## (in old stringi versions stri_extract_all_words was named 'stri_extract_words')
    out <- stri_extract_all_words(stri_trans_tolower(data[[1]]))
    out <- lapply(out, SnowballC::wordStem, language = "english")
    names(out) <- paste0("doc", 1:length(out))
    
    ## build a dense term x document count matrix:
    ## one row per unique token, one column per document
    lev <- sort(unique(unlist(out)))
    dat <- do.call(cbind, lapply(out, function(x, lev) {
        tabulate(factor(x, levels = lev, ordered = TRUE), nbins = length(lev))
    }, lev = lev))
    rownames(dat) <- lev
    
    ## drop English stopwords (rows of the term matrix)
    library(tm)
    dat <- dat[!rownames(dat) %in% tm::stopwords("english"), ]
    
    ## convert the dense counts to a sparse simple triplet matrix
    library(slam)
    dat2 <- slam::as.simple_triplet_matrix(dat)
    
    ## wrap as a tm TermDocumentMatrix (terms are already the rows)
    tdm <- tm::as.TermDocumentMatrix(dat2, weighting = weightTf)
    tdm
    
    ## ...or transpose first so documents are the rows of a DocumentTermMatrix
    dtm <- tm::as.DocumentTermMatrix(t(dat2), weighting = weightTf)
    dtm
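    Since the end goal is a Bayesian classifier, one more note on the hand-off: most naive Bayes implementations expect an ordinary matrix (or data frame) plus a label vector rather than a tm DTM. A rough sketch, assuming the e1071 package and a made-up label vector (neither is part of the pipeline above; also note e1071's naiveBayes treats numeric columns as Gaussian, so a multinomial implementation may suit raw counts better):

    library(e1071)

    ## hypothetical labels -- replace with your real classes
    labels <- factor(rep(c("a", "b", "c"), length.out = nrow(dtm)))

    ## densify the DTM for naiveBayes(); fine for this toy vocabulary,
    ## but at 4M rows you would want a sparse-aware classifier instead
    m <- as.matrix(dtm)
    fit <- naiveBayes(m, labels)
    pred <- predict(fit, m[1:5, , drop = FALSE])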