I'm trying to use LDA() from the topicmodels package on quite a large data set. After trying everything to fix the errors "In nr * nc : NAs produced by integer overflow" and "Each row of the input matrix needs to contain at least one non-zero entry", I ended up with this error:
library(RTextTools)    # create_matrix()
library(tm)            # removeSparseTerms(), weightTf
library(slam)          # rollup()
library(topicmodels)   # LDA()

ask <- read.csv('askreddit201508.csv', stringsAsFactors = FALSE)
myDtm <- create_matrix(as.vector(ask$title), language = "english", removeNumbers = TRUE,
                       stemWords = TRUE, weighting = weightTf)
myDtm2 <- removeSparseTerms(myDtm, 0.99999)
myDtm2 <- rollup(myDtm2, 2, na.rm=TRUE, FUN = sum)
rowTotals <- apply(myDtm2 , 1, sum)
myDtm2 <- myDtm2[rowTotals> 0, ]
LDA2 <- LDA(myDtm2,100)
Error in LDA(myDtm2, 100) :
The DocumentTermMatrix needs to have a term frequency weighting
Part of the problem is that you are weighting the document-term matrix by tf-idf, but LDA() requires raw term counts. In addition, this method of removing sparse terms seems to leave some documents with all of their terms removed.
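If you want to stay with the tm/topicmodels route, here is a minimal sketch of the two checks implied above, assuming myDtm2 is still a DocumentTermMatrix; slam::row_sums() is used instead of apply() to avoid the integer-overflow warning you saw:

# check how the matrix is weighted: LDA() needs "term frequency", not tf-idf
attr(myDtm2, "weighting")
# drop documents that ended up with zero terms after trimming
myDtm2 <- myDtm2[slam::row_sums(myDtm2) > 0, ]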
That said, it's easier to get from your text to topic models using the quanteda package. Here's how:
require(quanteda)
myCorpus <- corpus(textfile("http://homepage.stat.uiowa.edu/~thanhtran/askreddit201508.csv",
textField = "title"))
myDfm <- dfm(myCorpus, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 160,707 documents
## ... indexing features: 39,505 feature types
## ... stemming features (English), trimmed 12563 feature variants
## ... created a 160707 x 26942 sparse dfm
## ... complete.
# remove infrequent terms: see http://stats.stackexchange.com/questions/160539/is-this-interpretation-of-sparsity-accurate/160599#160599
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.99999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
## Features occurring in fewer than 1.60707 documents: 12579
nfeature(myDfm2)
## [1] 14363
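One optional extra guard, not part of the original answer: trimming can also leave some documents with no remaining features, which is what triggers the "at least one non-zero entry" error. A dfm behaves like a sparse matrix, so logical row indexing should work to drop empty documents before fitting:

myDfm2 <- myDfm2[rowSums(myDfm2) > 0, ]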
# fit the LDA model
require(topicmodels)
LDA2 <- LDA(quantedaformat2dtm(myDfm2), 100)
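As a quick sanity check of the fitted model, the topicmodels accessors terms() and topics() show the top terms per topic and the most likely topic per document (a usage example, not part of the original answer):

terms(LDA2, 10)      # top 10 terms for each of the 100 topics
topics(LDA2)[1:5]    # most likely topic for the first five documents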