r tm lda

Text Analysis Using LDA and tm in R

Hey guys, I'm having a bit of trouble conducting LDA: once I get ready to run the analysis, I get errors. I'll do my best to walk through what I am doing. Unfortunately I can't provide the data because it is proprietary.

dataset <- read.csv("proprietarydata.csv")

First I do a little bit of cleaning. dataset$text and post are both of class character.

dataset$text <- as.character(dataset$text)
post <- gsub("[^[:print:]]", " ", dataset$Post.Content)
post <- gsub("[^[:alnum:]]", " ", post)

post ends up looking like this:

[1] "here is a string"
[2] "here is another string"
etc....

Then I created the following function, which does more cleaning:

createdtm <- function(x) {
  myCorpus <- Corpus(VectorSource(x))
  myCorpus <- tm_map(myCorpus, PlainTextDocument)
  docs <- tm_map(myCorpus, tolower)
  docs <- tm_map(docs, removeWords, stopwords(kind = "SMART"))
  docs <- tm_map(docs, removeWords, c("the", " the", "will", "can", "regards", "need", "thanks", "please", "http"))
  docs <- tm_map(docs, stripWhitespace)
  docs <- tm_map(docs, PlainTextDocument)
  return(docs)
}

predtm <- createdtm(post)

This ends up returning a corpus that gives me something like this for every document:

[[1]]
<<PlainTextDocument (metadata: 7)>>
Here text string


[[2]]
<<PlainTextDocument (metadata: 7)>>
Here another string

Then I get ready for LDA by creating a DocumentTermMatrix:

dtm <- DocumentTermMatrix(predtm)
inspect(dtm)


<<DocumentTermMatrix (documents: 14640, terms: 39972)>>
Non-/sparse entries: 381476/584808604
Sparsity           : 100%
Maximal term length: 86
Weighting          : term frequency (tf)

Docs           truclientrre truddy trudi trudy true truebegin truecontrol
              Terms
Docs           truecrypt truecryptas trueimage truely truethis trulibraryref
              Terms
Docs           trumored truncate truncated truncatememory truncates
              Terms
Docs           truncatetableinautonomoustrx truncating trunk trunkhyper
              Terms
Docs           trunking trunkread trunks trunkswitch truss trust trustashtml
              Terms
Docs           trusted trustedbat trustedclient trustedclients
              Terms
Docs           trustedclientsjks trustedclientspwd trustedpublisher
              Terms
Docs           trustedreviews trustedsignon trusting trustiv trustlearn
              Terms
Docs           trustmanager trustpoint trusts truststorefile truststorepass
              Terms
Docs           trusty truth truthfully truths tryd tryed tryig tryin tryng

This looks really odd to me, but this is how I have always done it. So I move forward with this and do the following:

run.lda <- LDA(dtm,4)

This returns my first error:

  Error in LDA(dtm, 4) : 
  Each row of the input matrix needs to contain at least one non-zero entry

After researching this error I checked out the post "Remove empty documents from DocumentTermMatrix in R topicmodels?". I assumed I had everything under control and got excited, so I followed the steps in the link, but then:

This runs:

rowTotals <- apply(dtm , 1, sum)

This doesn't:

dtm.new   <- dtm[rowTotals> 0]

It returns:

  Error in `[.simple_triplet_matrix`(dtm, rowTotals > 0) : 
  Logical vector subscripting disabled for this object.

I know I might get heat because some of you might say this isn't a reproducible example. Please feel free to ask anything about this problem. It's the best I can do.

Solution

Here's what an appropriate minimal reproducible example should look like:

library(tm)
library(topicmodels)
raw <- c("hello","","goodbye")
tm <- Corpus(VectorSource(raw))

dtm <- DocumentTermMatrix(tm)

LDA(dtm,4)

# Error in LDA(dtm, 4) : 
#   Each row of the input matrix needs to contain at least one non-zero entry

Note that the proper way to subset a matrix is by specifying [row,col], not just [index].
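
To see the difference on something smaller, here is a sketch with an ordinary base R matrix. The matrix `m` and the vector `keep` are made up for illustration and are not part of the question's data. A single logical index treats the matrix as one long vector, while the [row,col] form with an empty column slot keeps whole rows:

```r
# Hypothetical 3 x 2 matrix; the second row is all zeros,
# like an empty document in a document-term matrix.
m <- matrix(c(1, 0, 2,
              3, 0, 4), nrow = 3)
keep <- rowSums(m) > 0   # TRUE FALSE TRUE

# One index: m is treated as a plain vector and `keep` is recycled
# over all six cells, so the result is no longer a matrix.
as_vector <- m[keep]

# Row and column slots: the empty column slot means "all columns",
# so whole rows are kept and the result is still a matrix.
as_rows <- m[keep, , drop = FALSE]

print(as_vector)
print(dim(as_rows))
```

A DocumentTermMatrix does not fall back to vector indexing the way a base matrix does, which is why the one-index form is rejected with the "Logical vector subscripting disabled" error quoted in the question. The row-and-column form applies to it in the same way: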

rowTotals <- apply(dtm , 1, sum)
dtm <- dtm[rowTotals>0,]
LDA(dtm, 4)

#A LDA_VEM topic model with 4 topics.
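
On a matrix as large as the one in the question (14640 documents by 39972 terms), `apply()` may, as far as I know, first convert the sparse matrix into a dense one, which can be slow or run out of memory. A possible alternative, sketched here on the assumption that the `slam` package (which `tm` depends on) is installed, is `slam::row_sums()`, which works on the sparse representation directly. The names `corpus`, `keep`, `dtm_nonempty` and `corpus_nonempty` are only illustrative:

```r
library(tm)
library(slam)   # assumed installed; tm relies on it for sparse matrices

raw <- c("hello", "", "goodbye")
corpus <- Corpus(VectorSource(raw))
dtm <- DocumentTermMatrix(corpus)

# Row totals computed on the sparse matrix, without converting it
# to a dense matrix first.
keep <- slam::row_sums(dtm) > 0

# Drop the empty documents from the matrix; note the comma.
dtm_nonempty <- dtm[keep, ]

# Drop the same documents from the corpus so the two stay aligned
# (useful if topics are later matched back to the original text).
corpus_nonempty <- corpus[which(keep)]

print(dim(dtm_nonempty))
```

`dtm_nonempty` can then be passed to `LDA()` in the same way as the subsetted `dtm` above.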

Please take the time to create reproducible examples. Often in doing so you discover your own error and can easily fix it. At the very least, it will help others see the problem more clearly and eliminate unnecessary info.