r tm corpus term-document-matrix

Use DocumentTermMatrix in R with 'dictionary' parameter


I want to use R for text classification. I use DocumentTermMatrix to build the matrix of word counts:

library(tm)
crude <- "japan korea usa uk albania azerbaijan"
corps <- Corpus(VectorSource(crude))
dtm <- DocumentTermMatrix(corps)
inspect(dtm)

words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test <- DocumentTermMatrix(corps, control=list(dictionary = words))
inspect(test)

The first call, inspect(dtm), works as expected and gives this result:

    Terms
Docs albania azerbaijan japan korea usa
   1       1          1     1     1   1

But the second call, inspect(test), shows this result:

    Terms
Docs argentina australia japan korea turkey uganda
   1         0         1     0     1      0      0

While the expected result is:

    Terms
Docs argentina australia japan korea turkey uganda
   1         0         0     1     1      0      0

Is this a bug, or am I using it the wrong way?


Solution

  • Corpus() seems to have a bug when indexing word frequencies against a dictionary.

    Use VCorpus() instead; this will give you the expected result.
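
As a minimal sketch of the workaround above, here is the code from the question with only the corpus constructor swapped to VCorpus(). The variable name counts and the as.matrix() lookup by term name are additions for illustration; they are not part of the original question or answer.

```r
library(tm)

crude <- "japan korea usa uk albania azerbaijan"
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")

# Build the corpus with VCorpus() instead of Corpus(), as suggested above.
corps <- VCorpus(VectorSource(crude))

# Restrict the matrix to the terms listed in the dictionary.
test <- DocumentTermMatrix(corps, control = list(dictionary = words))
inspect(test)

# Look counts up by term name so the check does not depend on column order.
counts <- as.matrix(test)
print(counts[1, c("japan", "korea", "australia")])
```

If the workaround behaves as described, japan and korea should each show a count of 1 and the other dictionary terms a count of 0. If your installed version of tm already gives the same result with Corpus(), the issue may have been fixed there.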