I want to use R for text classification. I use DocumentTermMatrix to build the document-term matrix:
library(tm)
crude <- "japan korea usa uk albania azerbaijan"
corps <- Corpus(VectorSource(crude))
dtm <- DocumentTermMatrix(corps)
inspect(dtm)
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test <- DocumentTermMatrix(corps, control=list(dictionary = words))
inspect(test)
The first inspect(dtm) works as expected, with this result:
    Terms
Docs albania azerbaijan japan korea usa
   1       1          1     1     1   1
But the second inspect(test) shows this result:
    Terms
Docs argentina australia japan korea turkey uganda
   1         0         1     0     1      0      0
While the expected result is:
    Terms
Docs argentina australia japan korea turkey uganda
   1         0         0     1     1      0      0
Is this a bug, or am I using it the wrong way?
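Corpus() seems to have a bug when counting term frequencies against a dictionary: the counts end up under the wrong terms. As far as I can tell, in recent tm versions Corpus() returns a SimpleCorpus for a VectorSource, and that is where the misalignment shows up.
Use VCorpus() instead; it gives the expected result.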
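Here is a minimal sketch of that change, reusing the data from the question. The only difference is VCorpus() in place of Corpus(); the variable `m` and the as.matrix() step are my additions, there only to make the counts easy to read by term name.

```r
library(tm)

crude <- "japan korea usa uk albania azerbaijan"
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")

# VCorpus() builds a volatile in-memory corpus, avoiding the corpus type
# that Corpus() creates for a VectorSource.
corps <- VCorpus(VectorSource(crude))

# Restrict the matrix columns to the dictionary terms.
test <- DocumentTermMatrix(corps, control = list(dictionary = words))
inspect(test)

# Convert to a plain matrix so counts can be looked up by term name.
m <- as.matrix(test)
m[1, c("japan", "korea")]
```

If this behaves as described above, japan and korea should each show a count of 1, and the other four dictionary terms should show 0.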