Tags: r, nlp, classification, text-mining, document-classification

Decision Trees For Document Classification


Is it possible to use decision trees for document classification, and if so, how should the data be represented? I am familiar with the R package party for decision trees.


Solution

  • One way is to build a large matrix where each row is a document and each column is a word. The value in each cell is the number of times that word appears in that document.

    Then, if you are dealing with the "supervised learning" case, you should add another column holding each document's class label. From there you can use a function like "rpart" (from the rpart package) to create your classification tree. You pass rpart a formula, in a similar fashion as you would for a linear model (lm).

    If you want, you could also try to first group your words into "groups of words", and then have each column correspond to a different group, with a number indicating how many words in the document belong to that group. For that I would have a look at the "tm" package. (If you end up doing something with that, please consider posting about it here, so we could learn from it.)
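
To make the representation described above concrete, here is a minimal sketch in R. The toy documents, the "spam"/"ham" labels, the "w_" column prefix, and the control settings (minsplit = 2, cp = 0, xval = 0, chosen only so that a tree can be grown on six rows) are all assumptions made for illustration, not part of the original answer. The matrix is built with base R to keep the example self-contained; for a real corpus, the tm package's DocumentTermMatrix would likely be the more practical route.

```r
library(rpart)

# Toy corpus (made up for illustration): six short documents with known classes.
docs <- c(
  "cheap pills buy now",
  "buy cheap watches now",
  "limited offer buy pills",
  "meeting agenda for monday",
  "project meeting notes attached",
  "agenda and notes for the project"
)
labels <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"))

# Tokenize on whitespace and collect the vocabulary.
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per word,
# each cell holding how often that word occurs in that document.
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
# Prefix the column names so that words such as "for" stay valid R names.
colnames(dtm) <- paste0("w_", vocab)

# Supervised case: add the class label as one more column.
train <- data.frame(dtm, label = labels)

# Fit a classification tree using the formula interface, as with lm().
# The control values are relaxed only because this toy data set is tiny.
fit <- rpart(
  label ~ ., data = train, method = "class",
  control = rpart.control(minsplit = 2, cp = 0, xval = 0)
)

# Predicted class for each training document.
pred <- predict(fit, train, type = "class")
```

Calling print(fit) shows which word columns the tree splits on. The same data frame could also be passed to ctree from the party package mentioned in the question, since it uses the same formula interface.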