I want to classify documents as sports, entertainment, or politics. I have created a bag of words which outputs something like:
(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')
I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something that Naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that MLlib's Naive Bayes can use?
For text classification, you need labeled document vectors, i.e. each document's feature vector paired with its class label:
doc_vec1 -> label1
doc_vec2 -> label2
...
This sample is pretty straightforward.
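To make the labeling step concrete, here is a minimal sketch in plain Python (no Spark needed to run it) that turns `(count, word)` pairs like the ones in the question into label -> sparse vector pairs. The `docs` list, the extra words `cricket` and `election`, the `label_ids` mapping and the `to_sparse` helper are all made up for illustration; they are not part of the question or of MLlib.

```python
# Hypothetical input: one entry per document, holding its category and the
# (count, word) pairs produced by the bag-of-words step.
docs = [
    ("sports", [(2, "cricket"), (1, "saurashtra")]),
    ("politics", [(1, "election"), (1, "satyendra")]),
    ("sports", [(1, "cricket"), (1, "saumyajit")]),
]

# Classifiers expect numeric labels, so map each category to a float (assumed mapping).
label_ids = {"sports": 0.0, "entertainment": 1.0, "politics": 2.0}

# Build a vocabulary: every distinct word gets a fixed column index.
vocab = {}
for _, pairs in docs:
    for _, word in pairs:
        vocab.setdefault(word, len(vocab))


def to_sparse(pairs, vocab):
    """Return {column index: count} for one document's (count, word) pairs."""
    vec = {}
    for count, word in pairs:
        idx = vocab[word]
        vec[idx] = vec.get(idx, 0.0) + float(count)
    return vec


# doc_vec -> label, expressed as (label, sparse vector) tuples.
labeled = [(label_ids[cat], to_sparse(pairs, vocab)) for cat, pairs in docs]

for label, vec in labeled:
    print(label, vec)
```

If PySpark is available, each tuple could then probably be wrapped as `LabeledPoint(label, SparseVector(len(vocab), vec))` (from `pyspark.mllib.regression` and `pyspark.mllib.linalg`), and an RDD of those points passed to `NaiveBayes.train`. For a large corpus, building the vocabulary on the driver like this may not scale, so hashing the terms instead (e.g. with `HashingTF`) is the usual alternative.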