python apache-spark-mllib naivebayes document-classification

Document classification in Spark MLlib


I want to classify documents as sports, entertainment, or politics. I have created a bag of words, which outputs something like:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something that naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that MLlib's naive Bayes can use?


Solution

  • For text classification, you need:

    This sample is pretty straightforward.
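
As a minimal sketch of one way to do this (not necessarily the sample referred to above), the bag-of-words pairs from the question can be hashed into sparse term-frequency vectors with MLlib's HashingTF, wrapped in LabeledPoint objects, and passed to NaiveBayes.train. The numeric label mapping, the helper names expand_bag and train_naive_bayes, and the num_features default are assumptions made for illustration; sc is assumed to be an existing SparkContext.

```python
# Hypothetical numeric encoding of the three categories from the question;
# MLlib's NaiveBayes expects class labels as numbers.
LABELS = {"sports": 0.0, "entertainment": 1.0, "politics": 2.0}


def expand_bag(bag):
    """Turn [(count, word), ...] into a flat term list, repeating each word `count` times."""
    return [word for count, word in bag for _ in range(count)]


def train_naive_bayes(sc, labeled_bags, num_features=10000):
    """Train MLlib naive Bayes from (category, bag) pairs.

    `labeled_bags` is a list such as [("sports", [(1, "saurashtra"), ...]), ...].
    pyspark is imported inside the function so that expand_bag stays usable
    without a Spark installation.
    """
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.feature import HashingTF
    from pyspark.mllib.regression import LabeledPoint

    # HashingTF maps each term to an index in a fixed-size sparse vector
    # and stores the term's frequency at that index.
    hashing_tf = HashingTF(num_features)

    # Build the RDD of LabeledPoint(label, features) that NaiveBayes.train expects.
    training = sc.parallelize(labeled_bags).map(
        lambda doc: LabeledPoint(LABELS[doc[0]], hashing_tf.transform(expand_bag(doc[1])))
    )

    # The second argument is the additive (Laplace) smoothing parameter.
    model = NaiveBayes.train(training, 1.0)
    return model, hashing_tf
```

With a trained model, a new document could be classified with model.predict(hashing_tf.transform(expand_bag(new_bag))), which returns one of the numeric labels. The same hashing_tf must be reused at prediction time so that terms land on the same vector indices as during training.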