I am working on a Persian text classification project. Persian text is very similar to Arabic text. When I use Tokenize, no words appear in the WordList view, and the ExampleSet view shows the image below:
I need to classify Persian text into categories, but I don't know how.
I followed these steps:
1- Read the dataset with the Read Excel operator. It has 2 columns => col1: Persian text, col2: Category
2- Use the Set Role operator to mark the Category column as the label
3- Use the Process Documents from Data operator, containing Tokenize (changing the mode makes no difference) and Filter Tokens (min: 5, max: 25) inside it
4- Use the Cross Validation operator to train an SVM or Naive Bayes model, and measure performance in the testing part
The process runs without errors and the performance is not bad (e.g. accuracy is 50%), but I think my approach is wrong.
Any help would be appreciated.
First, make sure your text data is UTF-8 encoded. If you use Filter Tokens (by Length), a minimum of 5 is too high; try 2, or at least lower it to 3. I also recommend using the Filter Stopwords (Dictionary) operator; the dictionary file should contain Persian stopwords, one per line. Hope this helps.
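
To see why the minimum length matters, here is a rough sketch in plain Python (not RapidMiner, and not what the operators do internally) that mimics tokenizing, length filtering and stopword filtering. The `tokenize` function, the sample sentence and the tiny `PERSIAN_STOPWORDS` set are all made up for illustration; a real stopword dictionary would be much longer. The regex is a simplification and splits on anything that is not a letter, including the zero-width non-joiner used inside some Persian words.

```python
import re

# Hypothetical, very small stopword list for illustration only.
# A real dictionary would be loaded from a UTF-8 file, one word per line.
PERSIAN_STOPWORDS = {"و", "در", "به", "از", "که", "این", "را", "با", "است"}

# In Python 3, \w is Unicode-aware for str patterns, so Persian letters match.
# [^\W\d_]+ means "one or more letters" (word characters minus digits and underscore).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text, min_len=2, max_len=25, stopwords=PERSIAN_STOPWORDS):
    """Split text into letter-only tokens, then drop stopwords and
    tokens whose length is outside [min_len, max_len]."""
    tokens = TOKEN_RE.findall(text)
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and t not in stopwords
    ]


# Made-up sample sentence; every word in it has at most 4 letters.
sample = "این کتاب در مورد علم داده است"

tokens_min2 = tokenize(sample, min_len=2)  # short content words survive
tokens_min5 = tokenize(sample, min_len=5)  # every word is filtered out

print(tokens_min2)
print(tokens_min5)
```

With a minimum of 5 nothing is left of this sentence, which is the kind of thing that can leave the word list empty or very sparse; with a minimum of 2 the short content words are kept and only the stopwords go.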