I have labeled data (label and text), like this:
category1, "train message 1"
category1, "train message 2"
category1, "train message 3"
category2, "train message 4"
category2, "train message 5"
I am trying to train a document categorization model with the Java OpenNLP library:
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, customFactory);
When I train the model, I get strange messages:
**Indexing events using cutoff of 5**
**Computing event counts... done. 5441 events**
Dropped event animals*:[bow=live, bow=animals, ng=:live:animals]
Dropped event animals*:[bow=aquariums]
Dropped event animals*:[bow=aquatic, bow=plant, bow=fertilizers, ng=:aquatic:plant,ng=:aquatic:plant:fertilizers, ng=:plant:fertilizers]
Dropped event apparel*:[bow=activewear]
Dropped event apparel*:[bow=one, bow=pieces, ng=:one:pieces]
**What does Dropped event "category": [....] mean?**
I added a custom factory, and training itself works:
int minNgramSize = 2;
int maxNgramSize = 3;
DoccatFactory customFactory = new DoccatFactory(new FeatureGenerator[]{
    new BagOfWordsFeatureGenerator(),
    new NGramFeatureGenerator(minNgramSize, maxNgramSize)
});
DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, customFactory);
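
For context on the "cutoff of 5" line in the log, here is a minimal sketch of what a feature cutoff appears to do during indexing. It is a simplified illustration and not OpenNLP's actual code: the class `CutoffSketch` and the method `applyCutoff` are hypothetical names, and the sample features are made up. The assumption is that features seen fewer than `cutoff` times are discarded, and an event left with no features at all is reported as dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a feature cutoff; not part of OpenNLP.
public class CutoffSketch {

    /** Returns the events that still have at least one feature seen >= cutoff times. */
    public static List<List<String>> applyCutoff(List<List<String>> events, int cutoff) {
        // Count how often each feature occurs across all training events.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> features : events) {
            for (String feature : features) {
                counts.merge(feature, 1, Integer::sum);
            }
        }

        List<List<String>> kept = new ArrayList<>();
        for (List<String> features : events) {
            // Keep only the features that are frequent enough.
            List<String> frequent = new ArrayList<>();
            for (String feature : features) {
                if (counts.get(feature) >= cutoff) {
                    frequent.add(feature);
                }
            }
            if (frequent.isEmpty()) {
                // Nothing left to learn from, so the whole event is discarded.
                System.out.println("Dropped event " + features);
            } else {
                kept.add(frequent);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> events = Arrays.asList(
            Arrays.asList("bow=live", "bow=animals"),
            Arrays.asList("bow=live", "bow=fish"),
            Arrays.asList("bow=aquariums"));
        // With a cutoff of 2, only "bow=live" survives; the third event is dropped.
        System.out.println(applyCutoff(events, 2));
    }
}
```

If this reading is right, the dropped events in my log are the ones whose features all occur fewer than 5 times in the training data. Assuming `params` is a `TrainingParameters` instance, lowering the threshold with something like `params.put(TrainingParameters.CUTOFF_PARAM, "1")` before training should keep those rare features.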