training-data opennlp categorization

"Dropped event" messages in OpenNLP: training data is dropped during training


I have labeled data (label and text), like this:

category1, "train message 1"
category1, "train message 2"
category1, "train message 3"
category2, "train message 4"
category2, "train message 5"

I am trying to train a document categorizer model with the OpenNLP Java library:

DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, customFactory);

While training the model, I get strange messages:

Indexing events using cutoff of 5
Computing event counts...  done. 5441 events
Dropped event animals*:[bow=live, bow=animals, ng=:live:animals]
Dropped event animals*:[bow=aquariums]
Dropped event animals*:[bow=aquatic, bow=plant, bow=fertilizers, ng=:aquatic:plant,ng=:aquatic:plant:fertilizers, ng=:plant:fertilizers]
Dropped event apparel*:[bow=activewear]
Dropped event apparel*:[bow=one, bow=pieces, ng=:one:pieces]

What does `Dropped event "category":[....]` mean?


Solution

  • I added a custom factory, and it works:

    // The classes used below are in the opennlp.tools.doccat package.
    // Generate bag-of-words features plus 2- and 3-word n-gram features.
    int minNgramSize = 2;
    int maxNgramSize = 3;
    DoccatFactory customFactory = new DoccatFactory(new FeatureGenerator[] {
        new BagOfWordsFeatureGenerator(),
        new NGramFeatureGenerator(minNgramSize, maxNgramSize)
    });
    DoccatModel model = DocumentCategorizerME.train("pt", sampleStream, params, customFactory);
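As for what the message itself means: the log line "Indexing events using cutoff of 5" suggests that features seen fewer than 5 times in the training data are discarded, and an event (one training sample) whose features are all discarded is reported as "Dropped event". The sketch below imitates that behavior in plain Java so the effect is easy to see. It is an illustration only, not OpenNLP's actual code: the class `CutoffSketch` and its methods are hypothetical names, and the "keep a feature if its count is at least the cutoff" rule is my assumption about how the cutoff is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a feature cutoff; not part of OpenNLP.
class CutoffSketch {

    // Count how often each feature occurs across all events.
    static Map<String, Integer> countFeatures(List<List<String>> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> features : events) {
            for (String feature : features) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keep only features seen at least `cutoff` times (assumed rule).
    // An event left with no features is reported and dropped.
    static List<List<String>> applyCutoff(List<List<String>> events, int cutoff) {
        Map<String, Integer> counts = countFeatures(events);
        List<List<String>> kept = new ArrayList<>();
        for (List<String> features : events) {
            List<String> surviving = new ArrayList<>();
            for (String feature : features) {
                if (counts.get(feature) >= cutoff) {
                    surviving.add(feature);
                }
            }
            if (surviving.isEmpty()) {
                System.out.println("Dropped event " + features);
            } else {
                kept.add(surviving);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> events = Arrays.asList(
            Arrays.asList("bow=live", "bow=animals"),
            Arrays.asList("bow=live", "bow=fish"),
            Arrays.asList("bow=aquariums"));

        // With cutoff 2, only "bow=live" is frequent enough, so the third event is dropped.
        System.out.println(applyCutoff(events, 2).size());
        // With cutoff 1, every feature survives and nothing is dropped.
        System.out.println(applyCutoff(events, 1).size());
    }
}
```

If the dropped samples matter, the usual remedies are more training data per category or a lower cutoff. In OpenNLP the cutoff is normally set on the `TrainingParameters` object passed to `train` (the `params` variable above) through the `TrainingParameters.CUTOFF_PARAM` key; check the Javadoc of your OpenNLP version for the exact setter signature.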