I'm trying to use Mallet with literally no experience in topic modeling. My goal is to get N topics from the M documents I have right now, label every document with one or more topics (doc 1 = topic 1; doc 2 = topic 2 and possibly topic 3), and then use those results to classify new documents in the future. I first tried bigartm for this, but found nothing for classification in that program, only topic modeling. So, on to Mallet: I created a corpus.txt file with the following format:
Doc. num. \t (tab) Label (actually 1 everywhere) \t Text
1 1 some text of document to classify
2 1 another doc text
...
So far I have been able to get topics from this file after converting it to Mallet's feature-sequence format with
bin/mallet import-file --input corpus.txt --output foo.mallet --keep-sequence
and then training topics on it with
bin/mallet train-topics --input foo.mallet --output-state state.gz --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt
So the general question now is: what should I use in Mallet (train a classifier?) to assign every existing document to the topics I found, and to save that result so I can apply it to future documents that I want to classify with these topics?
Thanks
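What you're looking for is called "inference" in Mallet's topic models. Classifier training is a separate package, aimed at directly learning relationships between words and a pre-existing set of classes. For the documents you trained on, the doc-topics.txt file you already write with --output-doc-topics holds each document's topic proportions; inference produces the same kind of output for documents the model has not seen.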
Here are directions for using inference on new documents:
When you train a model with the train-topics command, add the --inferencer-filename [FILENAME] option. This option creates a topic inference tool based on the current, trained model and saves it in a file.
If you already have a trained model, for example from --output-state or --output-model, you can initialize from that state or model (with --input-state or --input-model), run 0 iterations of sampling, and output an inferencer.
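The two routes above could be sketched roughly as follows. The file names (foo.mallet, state.gz, inferencer.mallet) follow the question or are placeholders, the function names are made up for illustration, and NUM_TOPICS is an assumption that has to match whatever the saved state was trained with.

```shell
# Path to the MALLET launcher; override MALLET if it lives elsewhere.
MALLET="${MALLET:-bin/mallet}"
# Placeholder topic count; for the second route it must match the saved state.
NUM_TOPICS="${NUM_TOPICS:-10}"

# Route 1: train from scratch and write an inferencer next to the usual outputs.
train_with_inferencer() {
  "$MALLET" train-topics \
    --input foo.mallet \
    --num-topics "$NUM_TOPICS" \
    --output-state state.gz \
    --output-doc-topics doc-topics.txt \
    --inferencer-filename inferencer.mallet
}

# Route 2: start from an existing Gibbs state, run 0 iterations,
# and only write out the inferencer.
inferencer_from_state() {
  "$MALLET" train-topics \
    --input foo.mallet \
    --input-state state.gz \
    --num-topics "$NUM_TOPICS" \
    --num-iterations 0 \
    --inferencer-filename inferencer.mallet
}
```

Calling train_with_inferencer (or inferencer_from_state if state.gz already exists) from the MALLET directory should leave an inferencer.mallet file behind.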
Once you've created the inferencer file, use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.
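As a rough sketch of what that might look like in practice: the first function runs inference on an already imported file, and the second picks out every topic whose proportion reaches a threshold, which is one simple way to give a document "one or more" topics. The names new.mallet, new-doc-topics.txt, infer_new_docs and topics_above are hypothetical. The awk part assumes a tab-separated, dense doc-topics layout (document index, document name, then one proportion per topic); some MALLET versions write sorted topic/proportion pairs instead, so check your own file first.

```shell
# Path to the MALLET launcher; override MALLET if it lives elsewhere.
MALLET="${MALLET:-bin/mallet}"

# Infer topic proportions for new documents with a saved inferencer.
infer_new_docs() {
  "$MALLET" infer-topics \
    --inferencer inferencer.mallet \
    --input new.mallet \
    --output-doc-topics new-doc-topics.txt
}

# Usage: topics_above FILE MIN
# Prints "name<TAB>topic<TAB>proportion" for every topic with proportion >= MIN.
# ASSUMPTION: dense, tab-separated layout (index, name, p0, p1, ...).
topics_above() {
  awk -F'\t' -v min="$2" '
    /^#/ { next }                       # skip a header line if present
    {
      for (i = 3; i <= NF; i++)         # columns 3.. hold topic 0, 1, ...
        if ($i + 0 >= min)
          printf "%s\t%d\t%s\n", $2, i - 3, $i
    }
  ' "$1"
}
```

For example, after infer_new_docs has run, topics_above new-doc-topics.txt 0.3 would list each document once per topic that makes up at least 30% of it. The same function should work on the doc-topics.txt from training if it uses the same layout.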
Note that you must make sure that the new data is compatible with your training data. Otherwise a word ID such as 425 might refer to a completely different word in the new file, and the inferred distributions become meaningless, typically with all topics looking about equally probable. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
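A minimal sketch of that import step, assuming the new documents sit in a file called new-docs.txt (a hypothetical name) in the same tab-separated format as corpus.txt, and that foo.mallet is the training file from the question:

```shell
# Path to the MALLET launcher; override MALLET if it lives elsewhere.
MALLET="${MALLET:-bin/mallet}"

# Import new documents through the same pipe (and so the same word-to-ID
# mapping) as the training data. new-docs.txt and new.mallet are placeholders.
import_new_docs() {
  "$MALLET" import-file \
    --input new-docs.txt \
    --output new.mallet \
    --use-pipe-from foo.mallet
}
```

The resulting new.mallet is the file you would then pass to infer-topics as its --input.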