topic-modeling mallet

Mallet basic usage. First steps


I'm trying to use Mallet with literally no experience in topic modeling. My goal is to get N topics from the M documents I have right now, label every document with one or more topics (doc 1 = topic 1; doc 2 = topic 2 and possibly topic 3), and use these results to classify new documents in the future. I first tried BigARTM for this, but found nothing for classification in that program, only topic modeling. So, on to Mallet: I created a corpus.txt file with the following format:

Doc.num. \t(tab) Label (actually 1 everywhere) \t Text
1 1 some text of document to classify
2 1 another doc text
...

So far I have been able to get topics from this file after converting it to Mallet's feature sequence format with

bin/mallet import-file --input corpus.txt --output foo.mallet --keep-sequence

and then get topics from it

bin/mallet train-topics --input foo.mallet --output-state state.gz --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt

So my general question is: what should I use in Mallet (train a classifier?) to assign every existing document to the topics I found, and to save the result so I can apply it to future documents that I want to classify with these topics?

Thanks


Solution

  • What you're looking for is described as "inference" in Mallet topic models. Training a classifier is a separate package, aimed at directly learning relationships between words and a pre-existing set of classes.

    Here are directions for using inference on new documents:

    When you train a model with the train-topics command, add the --inferencer-filename [FILENAME] option. This option will create a topic inference tool based on the current, trained model and save it in a file. If you already have a trained model, for example from --output-state or --output-model, you can initialize from that state or model, run 0 iterations of sampling, and output an inferencer.
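Both routes described above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: foo.mallet and state.gz are the file names from the question, while inferencer.mallet, the MALLET variable, and the two function names are placeholders chosen for illustration. It also assumes the number of topics passed when restoring a state should match the original run.

```shell
# Path to the MALLET launcher; override MALLET if it lives elsewhere (assumption).
MALLET="${MALLET:-bin/mallet}"

# Route 1: train from scratch and save an inferencer next to the usual outputs.
# $1 = imported training file, $2 = number of topics
train_with_inferencer() {
  "$MALLET" train-topics \
    --input "$1" \
    --num-topics "$2" \
    --output-state state.gz \
    --output-topic-keys topic-keys.txt \
    --output-doc-topics doc-topics.txt \
    --inferencer-filename inferencer.mallet
}

# Route 2: restore a previously saved state, run 0 further sampling
# iterations, and write only the inferencer.
# $1 = imported training file, $2 = saved state, $3 = number of topics
inferencer_from_state() {
  "$MALLET" train-topics \
    --input "$1" \
    --input-state "$2" \
    --num-topics "$3" \
    --num-iterations 0 \
    --inferencer-filename inferencer.mallet
}
```

For example, train_with_inferencer foo.mallet 20 would train a 20-topic model, and inferencer_from_state foo.mallet state.gz 20 would reuse an earlier run instead of retraining.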

    Once you've created the inferencer file, use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.
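As a rough illustration of the inference step, the sketch below runs infer-topics and then picks out, for each document, every topic whose proportion reaches a threshold (which gives the "one or more topics per document" labeling asked about). The file names, the function names, and the MALLET variable are placeholders. The awk part assumes the doc-topics file is tab-separated with a document index, a document name, and then one proportion per topic in topic order; some MALLET versions write sorted topic/proportion pairs instead, so check your file before relying on it. It also assumes the new documents were already imported into a .mallet file compatible with the training data.

```shell
# Path to the MALLET launcher; override MALLET if it lives elsewhere (assumption).
MALLET="${MALLET:-bin/mallet}"

# Infer topic proportions for new documents.
# $1 = imported new documents, $2 = saved inferencer file
infer_new_docs() {
  "$MALLET" infer-topics \
    --input "$1" \
    --inferencer "$2" \
    --output-doc-topics new-doc-topics.txt
}

# Print "doc-name<TAB>comma-separated topic ids" for every topic whose
# proportion is at or above the threshold.
# $1 = doc-topics file (assumed dense, tab-separated), $2 = threshold
topics_above() {
  awk -F'\t' -v t="$2" '
    /^#/ { next }                      # skip the header line
    {
      out = ""
      for (i = 3; i <= NF; i++)        # fields 3..NF are topics 0..N-1
        if ($i + 0 >= t) out = out (out == "" ? "" : ",") (i - 3)
      print $2 "\t" out
    }' "$1"
}
```

A typical sequence would be infer_new_docs new.mallet inferencer.mallet followed by topics_above new-doc-topics.txt 0.2. The same topics_above call can be applied to the doc-topics.txt written during training to label the existing documents.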

    Note that you must make sure that the new data is compatible with your training data. Otherwise word ID 425 might mean a completely different word. This will make all topics look equally probable. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
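A minimal sketch of importing new documents so that they share the training vocabulary, assuming the new texts are in the same tab-separated layout as corpus.txt. The names new-docs.txt and new.mallet, the function name, and the MALLET variable are hypothetical; foo.mallet is the training file from the question.

```shell
# Path to the MALLET launcher; override MALLET if it lives elsewhere (assumption).
MALLET="${MALLET:-bin/mallet}"

# Import new documents using the pipe (tokenization and word-to-ID mapping)
# stored in the training file, so word IDs line up with the trained model.
# $1 = raw new documents, $2 = training .mallet file, $3 = output file
import_like_training() {
  "$MALLET" import-file \
    --input "$1" \
    --output "$3" \
    --use-pipe-from "$2"
}
```

For example, import_like_training new-docs.txt foo.mallet new.mallet produces a new.mallet file that can then be passed to infer-topics.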