python topic-modeling mallet

How to automatically generate one or two words to represent a topic?


Mallet generates topics with top keywords. The keywords are unique to each topic. Is there an automatic way to select one or several words from a topic's keywords to serve as the topic label? For example, 20 topics are generated from 500 articles, and each topic is a set of 20 words. One of the topics is:

topic id 12, weight 0.05879, (keywords) oil energy gas power water electricity nuclear industry sea climate price prices coal carbon emissions year fuel environmental green years

It seems I can have different interpretations of the topic. For example,

  1. energy problems from burning gas or oil or fuel
  2. generating water power to protect environment
  3. oil prices change because of climate change
  4. carbon emission causes environmental problems
  5. ...

A short label might be: energy, environment, oil, carbon emission, green energy...

Is there a way to generate only one or two words to represent this topic instead of subjectively and arbitrarily combining these words?

It seems the most important words are determined by term frequency in the keyword algorithm. Mallet generates unique words for each topic.

My question: is there a way to automatically select the one or two most representative words as the topic label?

I am new to topic modelling. Can you help me?

Thanks


Solution

  • There are methods for automatically labeling topics, but I personally find that they aren't reliable enough to avoid being deceptive. As you noticed, there are often quite a few ways to describe the semantic content that a topic has identified, and many topics will not easily resolve to a single keyword or phrase.

    In practice, automatically extracted topics often combine multiple related themes (the hydrocarbon industry and climate change here), or represent specific aspects of larger themes (e.g. there might be two topics with lots of words about education and classes, but one is just undergraduates and the other K-12). It's often difficult to recognize what a topic is really "about" without reading through documents that have large representation in that topic.

    In many cases there is a pretty obvious "tag" (like "oil" in this case), but if you imply to users that a topic represents a specific concept, you will almost certainly find cases where that implication is not really correct.
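
The suggestion above, reading the documents that have large representation in a topic before settling on a label, could be sketched roughly as follows. This is only an illustration: it assumes the per-document topic proportions have already been loaded into a plain list of lists (one row per document, one column per topic), and the function name `top_documents_for_topic`, the toy proportions, and the titles are all hypothetical rather than anything Mallet provides.

```python
from typing import List, Sequence, Tuple


def top_documents_for_topic(
    doc_topics: Sequence[Sequence[float]], topic_id: int, n: int = 3
) -> List[Tuple[int, float]]:
    """Return (document index, proportion) pairs for the n documents
    in which the given topic has the highest proportion."""
    scored = [(i, row[topic_id]) for i, row in enumerate(doc_topics)]
    # Sort by proportion, largest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical toy data: 4 documents x 3 topics (each row sums to 1).
doc_topics = [
    [0.10, 0.70, 0.20],
    [0.80, 0.05, 0.15],
    [0.25, 0.60, 0.15],
    [0.30, 0.30, 0.40],
]
# Hypothetical document titles, in the same order as the rows above.
titles = [
    "Offshore drilling and fuel prices",
    "School funding debate",
    "Coal plants and carbon targets",
    "Local election roundup",
]

# List the documents to read before choosing a label for topic 1.
for doc_index, proportion in top_documents_for_topic(doc_topics, topic_id=1, n=2):
    print(f"{proportion:.2f}  {titles[doc_index]}")
```

Skimming the documents listed this way is a manual check, not an automatic labeler; it only makes it easier to see whether a candidate tag such as "oil" actually covers what the topic captures.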