python scikit-learn nlp topic-modeling

Latent Dirichlet Allocation with prior topic words


Context

I'm trying to extract topics from a set of texts using Latent Dirichlet Allocation (LDA) from scikit-learn's decomposition module. This works well, except for the quality of the topic words it finds.

In an article by Li et al. (2017), the authors describe using prior topic words as input for the LDA. They manually choose 4 topics and the main words belonging to each of these topics. For these words, they set the default value to a high number for the associated topic and to 0 for the other topics. All other words (those not manually selected for a topic) are given the same value (1) for all topics. This matrix of values is used as input for the LDA.
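To make the shape of that matrix concrete, here is a minimal sketch in NumPy. The function name `build_prior_matrix`, the `seed_words` mapping, and the high value of 1000 are assumptions made for illustration only; the article's actual numbers are not reproduced here.

```python
import numpy as np

def build_prior_matrix(n_topics, n_words, seed_words, high=1000.0):
    """Return an (n_topics, n_words) matrix of per-topic word values.

    seed_words maps a word index to the index of the topic it belongs to.
    """
    # Words that were not manually selected get the same value (1) for every topic.
    prior = np.ones((n_topics, n_words), dtype=np.float64)
    for word_index, topic_index in seed_words.items():
        # A selected word gets 0 for every topic ...
        prior[:, word_index] = 0.0
        # ... except its own topic, where it gets the high value.
        prior[topic_index, word_index] = high
    return prior

# Hypothetical toy setup: 4 topics, 6 words, words 0 and 4 are seed words.
prior = build_prior_matrix(n_topics=4, n_words=6, seed_words={0: 2, 4: 0})
print(prior)
```

Each column corresponds to one word and each row to one topic, which is the same layout as the `components_` attribute of a fitted `LatentDirichletAllocation` model.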

My question

How can I create a similar analysis with the LatentDirichletAllocation class from scikit-learn, using a customized matrix of default values (prior topic words) as input?

(I know there's a topic_word_prior parameter, but it only takes one float instead of a matrix with different 'default values'.)


Solution

  • With Anis' help, I created a subclass of the original class and overrode the method that sets the starting values matrix (_init_latent_vars). For each prior topic word you give as input, it multiplies that word's column of the components_ matrix by the word's topic values.

    This is the code:

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.utils import check_random_state
    # This private helper lives in a different module depending on the
    # scikit-learn version
    try:
        from sklearn.decomposition._online_lda_fast import _dirichlet_expectation_2d
    except ImportError:
        from sklearn.decomposition._online_lda import _dirichlet_expectation_2d
    
    # List with prior topic words as tuples
    # (word index, [topic values]), with one value per topic
    prior_topic_words = []
    
    # Example (word at index 3000 belongs to topic with index 0;
    # five values are given, so this assumes n_components=5)
    prior_topic_words.append(
        (3000, [np.finfo(np.float64).max / 4, 0., 0., 0., 0.])
    )
    
    # Custom subclass for PTW-guided LDA
    class PTWGuidedLatentDirichletAllocation(LatentDirichletAllocation):
    
        def __init__(self, n_components=10, doc_topic_prior=None,
                     topic_word_prior=None, learning_method='batch',
                     learning_decay=0.7, learning_offset=10.0, max_iter=10,
                     batch_size=128, evaluate_every=-1,
                     total_samples=1000000.0, perp_tol=0.1,
                     mean_change_tol=0.001, max_doc_update_iter=100,
                     n_jobs=None, verbose=0, random_state=None, ptws=None):
            # Forward everything by keyword; n_topics (deprecated in favour
            # of n_components) is no longer forwarded.
            super().__init__(
                n_components=n_components, doc_topic_prior=doc_topic_prior,
                topic_word_prior=topic_word_prior,
                learning_method=learning_method,
                learning_decay=learning_decay,
                learning_offset=learning_offset, max_iter=max_iter,
                batch_size=batch_size, evaluate_every=evaluate_every,
                total_samples=total_samples, perp_tol=perp_tol,
                mean_change_tol=mean_change_tol,
                max_doc_update_iter=max_doc_update_iter, n_jobs=n_jobs,
                verbose=verbose, random_state=random_state)
            self.ptws = ptws
    
        # Newer scikit-learn releases also pass a dtype to this method
        def _init_latent_vars(self, n_features, dtype=np.float64):
            """Initialize latent variables."""
    
            self.random_state_ = check_random_state(self.random_state)
            self.n_batch_iter_ = 1
            self.n_iter_ = 0
    
            if self.doc_topic_prior is None:
                self.doc_topic_prior_ = 1. / self.n_components
            else:
                self.doc_topic_prior_ = self.doc_topic_prior
    
            if self.topic_word_prior is None:
                self.topic_word_prior_ = 1. / self.n_components
            else:
                self.topic_word_prior_ = self.topic_word_prior
    
            init_gamma = 100.
            init_var = 1. / init_gamma
            # In the literature, this is called `lambda`
            self.components_ = self.random_state_.gamma(
                init_gamma, init_var, (self.n_components, n_features)
            ).astype(dtype, copy=False)
    
            # Transform topic values in matrix for prior topic words
            if self.ptws is not None:
                for ptw in self.ptws:
                    word_index = ptw[0]
                    word_topic_values = ptw[1]
                    # Scale the word's column, one factor per topic
                    self.components_[:, word_index] *= word_topic_values
    
            # In the literature, this is `exp(E[log(beta)])`
            self.exp_dirichlet_component_ = np.exp(
                _dirichlet_expectation_2d(self.components_))
    

    Instantiation is the same as for the original LatentDirichletAllocation class, but now you can provide prior topic words using the ptws parameter.
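Since `ptws` expects word indices rather than the words themselves, a small helper can translate seed words into that format. The sketch below assumes a `vocabulary` dict mapping each word to its column index (the same shape as `CountVectorizer.vocabulary_`). The helper name `make_ptws`, the `seed_topics` argument, and the `boost` value are hypothetical and not part of the answer above.

```python
def make_ptws(vocabulary, seed_topics, n_topics, boost=1e6):
    """Build a list of (word index, [topic values]) tuples.

    vocabulary  -- dict mapping word -> column index
    seed_topics -- dict mapping topic index -> list of seed words
    """
    ptws = []
    for topic_index, words in seed_topics.items():
        for word in words:
            if word not in vocabulary:
                continue  # skip seed words that are not in the vocabulary
            # Zero for every topic except the one the word belongs to.
            values = [0.0] * n_topics
            values[topic_index] = boost
            ptws.append((vocabulary[word], values))
    return ptws

# Toy vocabulary standing in for a fitted vectorizer's vocabulary_.
vocabulary = {"goal": 0, "match": 1, "election": 2, "vote": 3}
seed_topics = {0: ["goal", "match"], 1: ["vote", "parliament"]}
ptws = make_ptws(vocabulary, seed_topics, n_topics=2)
print(ptws)
```

The resulting list can then be passed at construction time, for example `PTWGuidedLatentDirichletAllocation(n_components=2, ptws=ptws)`, with `n_components` matching the length of each list of topic values.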