Tags: lda, hyperparameter, mallet

Which hyperparameter optimization technique is used in Mallet for LDA?


I am wondering which technique is used to learn the Dirichlet priors in Mallet's LDA implementation.

Chapter 2 of Hanna Wallach's Ph.D. thesis gives a great overview and a valuable evaluation of existing and new techniques to learn the Dirichlet priors from the data.

Tom Minka initially provided his famous fixed-point iteration approach, but without any evaluation or recommendations.
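For reference, the fixed-point update from Minka's "Estimating a Dirichlet distribution" note, as I understand it (with n_{dk} the number of tokens in document d assigned to topic k, and n_d the length of document d), is:

\[
\alpha_k \;\leftarrow\; \alpha_k \,
\frac{\sum_{d}\left[\Psi(n_{dk} + \alpha_k) - \Psi(\alpha_k)\right]}
     {\sum_{d}\left[\Psi\!\left(n_{d} + \sum_{j}\alpha_j\right) - \Psi\!\left(\sum_{j}\alpha_j\right)\right]}
\]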

Furthermore, Jonathan Chuang compared several previously proposed methods, including the Newton-Raphson method.

LiangJie Hong says the following in his blog:

A typical approach is to utilize Monte-Carlo EM approach where E-step is approximated by Gibbs sampling while M-step is to perform a gradient-based optimization approach to optimize Dirichlet parameters. Such approach is implemented in Mallet package.

Mallet mentions Minka's fixed-point iterations, both with and without histograms.

However, the description of the method that is actually used simply states:

Learn Dirichlet parameters using frequency histograms

Could someone provide a reference that describes the technique that is actually used?


Solution

  • It uses the fixed-point iteration. The frequency-histogram method is just an efficient way to compute it; the two are algebraically equivalent ways of doing the exact same computation. The update consists of a sum over a large number of digamma functions. The digamma function is expensive to evaluate by itself, but the difference between two digamma functions whose arguments differ by an integer is relatively easy to compute, and even better, it "telescopes": the answer to Digamma(a + n + 1) - Digamma(a) is one operation away from the answer to Digamma(a + n) - Digamma(a). If you work through the histogram of counts from 1 to the maximum, adding in the number of times you saw a count of n at each step, the calculation becomes extremely fast. Initially, we were worried that hyperparameter optimization would take so long that no one would do it. With this trick it is so fast that it is insignificant compared to the Gibbs sampling.
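To illustrate the trick, here is a minimal Python sketch (not Mallet's actual code; the array names and the toy counts are made up) of one fixed-point update for a symmetric Dirichlet, computed two ways: directly with digamma calls, and via count histograms using the telescoping identity digamma(a + n) - digamma(a) = 1/a + 1/(a + 1) + ... + 1/(a + n - 1). The two results agree up to floating-point rounding.

    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(0)

    # Toy stand-in for counts collected during Gibbs sampling:
    # doc_topic_counts[d, k] = number of tokens in document d assigned to topic k.
    D, K = 1000, 20
    doc_topic_counts = rng.poisson(3.0, size=(D, K))
    doc_lengths = doc_topic_counts.sum(axis=1)

    def update_direct(alpha):
        # One symmetric fixed-point step, calling digamma once per document-topic cell.
        numerator = np.sum(digamma(doc_topic_counts + alpha) - digamma(alpha))
        denominator = K * np.sum(digamma(doc_lengths + K * alpha) - digamma(K * alpha))
        return alpha * numerator / denominator

    def update_histogram(alpha):
        # The same step using count histograms and the telescoping identity.
        topic_hist = np.bincount(doc_topic_counts.ravel())  # topic_hist[n] = how many cells have count n
        length_hist = np.bincount(doc_lengths)               # length_hist[n] = how many documents have length n

        numerator, diff = 0.0, 0.0
        for n in range(1, len(topic_hist)):
            diff += 1.0 / (alpha + n - 1)          # now diff == digamma(alpha + n) - digamma(alpha)
            numerator += topic_hist[n] * diff

        denominator, diff = 0.0, 0.0
        for n in range(1, len(length_hist)):
            diff += 1.0 / (K * alpha + n - 1)      # now diff == digamma(K*alpha + n) - digamma(K*alpha)
            denominator += length_hist[n] * diff

        return alpha * numerator / (K * denominator)

    alpha = 0.5
    print(update_direct(alpha))      # direct digamma version
    print(update_histogram(alpha))   # histogram/telescoping version, same value up to rounding

The direct version evaluates digamma on the order of D * K times per update, while the histogram version only loops up to the maximum observed count, which is why the optimization ends up being negligible next to the sampling itself.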