python algorithm machine-learning nlp expectation-maximization

Baum Welch (EM Algorithm) likelihood (P(X)) is not monotonically converging

So I am sort of an amateur when comes to machine learning and I am trying to program the Baum Welch algorithm, which is a derivation of the EM algorithm for Hidden Markov Models. Inside my program I am testing for convergence using the probability of each observation sequence in the new model and then terminating once the new model is less than or equal to the old model. However, when I run the algorithm it seems to converge somewhat and gives results that are far better than random but when converging it goes down on the last iteration. Is this a sign of a bug or am I doing something wrong?

It seems to me that I should have been using the summation of the log of each observation's probability for the comparison instead since it seems like the function I am maximizing. However, the paper I read said to use the log of the sum of probabilities(which I am pretty sure is the same as the sum of the probabilities) of the observations(https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf).

I fixed this on another project where I implemented backpropogation with feed-forward neural nets by implementing a for loop with pre-set number of epochs instead of a while loop with a condition for the new iteration to be strictly greater than but I am wondering if this is a bad practice.

My code is at https://github.com/icantrell/Natural-Language-Processing inside the nlp.py file.

Any advice would be appreciated. Thank You.

Solution

For EM iterations, or any other iteration proved to be non-decreasing, you should be seeing increases until the size of increases becomes small compared with floating point error, at which time floating point errors violate the assumptions in the proof, and you may see not only a failure to increase, but a very small decrease - but this should only be very small.

One good way to check these sorts of probability based calculations is to create a small test problem where the right answer is glaringly obvious - so obvious that you can see whether the answers from the code under test are obviously correct at all.

It might be worth comparing the paper you reference with https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness. I think equations such as (11) and (12) are not intended for you to actually calculate, but as arguments to motivate and prove the final result. I think the equation corresponding to the traditional EM step, which you do calculate, is equation (15) which says that you change the parameters at each step to increase the expected log-likelihood, which is the expectation under the distribution of hidden states calculated according to the old parameters, which is the standard EM step. In fact, turning over I see this is stated explicitly at the top of P 8.