machine-learning, statistics, data-mining, inference, expectation-maximization

Iterated conditional modes (ICM) as an approximation to the E step in EM


I wanted to know what the mathematical justification is for using ICM as an approximation to the E step in an EM algorithm.

As I understand it, in the E step the idea is either to find a distribution equal to the posterior distribution of the latent variables, which guarantees that the likelihood increases, or to find the best possible distribution from some simpler family of distributions, which guarantees that a lower bound on the likelihood increases.
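For concreteness, the decomposition I have in mind is the standard one (writing $y$ for the latent variables, $x$ for the observations, and $\theta$ for the parameters):

$$\log p(x \mid \theta) \;=\; \underbrace{\mathbb{E}_{q(y)}\!\left[\log \frac{p(x, y \mid \theta)}{q(y)}\right]}_{\mathcal{L}(q,\theta)} \;+\; \mathrm{KL}\!\left(q(y)\,\middle\|\,p(y \mid x, \theta)\right),$$

so the bound $\mathcal{L}(q,\theta) \le \log p(x \mid \theta)$ is tight exactly when $q(y) = p(y \mid x, \theta)$, and restricting $q$ to a simpler family still guarantees that optimizing $\mathcal{L}$ over $q$ cannot decrease the bound.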

How does one mathematically justify the use of ICM in such an E step? Any references, derivations, or notes would be very helpful.


Solution

Consider a simple CRF that represents the likelihood of a labelling $y$ given an observation $x$, and assume the likelihood also depends on a parameter $\theta$. At inference time you know only $x$ and are trying to infer $y$. What you do is apply an EM-style algorithm in which the E step finds the labelling $y$ (i.e. $\arg\max_y P(y \mid x, \theta)$) and the M step finds the parameter $\theta$ (i.e. $\arg\max_\theta P(\theta \mid x, y)$).

The M step can be carried out with any standard optimization algorithm, because $\theta$ is generally not high-dimensional (at least not as high-dimensional as $y$). The E step is then simply inference over an MRF/CRF with no hidden variables, since $\theta$ is optimized separately in the M step, and ICM is one algorithm for performing that inference. If you want a reference, Murphy's book http://www.cs.ubc.ca/~murphyk/MLbook/ covers this material; I think Chapter 26 is quite relevant.
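To make the alternation concrete, here is a minimal sketch in Python (not taken from Murphy's book; the toy denoising model, the binary flip-noise likelihood, the fixed Ising coupling `beta`, and all variable names are illustrative assumptions). The E step runs ICM sweeps to approximate $\arg\max_y P(y \mid x, \theta)$, and the M step re-estimates a single noise parameter $\theta = \varepsilon$ in closed form from the current labelling:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy data: a binary image corrupted by flip noise (purely illustrative) ---
clean = np.ones((40, 40), dtype=int)
clean[10:30, 10:30] = -1                      # a dark square on a bright background
true_eps = 0.15                               # true flip probability
flips = rng.random(clean.shape) < true_eps
x = np.where(flips, -clean, clean)            # observed noisy image

beta = 1.0                                    # Ising coupling of the pairwise prior (assumed known)

def icm_e_step(y, x, eps, n_sweeps=5):
    """Approximate E step: coordinate-wise argmax of P(y | x, theta) via ICM sweeps."""
    half_log_odds = 0.5 * (np.log1p(-eps) - np.log(eps))   # weight of the data term
    H, W = y.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # sum over the 4-connected neighbours
                nb = 0
                if i > 0:     nb += y[i - 1, j]
                if i < H - 1: nb += y[i + 1, j]
                if j > 0:     nb += y[i, j - 1]
                if j < W - 1: nb += y[i, j + 1]
                # local field: prior coupling plus data term; argmax over y_ij in {-1, +1}
                field = beta * nb + half_log_odds * x[i, j]
                y[i, j] = 1 if field >= 0 else -1
    return y

def m_step(y, x):
    """M step: closed-form MLE of the flip probability given the current labelling."""
    eps = np.mean(y != x)
    return np.clip(eps, 1e-3, 0.5 - 1e-3)     # keep eps strictly inside (0, 0.5)

# --- Hard-EM loop: alternate ICM labelling (E) and noise-rate update (M) ---
y = x.copy()                                  # initialise the labels at the observations
eps = 0.3                                     # initial guess for theta
for it in range(10):
    y = icm_e_step(y, x, eps)
    eps = m_step(y, x)
    print(f"iter {it}: eps = {eps:.3f}, pixel error = {np.mean(y != clean):.3f}")
```

Each ICM update maximizes a full conditional of $y$, and the M step maximizes over $\varepsilon$, so the alternation is coordinate ascent on the joint objective $\log P(x, y \mid \theta)$ rather than on the marginal likelihood $\log P(x \mid \theta)$; this is exactly where the "approximation to the E step" caveat in the question comes from.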