machine-learning, hidden-markov-models, anomaly-detection

How to solve basic HMM problems with hmmlearn


There are three fundamental HMM problems:

Problem 1 (Likelihood): Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O|λ).

Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q.

Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

I'm interested in Problems 1 and 3. In general, the first problem can be solved with the Forward Algorithm and the third with the Baum-Welch Algorithm. Am I right that I should use the score(X, lengths) and fit(X, lengths) methods from hmmlearn for solving the first and third problems, respectively? (The documentation does not say that score uses the Forward Algorithm.)
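For reference, here is a minimal sketch of the Forward Algorithm in log space, written in plain NumPy (this is not hmmlearn's internal code; the toy matrices are made up for illustration):

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B, obs):
    """Forward Algorithm in log space: returns log P(O | lambda)."""
    # log_alpha[i] = log P(o_1..o_t, q_t = i) for the current step t
    log_alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # logsumexp over previous states i for each current state j
        log_alpha = log_B[:, o] + np.logaddexp.reduce(
            log_alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(log_alpha)

# Toy 2-state, 2-symbol model (illustrative values only)
pi = np.array([0.6, 0.4])                    # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])       # transition matrix
B = np.array([[0.5, 0.5], [0.1, 0.9]])       # emission matrix
obs = [0, 1, 1, 0]

log_likelihood = forward_log_likelihood(np.log(pi), np.log(A), np.log(B), obs)
```

Exponentiating the result gives P(O|λ), which for a short sequence like this can be checked against brute-force enumeration of all hidden state paths.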

I also have some questions about the score method. Why does score calculate log probability? And why, if I pass several sequences to score, does it return the sum of their log probabilities instead of a probability for each sequence?

My original task is the following: I have 1 million short sentences of equal size (10 words). I want to train an HMM with that data and, for test data (10-word sentences again), predict the probability of each sentence under the model. Based on that probability I will decide whether the phrase is usual or unusual.
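Assuming you have already computed one log-probability per sentence (e.g. by scoring each sentence separately), the usual/unusual decision can be a simple threshold. A sketch, where the percentile cutoff is an arbitrary assumption you would tune for your data:

```python
import numpy as np

def flag_unusual(train_logprobs, test_logprobs, pct=1.0):
    # Threshold: the pct-th percentile of the training sentences'
    # log-probabilities; any test sentence scoring below it is
    # flagged as unusual.
    threshold = np.percentile(train_logprobs, pct)
    return test_logprobs < threshold
```

Because all your sentences have the same length (10 words), the scores are directly comparable; for variable-length data you would normally divide each score by the sequence length first.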

And maybe there are better Python libraries for solving these problems?


Solution

  • If you are fitting the model on a single sequence, you should use score(X) and fit(X) to solve the first and third problems, respectively (since lengths=None is the default value, you do not need to pass it explicitly). When working with multiple sequences, pass the list of their lengths as the lengths parameter; see the documentation.

    The score method calculates log probability for numerical stability. Multiplying many small numbers can result in numerical underflow - the product becomes too small to distinguish from zero in floating point. The solution is to sum their logarithms instead.
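    The underflow problem is easy to demonstrate. A chain of even moderately small per-step probabilities collapses to exactly zero in float64, while the sum of logs stays a perfectly ordinary number:

```python
import numpy as np

# 1000 per-step probabilities of 0.1: the raw product is 1e-1000,
# far below float64's smallest representable value, so it underflows
# to exactly 0.0 -- but the sum of logs is stable.
probs = np.full(1000, 0.1)
product = np.prod(probs)          # underflows to 0.0
log_prob = np.sum(np.log(probs))  # -1000 * log(10), about -2302.6
```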

    The score method returns the sum of the log probabilities of all sequences simply because that is how it is implemented. A feature request for per-sequence scores was submitted a month ago, though, so maybe it will appear soon: https://github.com/hmmlearn/hmmlearn/issues/272 Or you can simply score each sequence separately.
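    Scoring each sequence separately is just a matter of slicing X by lengths. A sketch - it only assumes the model exposes a score(X) method, so it should work with an hmmlearn model, though it is verified here with a stand-in scorer rather than hmmlearn itself:

```python
import numpy as np

def score_each(model, X, lengths):
    """One log-probability per sequence, instead of the summed total
    that model.score(X, lengths) returns."""
    offsets = np.cumsum(lengths)[:-1]  # boundaries between sequences in X
    return np.array([model.score(seq) for seq in np.split(X, offsets)])
```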