I'm using the code below to fit an HMM with two hidden states, a vocabulary of size 5 (so 5 possible symbols), and a list of sequences, each with 10 observations.
I don't understand why model.emissionprob_ has size (2, 10), when the number of columns should be equal to the number of symbols available in the vocabulary, 5 in this case.
from hmmlearn import hmm
import numpy as np
import pandas as pd
# Define the sequences
sequences = np.random.randint(1, 6, size=(100000, 10)).tolist()
# Convert sequences to numpy array
sequences_np = np.array(sequences)
# Create and fit the Multinomial HMM model with 2 hidden states
model = hmm.MultinomialHMM(n_components=2)
model.fit(sequences_np)
# Print the model parameters
print("Initial state distribution:")
print(model.startprob_)
print("\nTransition matrix:")
print(model.transmat_)
print("\nEmission probabilities:")
print(model.emissionprob_)
Short answer: it sounds like you might need CategoricalHMM instead of MultinomialHMM. hmmlearn used to call the categorical model MultinomialHMM (see https://github.com/hmmlearn/hmmlearn/issues/335). You might also want to use values between 0 and 4 (rather than 1 and 5) if the outcome can take five values; otherwise CategoricalHMM creates an extra category for 0.
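For concreteness, here is a minimal sketch of that fix (my own illustration, assuming hmmlearn 0.2.8 or newer, where CategoricalHMM is available; it passes the data in the (n_samples, 1) column plus lengths layout that hmmlearn uses for multiple concatenated sequences):
from hmmlearn import hmm
import numpy as np
# Simulate 1000 sequences of 10 observations, with symbols coded 0..4
rng = np.random.default_rng(0)
sequences = rng.integers(0, 5, size=(1000, 10))
# Concatenate the sequences into a single column and record each sequence's length
X = sequences.reshape(-1, 1)
lengths = [sequences.shape[1]] * sequences.shape[0]
model = hmm.CategoricalHMM(n_components=2, random_state=0)
model.fit(X, lengths)
print(model.emissionprob_.shape)  # (2, 5): 2 states x 5 symbols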
I describe the difference between the two models below.
The multinomial distribution models the number of times each possible outcome occurs in n trials. The standard example is to count the number of times each side comes up when throwing a die n times. In this example, if n = 10, the observations might be
(2, 2, 2, 0, 0, 4)
(1, 2, 3, 1, 3, 0)
(1, 0, 2, 4, 2, 1)
(4, 0, 0, 2, 1, 3)
(2, 1, 1, 0, 2, 4)
...
The first observation tells us that, in the n = 10 die throws, the sides 1, 2 and 3 each came up two times, the sides 4 and 5 didn't come up, and 6 came up four times. Note that each observation is a sequence of 6 counts, one for each possible outcome (i.e., each side of the die), and that the counts add to 10 (the number of trials/throws) in each observation.
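For intuition, this is the kind of count data you could simulate directly with NumPy (a hypothetical illustration, not part of your code):
import numpy as np
rng = np.random.default_rng(0)
# Each row counts the outcomes of n = 10 throws of a fair six-sided die
counts = rng.multinomial(10, [1/6] * 6, size=5)
print(counts)              # shape (5, 6): one row of 6 counts per observation
print(counts.sum(axis=1))  # every row sums to 10, the number of trials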
The parameters of interest in this model are usually the probabilities of the different possible outcomes. In the example above, those would just be (1/6, 1/6, 1/6, 1/6, 1/6, 1/6) if the die is fair, and we could estimate them from data (i.e., from the observations above). In the HMM context, the assumption is that these probabilities are determined by the state of the hidden process at each time point.
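In a two-state multinomial HMM over a die, the emission parameters could therefore look like this (made-up numbers, purely for illustration):
import numpy as np
# Hypothetical state-dependent outcome probabilities:
# one row per hidden state, one column per possible outcome (die side)
emissionprob = np.array([
    [1/6, 1/6, 1/6, 1/6, 1/6, 1/6],       # state 0: a fair die
    [1/2, 1/10, 1/10, 1/10, 1/10, 1/10],  # state 1: a die loaded towards side 1
])
print(emissionprob.sum(axis=1))  # each row sums to 1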
This explains the results you are getting when using MultinomialHMM
on the data you simulated, which have the form
(4, 1, 3, 3, 4, 3, 2, 1, 2, 4)
(1, 1, 3, 5, 2, 1, 3, 4, 2, 2)
(3, 3, 5, 1, 5, 4, 2, 4, 5, 1)
(3, 1, 5, 2, 3, 2, 4, 5, 3, 1)
hmmlearn interprets these as counts for 10 possible outcomes (the length of each observation vector), and tries to estimate the probability of each outcome in each state. This is why it returns two vectors of 10 probabilities in your example. Because you generated the values randomly, it estimates all probabilities to be pretty close to 0.1.
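A quick, hypothetical way to check this interpretation against your own arrays (it reuses sequences_np and the fitted model from your snippet):
# Every column of your data has roughly the same mean, so each of the 10
# "outcomes" ends up with an estimated probability near 1/10
column_means = sequences_np.mean(axis=0)
print(column_means / column_means.sum())  # roughly 0.1 everywhere
print(model.emissionprob_.shape)          # (2, 10): 2 states x 10 "outcomes"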
The categorical distribution models a variable that can take on several (qualitative) values with some probabilities. It can be viewed as a special case of the multinomial distribution where the number of trials (n) is equal to 1.
In the die example, each observation would just be the outcome of one throw:
4
4
3
5
1
...
Note that, in contrast with the multinomial example above, the numbers represent outcomes (indexed by integers out of convenience), rather than counts. In the multinomial example, the number of possible outcomes corresponded to the length of each observation vector; here, it is determined by how many distinct values show up in the sequence of observations.
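In code, categorical die data would just be a single column of symbols, and the number of categories follows from the values themselves (again a hypothetical illustration):
import numpy as np
rng = np.random.default_rng(0)
# 10 single throws of a fair die, with sides coded 0..5
throws = rng.integers(0, 6, size=(10, 1))
print(throws.ravel())     # one outcome per observation
print(np.unique(throws))  # the distinct outcomes that appear in the data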
The parameters of this model are also probabilities for the different outcomes, i.e., (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
in the die example. In a categorical HMM, we assume that these probabilities are driven by a state process, and so some outcomes might be more frequent in state 1 and other outcomes more frequent in state 2.
Using the code from your question, but replacing MultinomialHMM by CategoricalHMM (and simulating values between 0 and 4), the emission probabilities estimated by hmmlearn become:
>>> print(model.emissionprob_)
[[0.19290415 0.20328402 0.19841795 0.20260928 0.2027846 ]
[0.58753269 0.01360612 0.30773709 0.06664846 0.02447564]]
As expected, we get two vectors of 5 probabilities: the probabilities of the five outcomes in the two states.