pythonhidden-markov-modelshmmlearn

How to create a transition matrix from dictionary with transitions and counts


I am trying to create an HMM and I want to create my transition matrix but i'm not sure how. I have a dictionary with transitions and the probability of those transitions occuring which looks as follows (only larger):

{(1, 2): 0.0035842293906810036, (2, 3): 0.0035842293906810036, (3, 4): 0.0035842293906810036, (4, 5): 0.0035842293906810036, (5, 6): 0.0035842293906810036, (6, 7): 0.0035842293906810036, (7, 8)}

which I defined as follows:

# create a list of bigrams
bigrams = []
for i in range(len(integer_list)):
    if i+1 in range(len(integer_list)):
        bigrams.append((integer_list[i], integer_list[i+1]))

# Create a dictionary containing the counts each bigram occurs in the dataset
bigrams_dict = Counter(bigrams)
values = bigrams_dict.values()

# create a dictionary containing the probability of a word occurring. <- initial probs
frequencies = {key:float(value)/sum(counts_dict.values()) for (key,value) in counts_dict.items()}
frequency_list = []
for value in frequencies.values():
    frequency_list.append(value)

Now I would like to make a transition matrix out of this, which would be a multi dimensional array, but I'm not sure how to do this. Could anyone please help me.

An example of what the transition matrix would look something like this (only with more states of course):


   0   1/3  2/3
   0   2/3  1/3
   1    0    0

Solution

  • The general procedure is just to pre-define a matrix of zeroes with the correct dimensions, then fill it one element at a time. Don't overthink this kind of task.

    For example, if you know that you have exactly 8 states, you can construct the matrix like this, using your frequencies dict:

    import numpy as np
    
    n_states = 8
    transitions = np.zeroes((n_states, n_states), dtype=np.float)
    
    for (state1, state2), probability in frequencies.items():
        transitions[state1, state2] = probability
    

    For a very large number of states, this could take a while, depending on how fast your computer is.

    If you do not know the total number of states, you can estimate it by computing the largest state number in your data:

    from itertools import chain
    
    n_states = max(chain.from_iterable(frequencies.keys()))