
What format does Biopython 1.78 MarkovModel.train_visible() expect for training_data?


I want to train a second-order Markov model for a nucleotide sequence using biopython's Bio.MarkovModel.train_visible(). That is, alphabet=["A","T","G","C"], states=["AA","AT","TT"...]

However, I get an error:

    474     states_indexes = itemindex(states)
    475     outputs_indexes = itemindex(alphabet)
--> 476     for toutputs, tstates in training_data:
    477         if len(tstates) != len(toutputs):
    478             raise ValueError("states and outputs not aligned")
 ValueError: too many values to unpack (expected 2)

This suggests that I am probably passing training_data in the wrong format. I've tried giving my training_data as a pair of lists:

training_data=(['A','T'...],['AA','AT'...])

and as a zipped list of that pair:

training_data=[('A','AA'),('T','AT')...]

but to no avail. What is the proper format of training_data? Thanks!


Solution

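  • training_data must be a list of (outputs, states) pairs, where the two sequences in each pair have the same length. See the file test_MarkovModel.py for an example of expected input: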

    >>> from Bio import MarkovModel
    
    >>> states = ["0", "1", "2", "3"]
    >>> alphabet = ["A", "C", "G", "T"]
    >>> training_data = [
    ...     ("AACCCGGGTTTTTTT", "001112223333333"),
    ...     ("ACCGTTTTTTT", "01123333333"),
    ...     ("ACGGGTTTTTT", "01222333333"),
    ...     ("ACCGTTTTTTTT", "011233333333"),
    ... ]
    >>> markov_model = MarkovModel.train_visible(states, alphabet, training_data)
    >>> states = MarkovModel.find_states(markov_model, "AACGTT")
    >>> print(states)
    [(['0', '0', '1', '2', '3', '3'], 0.008212890625000005)]
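For the second-order case in the question, the same shape can be built with dinucleotide states. Passing a bare pair of lists fails because the loop treats each list as one item and tries to unpack it into two names. Zipping fails the length check, since 'A' and 'AA' differ in length. The following is only a sketch under stated assumptions: the helper build_training_pair, the example sequences, and the choice to label each position with the dinucleotide ending at it are illustrative and do not come from Biopython. Because the state names are two characters long, the state path is given as a list rather than a string.

```python
from itertools import product

alphabet = ["A", "C", "G", "T"]
# All 16 dinucleotide states: "AA", "AC", ..., "TT"
states = ["".join(pair) for pair in product(alphabet, repeat=2)]


def build_training_pair(seq):
    """Hypothetical helper: return (outputs, state_path) for one sequence.

    Assumption: the state at position i is the dinucleotide seq[i-1:i+1],
    and the output at that position is seq[i]. The first nucleotide has no
    predecessor, so it is dropped from the outputs.
    """
    outputs = list(seq[1:])
    state_path = [seq[i - 1:i + 1] for i in range(1, len(seq))]
    return outputs, state_path


# Made-up example sequences, for illustration only
sequences = ["AATGC", "TTGCA"]

# A list of (outputs, states) pairs, one pair per training sequence
training_data = [build_training_pair(s) for s in sequences]

# Each pair must be aligned, which is what train_visible checks
for outputs, state_path in training_data:
    assert len(outputs) == len(state_path)

# With Biopython 1.78 installed, this list would then be passed on as:
# from Bio import MarkovModel
# model = MarkovModel.train_visible(states, alphabet, training_data)
```

Here training_data[0] is (['A', 'T', 'G', 'C'], ['AA', 'AT', 'TG', 'GC']): one tuple per sequence, wrapped in an outer list, which matches the shape of the test data above.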