Tags: python, python-3.x, combinations, combinatorics, biopython

Calculate all the possible combinations from a given sequence (in python)


I have the following dictionary, which maps each amino acid (keys, one-letter codes) to all of its possible codons (values, triplets). In bioinformatics this is known as the 'DNA codon table'.

codon_table = {
'A': ('GCT', 'GCC', 'GCA', 'GCG'),
'C': ('TGT', 'TGC'),
'D': ('GAT', 'GAC'),
'E': ('GAA', 'GAG'),
'F': ('TTT', 'TTC'),
'G': ('GGT', 'GGC', 'GGA', 'GGG'),
'H': ('CAT', 'CAC'),
'I': ('ATT', 'ATC', 'ATA'),
'K': ('AAA', 'AAG'),
'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
'M': ('ATG',),
'N': ('AAT', 'AAC'),
'P': ('CCT', 'CCC', 'CCA', 'CCG'),
'Q': ('CAA', 'CAG'),
'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
'T': ('ACT', 'ACC', 'ACA', 'ACG'),
'V': ('GTT', 'GTC', 'GTA', 'GTG'),
'W': ('TGG',),
'Y': ('TAT', 'TAC'),}

I would like to generate all the possible combinations of triplets for a given sequence of keys. For instance, the sequence FMW should give the following two results: TTTATGTGG and TTCATGTGG. The number of combinations should be the product of the number of values for each key in the sequence. For FMW that is 2*1*1 = 2 combinations.
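
As a quick check of that count, the product can be computed directly without generating any sequences. This is only a sketch: `count_combinations` is a hypothetical helper name, the table is a short excerpt of the full codon table above, and `math.prod` assumes Python 3.8 or later.

```python
import math

# Excerpt of the full codon table, for illustration only.
codon_table = {
    'F': ('TTT', 'TTC'),
    'M': ('ATG',),
    'W': ('TGG',),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
}

def count_combinations(seq, table):
    """Return the number of codon sequences that encode `seq`."""
    # One factor per letter: the number of codons available for it.
    return math.prod(len(table[aa]) for aa in seq)

print(count_combinations('FMW', codon_table))   # 2 * 1 * 1 = 2
print(count_combinations('LLLL', codon_table))  # 6 ** 4 = 1296
```

The second call shows how quickly the count grows when a sequence contains letters with many codons.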

What is the most Pythonic and efficient way to do this for sequences of 10 (or more) letters? Is there an existing method for this in any Biopython package?

Thanks in advance.


Solution

  • This assumes seq is the list of keys you have. If you have it in another form, such as a string, that works too: iterating over a string yields its characters, or you can convert it explicitly with list(seq). From there, itertools.product does exactly what you want. Here is the full code -

    import itertools
    codon_table = {
    'A': ('GCT', 'GCC', 'GCA', 'GCG'),
    'C': ('TGT', 'TGC'),
    'D': ('GAT', 'GAC'),
    'E': ('GAA', 'GAG'),
    'F': ('TTT', 'TTC'),
    'G': ('GGT', 'GGC', 'GGA', 'GGG'),
    'H': ('CAT', 'CAC'),
    'I': ('ATT', 'ATC', 'ATA'),
    'K': ('AAA', 'AAG'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
    'M': ('ATG',),
    'N': ('AAT', 'AAC'),
    'P': ('CCT', 'CCC', 'CCA', 'CCG'),
    'Q': ('CAA', 'CAG'),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'),
    'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'W': ('TGG',),
    'Y': ('TAT', 'TAC'),}
    
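    seq = ['F', 'M', 'W']
    # Collect the codon choices for each key in the sequence
    t = [codon_table[key] for key in seq]
    # Cartesian product: pick one codon per position in every possible way
    print(list(itertools.product(*t)))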
    

    Output

    [('TTT', 'ATG', 'TGG'), ('TTC', 'ATG', 'TGG')]
    

    Output in the format the OP asked for

    If you want the output exactly as shown in the question, replace the last print statement with -

    output = list(itertools.product(*t))
    print(output)
    
    output_flat = [ ''.join(a) for a in output ]
    print(output_flat)
    

    The second print statement outputs -

    ['TTTATGTGG', 'TTCATGTGG']
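
For sequences of 10 or more letters the full list can get large, so it may be worth iterating lazily instead of building it all at once. The sketch below assumes that fits your use case. `iter_codon_sequences` is a hypothetical helper name, and the table is a short excerpt of the one above. The sequence is passed as a plain string here, since iterating over it yields its letters.

```python
import itertools

# Excerpt of the full codon table, for illustration only.
codon_table = {
    'F': ('TTT', 'TTC'),
    'I': ('ATT', 'ATC', 'ATA'),
    'M': ('ATG',),
    'W': ('TGG',),
}

def iter_codon_sequences(seq, table):
    """Lazily yield every DNA string that encodes the sequence `seq`."""
    # Look up the codon choices for each letter; a KeyError here means
    # the letter is not in the table.
    choices = [table[aa] for aa in seq]
    for combo in itertools.product(*choices):
        yield ''.join(combo)

# Take only the first three results without generating the rest.
first_three = list(itertools.islice(iter_codon_sequences('FIMW', codon_table), 3))
print(first_three)
# ['TTTATTATGTGG', 'TTTATCATGTGG', 'TTTATAATGTGG']
```

Because the generator produces one string at a time, memory use stays small even when the total number of combinations is in the millions. Generating all of them still takes time proportional to that total.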
    

    Hope that helps. Cheers!