I have the following dictionary, which maps each amino acid (the keys, one-letter codes) to all of its possible codons (the values, triplets). This dictionary is also known as the 'DNA codon table' in bioinformatics.
codon_table = {
'A': ('GCT', 'GCC', 'GCA', 'GCG'),
'C': ('TGT', 'TGC'),
'D': ('GAT', 'GAC'),
'E': ('GAA', 'GAG'),
'F': ('TTT', 'TTC'),
'G': ('GGT', 'GGC', 'GGA', 'GGG'),
'H': ('CAT', 'CAC'),
'I': ('ATT', 'ATC', 'ATA'),
'K': ('AAA', 'AAG'),
'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
'M': ('ATG',),
'N': ('AAT', 'AAC'),
'P': ('CCT', 'CCC', 'CCA', 'CCG'),
'Q': ('CAA', 'CAG'),
'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
'T': ('ACT', 'ACC', 'ACA', 'ACG'),
'V': ('GTT', 'GTC', 'GTA', 'GTG'),
'W': ('TGG',),
'Y': ('TAT', 'TAC'),}
I would like to generate all the possible combinations of triplets for a given sequence of keys. For instance, the sequence FMW should give the following two results: TTTATGTGG and TTCATGTGG. The number of combinations is the product of the number of values for each key in the sequence. For FMW that is 2*1*1 = 2 combinations.
What is the most Pythonic and efficient way to do this for sequences of 10 (or more) letters? Is there an already implemented method in any Biopython package?
Thanks in advance.
Assuming seq here is the list of keys that you have. If you have it in any other form (like a string), it can easily be treated as a char array and broken down into the seq list. Once you do that, itertools does a wonderful job doing exactly what you want. Here is the full code:
import itertools
codon_table = {
'A': ('GCT', 'GCC', 'GCA', 'GCG'),
'C': ('TGT', 'TGC'),
'D': ('GAT', 'GAC'),
'E': ('GAA', 'GAG'),
'F': ('TTT', 'TTC'),
'G': ('GGT', 'GGC', 'GGA', 'GGG'),
'H': ('CAT', 'CAC'),
'I': ('ATT', 'ATC', 'ATA'),
'K': ('AAA', 'AAG'),
'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
'M': ('ATG',),
'N': ('AAT', 'AAC'),
'P': ('CCT', 'CCC', 'CCA', 'CCG'),
'Q': ('CAA', 'CAG'),
'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
'T': ('ACT', 'ACC', 'ACA', 'ACG'),
'V': ('GTT', 'GTC', 'GTA', 'GTG'),
'W': ('TGG',),
'Y': ('TAT', 'TAC'),}
seq = ['F', 'M', 'W']
t = [ list(codon_table[key]) for key in seq ]
print(list(itertools.product(*t)))
Output
[('TTT', 'ATG', 'TGG'), ('TTC', 'ATG', 'TGG')]
OP Output
Further, if you want the output in exactly the form shown in the question, replace the last print statement with the following:
output = list(itertools.product(*t))
print(output)
output_flat = [ ''.join(a) for a in output ]
print(output_flat)
The final print statement outputs:
['TTTATGTGG', 'TTCATGTGG']
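For sequences of 10 or more letters the number of results grows quickly (the product of the per-letter codon counts), so building the whole list at once may use a lot of memory. Here is a minimal sketch of a lazy variant. The function names count_back_translations and iter_back_translations are made up for illustration, the table is shortened to keep the example small, and math.prod assumes Python 3.8 or newer.

```python
import itertools
import math

# Shortened table for illustration only; use the full codon_table in practice.
codon_table = {
    'F': ('TTT', 'TTC'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
    'M': ('ATG',),
    'W': ('TGG',),
}

def count_back_translations(protein, table):
    """Number of DNA sequences for `protein`, computed without generating them."""
    return math.prod(len(table[aa]) for aa in protein)

def iter_back_translations(protein, table):
    """Yield each DNA sequence one at a time instead of building a full list."""
    # A string is iterable character by character, so no explicit list is needed.
    for codons in itertools.product(*(table[aa] for aa in protein)):
        yield ''.join(codons)

print(count_back_translations('FMW', codon_table))
for dna in iter_back_translations('FMW', codon_table):
    print(dna)
```

Checking the count first lets you decide whether it is reasonable to enumerate everything, and the generator lets you stop early or process the results one by one.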
Hope that helps. Cheers!