pythonbioinformaticsrosalind

Python: Multiple Consensus sequences


starting from a list of dna sequences, I must have in return all the possible consensus (the resulting sequence with the highest nucleotide frequency in each position) sequences. If in some positions the nucleotides have the same highest frequency, I must obtain all possible combinations with the highest frequency. I also must have in return the profile matrix ( a matrix with the frequencies of each nucleotide for each sequence).

This is my code so far (but it returns only one consensus sequence):

seqList = ['TTCAAGCT','TGGCAACT','TTGGATCT','TAGCAACC','TTGGAACT','ATGCCATT','ATGGCACT']
n = len(seqList[0])
profile = { 'T':[0]*n,'G':[0]*n ,'C':[0]*n,'A':[0]*n }

for seq in seqList:

    for i, char in enumerate(seq):
        profile[char][i] += 1



consensus = ""
for i in range(n):
    max_count = 0
    max_nt = 'x'
    for nt in "ACGT":
        if profile[nt][i] > max_count:
            max_count = profile[nt][i]
            max_nt = nt
    consensus += max_nt
print(consensus)
for key, value in profile.items():
     print(key,':', " ".join([str(x) for x in value] ))

TTGCAACT
C : 0 0 1 3 2 0 6 1
A : 2 1 0 1 5 5 0 0
G : 0 1 6 3 0 1 0 0
T : 5 5 0 0 0 1 1 6

(As you can see, in position four, C and G have the same highest score, it means I must obtain two consensus sequences)

Is it possible to modify this code to obtain all the possible sequences, or could you explain me the logic (the pseudocode) how to obtain the right result?

Thank you very much in advance!


Solution

  • I'm sure there are better ways but this is a simple one:

    bestseqs = [[]]
    for i in range(n):
        d = {N:profile[N][i] for N in ['T','G','C','A']}
        m = max(d.values())
        l = [N for N in ['T','G','C','A'] if d[N] == m]
        bestseqs = [ s+[N] for N in l for s in bestseqs ]
    
    for s in bestseqs:
        print(''.join(s))
    
    # output:
    ATGGAACT
    ATGCAACT