pythonstringbioinformaticsrosalind

Python: Rosalind Consensus and Profile


I am trying to solve the "Consensus and Profile" challenge on Rosalind. The challenge instructions are as follows:

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

My code is as follows (I got most of it from another user on this website). My only issue is that some of the DNA strands are broken down into multiple separate lines, so they are being appended to the "allstrings" list as separate strings. I am trying to figure out how to write each consecutive line that does not contain ">" as a single string.

import numpy as np

seq = []
allstrings = []
temp_seq = []
matrix = []
C = []
G = []
T = []
A = []
P = []
consensus = []
position = 1

file = open("C:/Users/knigh/Documents/rosalind_cons (3).txt", "r")
conout = open("C:/Users/knigh/Documents/consensus.txt", "w")

# Right now, this is reading and writing each as an individual line. Thus, it
#  is splitting each sequence into multiple small sequences. You need to figure
#  out how to read this in FASTA format to prevent this from occurring
desc = file.readlines()

for line in desc:
    allstrings.append(line)

for string in range(1, len(allstrings)):
    if ">" not in allstrings[string]:
        temp_seq.append(allstrings[string])
    else:
        seq.insert(position, temp_seq[0])
        temp_seq = []
        position += 1

# This last insertion into the sequence must be performed after the loop to empty
#  out the last remaining string from temp_seq
seq.insert(position, temp_seq[0])

for base in seq:
    matrix.append([pos for pos in base])

M = np.array(matrix).reshape(len(seq), len(seq[0]))

for base in range(len(seq[0])):
    A_count = 0
    C_count = 0
    G_count = 0
    T_count = 0
    for pos in M[:, base]:
        if pos == "A":
            A_count += 1
        elif pos == "C":
            C_count += 1
        elif pos == "G":
            G_count += 1
        elif pos == "T":
            T_count += 1
    A.append(A_count)
    C.append(C_count)
    G.append(G_count)
    T.append(T_count)

profile_matrix = {"A": A, "C": C, "G": G, "T": T}

P.append(A)
P.append(C)
P.append(G)
P.append(T)

profile = np.array(P).reshape(4, len(A))

for pos in range(len(A)):
    if max(profile[:, pos]) == profile[0, pos]:
        consensus.append("A")
    elif max(profile[:, pos]) == profile[1, pos]:
        consensus.append("C")
    elif max(profile[:, pos]) == profile[2, pos]:
        consensus.append("G")
    elif max(profile[:, pos]) == profile[3, pos]:
        consensus.append("T")

conout.write("".join(consensus) + "\n")

for k, v in profile_matrix.items():
    conout.write(k + ": " + " ".join(str(x) for x in v) + "\n")

conout.close()

Solution

  • There are a couple of ways that you can iterate a FASTA file as records. You can use a prebuilt library or write your own.

    A widely used library for working with sequence data is biopython. This code snippet will create a list of strings.

    from Bio import SeqIO
    
    
    file = "path/to/your/file.fa"
    sequences = []
    
    with open(file, "r") as file_handle:
        for record in SeqIO.parse(file_handle, "fasta"):
            sequences.append(record.seq)
    

    Alternatively, you can write your own FASTA parser. Something like this should work:

    def read_fasta(fh):
        # Iterate to get first FASTA header        
        for line in fh:
            if line.startswith(">"):
                name = line[1:].strip()
                break
    
        # This list will hold the sequence lines
        fa_lines = []
    
        # Now iterate to find the get multiline fasta
        for line in fh:
            if line.startswith(">"):
                # When in this block we have reached 
                #  the next FASTA record
    
                # yield the previous record's name and
                #  sequence as tuple that we can unpack
                yield name, "".join(fa_lines)
    
                # Reset the sequence lines and save the
                #  name of the next record
                fa_lines = []
                name = line[1:].strip()
    
                # skip to next line
                continue
    
            fa_lines.append(line.strip())
    
        yield name, "".join(fa_lines)
    

    You can use this function like so:

    file = "path/to/your/file.fa"
    sequences = []
    
    with open(file, "r") as file_handle:
        for name, seq in read_fasta(file_handle):
            sequences.append(seq)