I have been trying to reverse complement a fasta DNA sequence. Here is my code:
fastafile=open('sequence (3).fasta','r')
entries=[]
reverse=""
sequence=['A','T','G','C','N']
for line in fastafile:
if not line.startswith('>'):
line = line.split()
entries.append(line)
print entries
for index in range(0,len(entries[::-1])):
if index !=sequence:
print "this is not a valid nucleotide"
break
else:
if index=='A':
reverse+='T'
elif index=='T':
reverse+='A'
elif index=='C':
reverse+='G'
elif index=='G':
reverse+ 'C'
elif index=='N':
reverse+='N'
print reverse
And each time I get the output, this is not a valid nucleotide, even when my print of entries show that it has the items in sequence. Here is a sample of the output when I print enteries;
[['GCTCCCCTGAGGTTCGGCACCCACACTCCCTTCCCAGGAGCTCGCGATGCAAGAGCCACAGTCAGAGCTC'], ['AATATCGACCCCCCTCTGAGCCAGGAGACATTTTCAGAATTGTGGAACCTGCTTCCTGAAAACAATGTTC'], ['TGTCTTCGGAGCTGTGCCCAGCAGTGGATGAGCTGCTGCTCCCAGAGAGCGTCGTGAACTGGCTAGACGA']
How can I override this problem? I just want to add that I only started seriously programming with python about 2 months ago so I am still learning and improving. Thank you!
Your loop statement is:
for index in range(0,len(entries[::-1])):
This will iterate over the length of entries, that is, 0, 1, 2, 3, ..., len(entries)
.
When you do if index != sequence
you are actually comparing an integer with a list, say if 3 != ['A', 'C', 'T', 'G']
. I assume you can see that makes no sense. What you probably want to do is see if the nucleotide in the sequence is a valid nucleotide, ergo it is in the sequence
list. You can do it like this:
if entries[::-1][index] in sequence # Will be true if the nucleotide at entries[::-1][index] is inside sequence
Let me say two things:
First one, you don't have to range to len(entries[::-1])
, it is the same as len(entries)
Secondly, and more important, there is an actual module build specially for bioinformatics. It is called Biopython. It has special objects and functions. For example, your problem could be resolved as follows:
-
from Bio.Seq import Seq
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print dna.reverse_complement()
Output: CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT