I am trying to combine some protein sequences in FASTA format and then remove duplicates. I found some code by searching, and it works well enough, but I ran into an error that I couldn't understand. Here are the example sequences that cause the error:
>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone – in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR
The original file and sequences are long, so I shortened them here.
I found this code on the forum which works fine and writes a new file without duplicates:
from Bio import SeqIO
import time

start = time.time()
seen = []
records = []
for record in SeqIO.parse("Prob2.fa", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)

# writing to a FASTA file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()
print(f"Run time is {(end - start)/60}")
Now, the Python interpreter is giving me this error:
Traceback (most recent call last):
  File "C:\Users\Arif\Desktop\DuplicateSequenceFinder\DuplicateFinder.py", line 10, in <module>
    for record in SeqIO.parse("Prob2.fa", "fasta"):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 72, in __next__
    return next(self.records)
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 238, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 50, in SimpleFastaParser
    for line in handle:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 449: illegal multibyte sequence
I found that the problem is in the header containing the "-" character, written as "- in claim" (the third sequence in the list). If I remove it, the code works fine, but other sequence headers contain "-" as well. I tracked it down by removing half of the sequences at a time and checking whether the error persisted. If I delete this "-" and type a new "-", it works fine. So I am trying to understand what the real problem is, so that I can write the input in the correct format in the future.
I originally wrote these sequences in Word, and later edited them in Notepad++ and saved them as a ".fa" file.
Secondly, I want to find out how many duplicates were found and list their record IDs/headers. If someone can help me with the lines of code I should insert, I will be very grateful.
OK, here is my attempt. I cannot reproduce your error using the same input:
>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone - in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR
Try the following code:
from Bio import SeqIO
import time

start = time.time()
seen = []
records = []
filename = 'Prob2.fa'
with open(filename, 'r', encoding='utf-8') as f:
    for record in SeqIO.parse(f, "fasta"):
        if str(record.seq) not in seen:
            seen.append(str(record.seq))
            records.append(record)

# writing to a FASTA file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()
print(f"Run time is {(end - start)/60}")
Let us know if it works.
I can reproduce your error by using this in my code:
with open(filename, 'r', encoding='gbk') as f:
and adding the character 丆 to one of your headers. The error goes away if I delete the 丆 from the FASTA header.
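Since you mention that other headers may contain similar characters, here is a minimal sketch for locating every non-ASCII character in the headers. The helper name find_non_ascii is made up for this example, and the sample lines are taken from your post; I am assuming the character in your header is the en dash (U+2013) that appears in the sample you pasted.

```python
import unicodedata


def find_non_ascii(lines):
    """Hypothetical helper: yield (line_number, column, character, unicode_name)
    for every non-ASCII character found in an iterable of text lines."""
    for line_number, line in enumerate(lines, start=1):
        for column, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                yield line_number, column, ch, unicodedata.name(ch, "UNKNOWN")


# Two headers from the question; \u2013 is an en dash, not the ASCII hyphen "-".
sample = [">thirdseq", ">thirdone \u2013 in claim"]
hits = list(find_non_ascii(sample))
for line_number, column, ch, name in hits:
    print(f"line {line_number}, column {column}: {ch!r} ({name})")
```

To scan the real file, you could pass the open file object instead of the sample list, for example find_non_ascii(f) inside a with open(filename, 'r', encoding='utf-8') as f: block, assuming the file really is UTF-8.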
As Poshi pointed out:
This looks like an encoding issue. Not sure why the data is being decoded with the GBK decoder.
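A possible explanation, which I cannot confirm without the original file: Word tends to replace a typed hyphen with an en dash (U+2013), and if the file was then saved as UTF-8, that character is stored as three bytes, the last of which is 0x93. When open() is called without an encoding argument, Python uses the locale's preferred encoding, which on a Chinese-locale Windows system is typically GBK. The sketch below only illustrates that hypothesis on a single header line; it does not read your file.

```python
# Assumption: the header contains an en dash (U+2013) and the file is UTF-8.
header = ">thirdone \u2013 in claim\n"
raw = header.encode("utf-8")

# The en dash becomes the three bytes E2 80 93 in UTF-8.
print(raw)

# Decoding the same bytes as GBK fails, as in the traceback.
try:
    raw.decode("gbk")
except UnicodeDecodeError as err:
    print("GBK decoding failed:", err)

# Decoding with the encoding the bytes were written in works.
print(raw.decode("utf-8"), end="")
```

If this is what happened, passing encoding='utf-8' explicitly to open(), or retyping the character as a plain ASCII hyphen, would both avoid the error, which matches what you observed.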
See https://github.com/biopython/biopython/blob/master/Bio/SeqIO/__init__.py#L559 for an explanation of how to feed data to SeqIO.parse(...):
Arguments: - handle - handle to the file, or the filename as a string
.......
If you have a string 'data' containing the file contents, you must first turn this into a handle in order to parse it:
As Poshi said, it should not be a Biopython issue. Try just:
filename = 'Prob2.fa'
with open(filename, 'r', encoding='utf-8') as f:  # or encoding='gbk'
    print(f.read())
on the same file and see if you get the same error.
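For the second part of your question (counting the duplicates and listing their headers), here is a rough sketch of one way to do it. The function name split_duplicates is invented for this example, and it works on plain (header, sequence) pairs so that it runs without Biopython; the example data are the first three entries from your post.

```python
def split_duplicates(pairs):
    """Hypothetical helper: split (header, sequence) pairs into unique entries
    and duplicates.

    Returns (unique, duplicates). `unique` keeps the first entry seen for each
    sequence; `duplicates` holds (duplicate_header, first_header) tuples.
    """
    first_seen = {}  # sequence -> header of its first occurrence
    unique = []
    duplicates = []
    for header, sequence in pairs:
        if sequence in first_seen:
            duplicates.append((header, first_seen[sequence]))
        else:
            first_seen[sequence] = header
            unique.append((header, sequence))
    return unique, duplicates


# First three entries from the question, as (header, sequence) pairs.
example = [
    ("someseq1", "MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV"),
    ("firstseq with 5 mutations:", "MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV"),
    ("secondseq with 9 mutations:", "MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV"),
]
unique, duplicates = split_duplicates(example)
print(f"{len(duplicates)} duplicate(s) found")
for header, first_header in duplicates:
    print(f"{header!r} has the same sequence as {first_header!r}")
```

With Biopython you could feed it ((record.description, str(record.seq)) for record in SeqIO.parse(f, "fasta")), since record.description holds the full header line. Using a dict here instead of the seen list also makes the lookup faster on long files, because checking membership in a list scans every element.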