
Error when removing duplicate sequences from a FASTA file: problem in the header


I am trying to combine some protein sequences in FASTA format and then remove duplicates. I found some code by searching, and it works well enough, but I ran into an error that I don't understand. Here is an example input that causes the error:

>someseq1
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>firstseq with 5 mutations:
MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>secondseq with 9 mutations:
MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
>thirdseq
MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
>thirdone – in claim
MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR

The original file and its sequences are long, so I shortened them here.

I found this code on the forum which works fine and writes a new file without duplicates:

from Bio import SeqIO
import time

start = time.time()

seen = []
records = []

for record in SeqIO.parse("Prob2.fa", "fasta"):
    if str(record.seq) not in seen:
        seen.append(str(record.seq))
        records.append(record)


#writing to a fasta file
SeqIO.write(records, "Checked.fa", "fasta")
end = time.time()

print(f"Run time is {(end- start)/60}")

Now, the Python interpreter is giving me this error:

Traceback (most recent call last):
  File "C:\Users\Arif\Desktop\DuplicateSequenceFinder\DuplicateFinder.py", line 10, in <module>
    for record in SeqIO.parse("Prob2.fa", "fasta"):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\Interfaces.py", line 72, in __next__
    return next(self.records)
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 238, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "C:\Users\Arif\AppData\Local\Programs\Python\Python311\Lib\site-packages\Bio\SeqIO\FastaIO.py", line 50, in SimpleFastaParser
    for line in handle:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 449: illegal multibyte sequence

I found that the problem is the "-" character in the header "thirdone – in claim" (the last sequence in the list). I located it by removing half of the sequences at a time and checking whether the error still occurred. If I remove that character, the script works, but other sequence headers contain a "-" as well. If I delete this "-" and type a new "-", it also works. So I am trying to understand what the real problem is, so that I can write the input in the correct format in the future.

I originally wrote these sequences in Word, later edited them in Notepad++, and saved the result as a ".fa" file.

Secondly, I want to find out how many duplicates were found and list their record IDs/headers. If someone can tell me which lines of code to insert, I will be very grateful.


Solution

  • Here is my attempt. I could not reproduce your error at first, using your same input:

    >someseq1
    MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
    >firstseq with 5 mutations:
    MKYFPLFPTLVFAARVVAFPAYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
    >secondseq with 9 mutations:
    MKYFPLFPTLVYAVGVVAFPDYASLAGLSQQELDAIIPTLEAREPGLPPGPLENSSAKLV
    >thirdseq
    MISQSFVSLTVLLLGLVNLSPAFAFPQYGSLAGLSARDLNVLIPRLNEVDPPTPPGPLAYNGTKLVHDDA
    >thirdone - in claim
    MISTSKHLFVLLPLFLVSHLSLVLGFPAYASLGGLTERQVEEYTSKLPIVFPPPPPEPIKDPWLKLVNDR
    

    Try the following code, which opens the file with an explicit encoding:

    from Bio import SeqIO
    import time
    
    start = time.time()
    
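    seen = set()   # sequences already encountered
    records = []   # first record found for each distinct sequence

    filename = 'Prob2.fa'

    # Open the file with an explicit encoding instead of the platform default
    with open(filename, 'r', encoding='utf-8') as f:
        for record in SeqIO.parse(f, "fasta"):
            sequence = str(record.seq)
            if sequence not in seen:
                seen.add(sequence)
                records.append(record)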
    
    
    #writing to a fasta file
    SeqIO.write(records, "Checked.fa", "fasta")
    end = time.time()
    
    print(f"Run time is {(end- start)/60}")
    

    Let us know if it works.
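The question also asks how many duplicates were found and which headers they belong to. A minimal sketch of one way to do that, assuming each record exposes `.seq` and `.description` the way Biopython's `SeqRecord` does. The function name `dedupe_with_report`, the `Rec` stand-in type, and the shortened sequences are hypothetical and only there for illustration:

```python
from collections import namedtuple


def dedupe_with_report(records):
    """Return (unique_records, duplicates).

    `duplicates` is a list of (duplicate header, header of the record kept).
    """
    first_seen = {}   # sequence string -> first record with that sequence
    unique = []
    duplicates = []
    for record in records:
        key = str(record.seq)
        if key in first_seen:
            duplicates.append((record.description, first_seen[key].description))
        else:
            first_seen[key] = record
            unique.append(record)
    return unique, duplicates


# Stand-in records with shortened, made-up sequences (not the real data)
Rec = namedtuple("Rec", "description seq")
demo = [
    Rec("someseq1", "MKYF"),
    Rec("firstseq with 5 mutations:", "MKYF"),
    Rec("secondseq with 9 mutations:", "MKYA"),
]

unique, duplicates = dedupe_with_report(demo)
print(f"{len(duplicates)} duplicate(s) found")
for dup_header, kept_header in duplicates:
    print(f"  {dup_header!r} has the same sequence as {kept_header!r}")
```

With Biopython, the same function could presumably be called as `unique, duplicates = dedupe_with_report(SeqIO.parse(f, "fasta"))` inside the `with` block, with `unique` then passed to `SeqIO.write`. Using the sequence as a dictionary key also avoids scanning a list for every record.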

    I can reproduce your error by changing the open call in my code to:

    with open(filename, 'r', encoding='gbk') as f:

    and adding the dash character from your last header (the – in ">thirdone – in claim") to one of the headers. The error goes away if I delete that character from the FASTA header.

    As Poshi pointed out:

    This looks like an encoding issue. Not sure why the data is being decoded with the GBK decoder.
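One possible explanation for the GBK error, sketched under assumptions: the dash in the failing header is an en dash (U+2013, which Word often substitutes for a typed hyphen), the file is stored as UTF-8, and Python's `open()` falls back to the system's preferred encoding (GBK on a Chinese-locale Windows machine) when no `encoding` argument is given. The helper name `try_decode` is hypothetical:

```python
# Assumption: the problematic character is an en dash saved as UTF-8.
dash = "\u2013"
utf8_bytes = dash.encode("utf-8")
print(utf8_bytes)  # the three bytes that represent the en dash in UTF-8

header_bytes = ">thirdone \u2013 in claim\n".encode("utf-8")


def try_decode(data, encoding):
    """Decode bytes, returning the error text instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"{encoding} failed: {exc}"


print(try_decode(header_bytes, "utf-8"))  # decodes cleanly
print(try_decode(header_bytes, "gbk"))    # fails on the last byte of the dash
```

Under these assumptions, retyping the character as a plain "-" in Notepad++ would store a single ASCII byte, which decodes the same way in both encodings. That would match the observation that deleting and retyping the dash makes the error disappear.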

    See https://github.com/biopython/biopython/blob/master/Bio/SeqIO/__init__.py#L559 for an explanation of how to feed data to SeqIO.parse():

    Arguments:
     - handle - handle to the file, or the filename as a string

    ...

    If you have a string 'data' containing the file contents, you must first turn this into a handle in order to parse it:

    As Poshi said, this should not be a Biopython issue. Try just:

    filename = 'Prob2.fa'
    
    with open(filename, 'r', encoding='utf-8') as f: #or encoding='gbk' 
        
        print(f.read())
    

    on the same file and see whether you get the same error.
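To find such characters before parsing, here is a small sketch that lists every non-ASCII character with its line and column. The name `find_non_ascii` and the sample text are hypothetical, and the text is assumed to have been decoded already (for example by reading the file with `encoding='utf-8'`):

```python
import unicodedata


def find_non_ascii(text):
    """Return (line number, column, character, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if ord(char) > 127:
                hits.append((line_no, col, char, unicodedata.name(char, "UNKNOWN")))
    return hits


# Made-up sample with shortened sequences; the third line contains an en dash
sample = ">thirdseq\nMISQ\n>thirdone \u2013 in claim\nMIST\n"
for hit in find_non_ascii(sample):
    print(hit)
```

Any header reported this way could then be retyped with plain ASCII characters, or the file could simply be opened with the encoding it was actually saved in.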