How to download multiple sequences in one fasta file from UniProt using Python 3


I wrote a Python script that downloads protein sequences from UniProt in FASTA format. It reads accession numbers from a text file (one per line) and tries to download the corresponding sequence for each one from the UniProt database. Here is the script:

import requests

with open ('testfasta.txt', 'r') as infile:
    lines = infile.readlines()
count = 0
for line in lines:
    count+=1
    line = line.strip()
    access_id = line
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'

    URL = url_part1+access_id+url_part2
              
    response = requests.get (URL)
              
    with open((access_id)+".fa", "wb") as txtFile:
        txtFile.write(response.content)

print ("Total sequences downloaded = ", count)

This works fine, but for hundreds of sequences it generates a large number of files. I would rather have every sequence written to a single file: the second below the first, the third below the second, and so on. A FASTA file is basically a plain-text file in which each sequence is preceded by a header line starting with ">", e.g.

>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh

and so on


Solution

  • Something like this? Just write them all to the same file.

    import requests
    
    count = 0  # stays 0 if the input file is empty
    with open('testfasta.txt', 'r') as infile, \
         open('results.fasta', 'w') as outfile:
        for count, line in enumerate(infile, 1):
            access_id = line.strip()
            response = requests.get(
                f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
            # check that fetch succeeded; raise error if not
            response.raise_for_status()
            # sanity checks: header present, record newline-terminated
            assert response.text.startswith('>')
            assert response.text.endswith('\n')
            outfile.write(response.text)
    
    print(f"Total sequences downloaded = {count}")
    

    This assumes that the data you fetch is newline-terminated and includes the FASTA header before the sequence itself. If that is not always true, replace the asserts with code that fixes any such problems. I also made various changes to make the script more idiomatic.
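
    If you would rather repair a record than fail on it, something along these lines might do. This is only a sketch: normalize_fasta is a made-up helper name, and treating a missing ">" header as an error (rather than inventing a header) is my assumption about what you want.

```python
def normalize_fasta(text: str) -> str:
    """Return text as a single FASTA record that is safe to append to a file.

    Hypothetical helper, not part of Requests or any UniProt client.
    """
    # Drop stray leading/trailing whitespace, including extra blank lines.
    record = text.strip()
    # A record without a ">" header cannot be repaired sensibly, so give up.
    if not record.startswith('>'):
        raise ValueError('text does not look like a FASTA record')
    # Guarantee exactly one trailing newline so the next header starts on its own line.
    return record + '\n'
```

    In the loop you would then call outfile.write(normalize_fasta(response.text)) instead of the two asserts and the plain write.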

    A minor complication is that the response.content you download is bytes, not text. You could decode it yourself, but Requests already does this for you and provides the result in response.text.
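
    To illustrate the bytes/text difference without touching the network, here is a small sketch. The record is made up (it is not real UniProt data), and decoding as UTF-8 is an assumption on my part; Requests picks the encoding for response.text itself.

```python
import os
import tempfile

# Made-up bytes standing in for response.content (not real UniProt data).
raw = b'>example_header\nMKTAYIAK\n'

# Decoding by hand yields a str, the same type that response.text gives you.
text = raw.decode('utf-8')

# Scratch file in a temporary directory so nothing in the working dir is touched.
path = os.path.join(tempfile.mkdtemp(), 'results.fasta')

# bytes must be written to a file opened in binary mode ...
with open(path, 'wb') as outfile:
    outfile.write(raw)

# ... while str needs text mode; 'a' appends below what is already in the file.
# newline='' stops Python from translating '\n' on platforms such as Windows.
with open(path, 'a', encoding='utf-8', newline='') as outfile:
    outfile.write(text)

# Read the file back as bytes to see both records, one below the other.
with open(path, 'rb') as infile:
    combined = infile.read()
```

    Either way the file ends up with the same contents; the only thing that matters is that the open() mode matches the type you write.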