I wrote a Python script that downloads protein sequences from UniProt in FASTA format. It reads accession numbers from a text file (one per line) and then tries to download the corresponding sequence from the UniProt database. Here is the script:
import requests
with open ('testfasta.txt', 'r') as infile:
    lines = infile.readlines()
count = 0
for line in lines:
    count+=1
    line = line.strip()
    access_id = line
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1+access_id+url_part2
    response = requests.get (URL)
    with open((access_id)+".fa", "wb") as txtFile:
        txtFile.write(response.content)
print ("Total sequences downloaded = ", count)
This works fine, but for hundreds of sequences it generates a large number of files. It would be better to have each incoming sequence written below the previous one in a single file: the second after the first, and so on. A FASTA file is basically a text file in which each sequence is preceded by a header line starting with ">", e.g.
>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh
and so on
Something like this? Just write them all to the same file.
import requests
with open('testfasta.txt', 'r') as infile, \
        open('results.fasta', 'w') as outfile:
    for count, line in enumerate(infile, 1):
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        # check that fetch succeeded; raise error if not
        response.raise_for_status()
        assert(response.text.startswith('>'))
        assert(response.text.endswith('\n'))
        outfile.write(response.text)
print (f"Total sequences downloaded = {count}")
This assumes that the data you fetch is newline-terminated, and includes the FASTA header before the sequence itself. If that's not necessarily always true, maybe replace the assert statements with code to fix any such problems. I also made various changes to make the script more idiomatic.
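If you do want to repair records rather than abort on them, something along these lines might work. This is only a sketch: normalize_fasta is a made-up helper name, and using the accession ID as a substitute header is my assumption about what a sensible fallback would be, not anything UniProt prescribes.

```python
def normalize_fasta(text: str, access_id: str) -> str:
    """Return text as a single well-formed FASTA record (hypothetical helper)."""
    text = text.lstrip()
    if not text:
        # Nothing to repair; let the caller decide how to handle it.
        raise ValueError(f"empty response for {access_id}")
    if not text.startswith(">"):
        # Assumption: fall back to the accession ID as the header line.
        text = f">{access_id}\n{text}"
    if not text.endswith("\n"):
        # Make sure the next record starts on its own line.
        text += "\n"
    return text
```

You would then call outfile.write(normalize_fasta(response.text, access_id)) in place of the two assert lines and the plain write.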
A slight complication is that the response.content you download is not text, but bytes. You could decode it if you wanted to, but of course, Requests already does this for you, and provides the result in response.text
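To illustrate the bytes versus text distinction without hitting the network, here is a small sketch. The record is made up, raw merely stands in for response.content, and UTF-8 is an assumption on my part (Requests picks the encoding for response.text from the response itself).

```python
import io

# Made-up record standing in for response.content (bytes).
raw = b">made_up_header\nMKTAYIAK\n"

# Decoding by hand; response.text does the equivalent for you.
# Assumption: the payload is UTF-8.
text = raw.decode("utf-8")

# Bytes go to a file opened in binary mode ("wb") ...
binary_out = io.BytesIO()
binary_out.write(raw)

# ... while str goes to a file opened in text mode ("w").
text_out = io.StringIO()
text_out.write(text)
```

Either route ends up with the same characters in the file; the point is just to match the data type to the mode the output file was opened in.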