I am looking for a way to retrieve FASTA files from UniProt by giving the protein's UniProt ID as input. My goal is to create a Google Colab notebook that builds a FASTA file for which I can specify the file name, the directory (in Google Drive) where it is saved, and the UniProt IDs in the format 1xUniProt1, 3xUniProt2, where the prefix (e.g. 3x) is the number of times I want that sequence in the FASTA file, with the sequences separated by a ':'.
I was thinking something like this:
In input:
Name = protein_sequences
Proteins = 2xUniprot1, 3xUniprot2, 1xUniprot3
Directory = FASTA_directory
In output:
Name of file = protein_sequences.fasta
FASTA file:
> protein_sequences
sequenceUniprot1:sequenceUniprot1:sequenceUniprot2:sequenceUniprot2:sequenceUniprot2:sequenceUniprot3
The main problem I have is that I am not sure how to fetch the sequences themselves from UniProt using Python. I don't know what the latest and most efficient way of doing this is.
It looks like UniProt has a REST API, so I would try to fetch the protein information from there: https://www.uniprot.org/help/programmatic_access
You need to make HTTP calls to this API. For that I recommend the httpx library. Its documentation should guide you through the process if you have never done anything like that.
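To make that concrete, here is a rough sketch of how the whole thing could look. It assumes each entry can be downloaded as FASTA from https://rest.uniprot.org/uniprotkb/<accession>.fasta and that every item in the input is written as <count>x<accession>. The function names (parse_spec, fasta_to_sequence, fetch_sequence, build_record, write_fasta) are made up for this example and are not part of UniProt or httpx.

```python
import re
from pathlib import Path

# Assumption: UniProtKB entries are served as FASTA at this URL pattern.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_spec(spec):
    """Turn '2xP69905, 3xP68871' into [(2, 'P69905'), (3, 'P68871')]."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        match = re.fullmatch(r"(\d+)x(\w+)", item)
        if match is None:
            raise ValueError(f"expected '<count>x<accession>', got {item!r}")
        pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def fasta_to_sequence(fasta_text):
    """Drop the '>' header line(s) and join the wrapped sequence lines."""
    lines = fasta_text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_sequence(accession):
    """Download one entry as FASTA and return only its sequence."""
    import httpx  # third-party; imported here so the helpers above work without it

    url = UNIPROT_FASTA_URL.format(accession=accession)
    response = httpx.get(url, follow_redirects=True, timeout=30.0)
    response.raise_for_status()  # fail loudly on unknown accessions
    return fasta_to_sequence(response.text)


def build_record(name, spec, fetch=fetch_sequence):
    """Build a single FASTA record with the sequences joined by ':'."""
    cache = {}  # fetch each accession only once, even if it appears in several items
    chains = []
    for count, accession in parse_spec(spec):
        if accession not in cache:
            cache[accession] = fetch(accession)
        chains.extend([cache[accession]] * count)
    return f">{name}\n{':'.join(chains)}\n"


def write_fasta(name, spec, directory, fetch=fetch_sequence):
    """Write <directory>/<name>.fasta and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.fasta"
    path.write_text(build_record(name, spec, fetch))
    return path
```

In Colab you would first mount Drive (from google.colab import drive; drive.mount('/content/drive')) and then call something like write_fasta("protein_sequences", "2xP69905, 3xP68871", "/content/drive/MyDrive/FASTA_directory"). The two accessions are only examples, and the Drive path depends on how your Drive is laid out. The fetch parameter is there so you can swap in a different download function or a stub when testing without network access.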