I have a list of UniProt IDs, each with a corresponding residue of interest (e.g. Q7TQ48_S442). I need to retrieve the +/-6 residues around the specific site within the protein sequence (in the example, the sequence I need would be DIEAEASEERQQE). Can you suggest a method to do this for a list of IDs + residues of interest using Python, R, or an already available web tool? Thanks, Emanuele
If you enter a list of protein IDs into UniProt at https://www.uniprot.org/uploadlists/, either by pasting them or by uploading a file, you get a table of results. At the top of the table there is an option to select which columns are shown; one of them is the peptide sequence. No programming is needed so far: just upload the list of UniProt IDs you are interested in.
Now, to extract the specific sequence, this can be done in R using the substr command. Here, we'd want to add/subtract 6 from either end:
len13seq <- with(uniprot_data, substr(peptide_sequence, start = ind - 6, stop = ind + 6))
where in your example, ind = 442.
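Since the question also mentions Python, here is a minimal sketch of the same windowing step in that language. It assumes the sequences are already loaded into a dictionary keyed by accession, and that every label looks like `ACCESSION_<residue letter><position>` with a 1-based position. The function names and the toy accession below are made up for illustration. Unlike the one-liner above, it also checks that the residue at the given position matches the label.

```python
import re


def parse_site(site):
    """Split a label such as 'Q7TQ48_S442' into (accession, residue, position)."""
    # Assumed label format: accession, underscore, one-letter residue, 1-based position.
    match = re.fullmatch(r"([A-Za-z0-9-]+)_([A-Z])(\d+)", site)
    if match is None:
        raise ValueError(f"Unrecognised site label: {site!r}")
    accession, residue, position = match.groups()
    return accession, residue, int(position)


def flanking_window(sequence, position, flank=6):
    """Return the residues from position - flank to position + flank (1-based).

    The window is truncated when the site is closer than `flank` residues
    to either end of the protein.
    """
    start = max(position - 1 - flank, 0)        # convert to a 0-based slice start
    end = min(position + flank, len(sequence))  # slice end is exclusive
    return sequence[start:end]


def window_for_site(site, sequences, flank=6):
    """Look up the sequence for a site label and return its flanking window."""
    accession, residue, position = parse_site(site)
    sequence = sequences[accession]
    # Guard against a wrong isoform or an off-by-one position in the input list.
    if position > len(sequence) or sequence[position - 1] != residue:
        raise ValueError(f"{site}: residue {residue} not found at position {position}")
    return flanking_window(sequence, position, flank)


# Toy data only; 'P00000' is a placeholder, not a real UniProt entry.
toy_sequences = {"P00000": "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}
print(window_for_site("P00000_J10", toy_sequences))
```

For a real list, `toy_sequences` would be replaced by the accession-to-sequence mapping built from the downloaded UniProt table.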
To make this work you need to
It is possible to do this entirely within R. I did that at one point, but I'm not sure you need it unless the whole process has to be automated. If it does, I would suggest checking out https://www.bioconductor.org/packages/3.7/bioc/html/UniProt.ws.html. I don't use Bioconductor often, so I'm not familiar with the package. When I previously used R to get UniProt data, what I was after was not available in the tabular output, and I had to modify my code quite a bit to get to it. Hopefully, the Bioconductor solution is easier than what I did.
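If full automation is needed and Python is an option, the sequence download can also be scripted. The sketch below is not the UniProt.ws approach and not what I did in R. It assumes that UniProt serves a single entry in FASTA format at `https://rest.uniprot.org/uniprotkb/<accession>.fasta`; please check that URL pattern against the current UniProt documentation before relying on it. The function names are made up for illustration.

```python
from urllib.request import urlopen

# Assumed URL pattern for fetching one entry as FASTA; verify before use.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta_sequence(fasta_text):
    """Return the sequence of a single-record FASTA string, without the header."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_sequence(accession, timeout=30):
    """Download one entry from UniProt and return its sequence as a string."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta_sequence(response.read().decode("utf-8"))


def fetch_sequences(accessions):
    """Build an accession -> sequence dictionary, one request per accession."""
    return {accession: fetch_sequence(accession) for accession in accessions}
```

Calling `fetch_sequences(["Q7TQ48"])` would then give a dictionary from which the 13-residue window can be sliced. For long lists, a single batch download from the website is kinder to the server than one request per ID.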