python, r, bioinformatics, bioconductor, protein-database

Retrieve a 13-mer peptide sequence from a UniProt ID and a specific residue


I have a list of UniProt IDs, each with a corresponding residue of interest (e.g. Q7TQ48_S442). I need to retrieve the +/-6 residues around that site within the protein sequence (in the example, the sequence I need would be DIEAEASEERQQE). Can you suggest a method to do this for a list of IDs and residues of interest using Python, R, or an existing web tool? Thanks, Emanuele


Solution

  • If I enter a list of protein IDs into UniProt at https://www.uniprot.org/uploadlists/, either by pasting them or by uploading a file, I get a table of results. At the top of the table there is an option for selecting the displayed columns, and one of the available columns is the protein sequence. No programming is needed so far: just upload the list of UniProt IDs you are interested in.

    The window around the site can then be extracted in R with the substr function, subtracting 6 from the site index for the start and adding 6 for the stop:

    len13seq <- with(uniprot_data, substr(peptide_sequence, start = ind - 6, stop = ind + 6 ))
    

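    where, in your example, ind = 442. Note that substr truncates at the ends of the string, so a site within 6 residues of either terminus yields a window shorter than 13 residues.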
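Since the question also mentions Python, the same windowing can be sketched there. This is a minimal illustration, not part of the original answer: `window_around` is a hypothetical helper name and the 20-residue string is toy data. The only subtlety is that UniProt positions are 1-based while Python slices are 0-based.

```python
def window_around(sequence, position, flank=6):
    """Return the residues from position - flank to position + flank.

    `position` is 1-based, as in UniProt site numbering. The window is
    truncated when the site lies within `flank` residues of a terminus.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    start = max(position - 1 - flank, 0)          # 0-based start, clamped at the N-terminus
    end = min(position + flank, len(sequence))    # exclusive end, clamped at the C-terminus
    return sequence[start:end]


# Toy 20-residue sequence (not a real protein), used only for illustration.
toy = "ACDEFGHIKLMNPQRSTVWY"
print(window_around(toy, 10))  # EFGHIKLMNPQRS (13 residues centred on L10)
print(window_around(toy, 3))   # ACDEFGHIK (truncated at the N-terminus)
```

With a real sequence, the call for the example in the question would be `window_around(sequence, 442)`.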

    To make this work you need to:

    1. Separate your tags into at least two columns: the UniProt ID and the site index. You can also keep the amino acid if you need it for later analyses.
    2. Create a file containing just the UniProt IDs and upload it to UniProt.
    3. Customize the displayed columns, making sure to include the sequence.
    4. Download the result and read it into R.
    5. Merge the original data frame (with the site index) with the downloaded results.
    6. Generate the sequence in the neighborhood of each site of interest.
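The steps above can also be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than the answer's own method: the column names `Entry` and `Sequence` are assumed to match the headers of the downloaded tab-separated file, the function names are hypothetical, and `TOY001` with its sequence is made-up data. Keeping the amino acid letter from the tag allows a sanity check that the downloaded sequence really has that residue at that position.

```python
import csv
import io


def parse_tag(tag):
    """Split a tag such as 'Q7TQ48_S442' into ('Q7TQ48', 'S', 442)."""
    accession, site = tag.split("_", 1)
    return accession, site[0], int(site[1:])


def load_sequences(tsv_handle, id_column="Entry", seq_column="Sequence"):
    """Read a tab-separated export into a dict of accession -> sequence.

    The default column names are assumptions; adjust them to the file header.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {row[id_column]: row[seq_column] for row in reader}


def peptides_for_tags(tags, sequences, flank=6):
    """Map each tag to its flanking peptide, or None if it cannot be resolved."""
    results = {}
    for tag in tags:
        accession, residue, position = parse_tag(tag)
        sequence = sequences.get(accession)
        if (
            sequence is None
            or not 1 <= position <= len(sequence)
            or sequence[position - 1] != residue  # residue in tag must match the sequence
        ):
            results[tag] = None
            continue
        results[tag] = sequence[max(position - 1 - flank, 0):position + flank]
    return results


# Toy stand-in for the downloaded table (hypothetical accession and sequence).
toy_tsv = io.StringIO("Entry\tSequence\nTOY001\tACDEFGHIKLMNPQRSTVWY\n")
sequences = load_sequences(toy_tsv)
results = peptides_for_tags(["TOY001_L10", "TOY001_A10", "MISSING_S5"], sequences)
for tag, peptide in results.items():
    print(tag, peptide)
```

Tags that map to `None` are worth inspecting by hand: a mismatch usually means the site was numbered against a different isoform or an older version of the entry.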

    It is possible to do this entirely within R. I did that at one point, but you probably only need it if the whole process has to be automated. In that case, I would suggest checking out the UniProt.ws package: https://www.bioconductor.org/packages/3.7/bioc/html/UniProt.ws.html. I don't use Bioconductor often, so I'm not familiar with the package. When I previously used R to get UniProt data, what I was after was not available in the tabular output, and I had to modify my code quite a bit to get to it. Hopefully, the Bioconductor solution is easier than what I did.
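If full automation is wanted from Python rather than R, one possible route is to download each entry as FASTA over HTTP. The sketch below is not taken from the answer above. It assumes the UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta` is available and unchanged, and the function names are hypothetical. Nothing is fetched when the block is loaded; only the FASTA parsing is exercised on a toy record.

```python
import urllib.request

# Assumed URL pattern for retrieving a single entry in FASTA format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_to_sequence(fasta_text):
    """Drop the '>' header line(s) and join the sequence lines of one record."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_sequence(accession, timeout=30):
    """Download one entry and return its sequence (requires network access)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return fasta_to_sequence(response.read().decode("utf-8"))


# Parsing check on a toy FASTA record (made-up header and sequence).
toy_fasta = ">sp|TOY001|TOY_TEST Toy protein\nACDEFGHIKL\nMNPQRSTVWY\n"
print(fasta_to_sequence(toy_fasta))  # ACDEFGHIKLMNPQRSTVWY
```

For the example in the question, `fetch_sequence("Q7TQ48")[442 - 1 - 6:442 + 6]` would give the 13-residue window, provided the entry is still served under that accession. For long ID lists, one request per ID is slow, so the batch upload described above is likely the better fit.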