r bioconductor ncbi rentrez

How to track which protein ID is linked to which gene ID with rentrez


I have a bunch of protein IDs and I want to fetch the corresponding coding sequences (CDSs) without losing the protein ID. I have managed to download the corresponding CDSs, but unfortunately, CDS IDs are very different from protein IDs in NCBI.

I have the following R code:

library(rentrez)
Prot_ids <- c("XP_012370245.1","XP_004866438.1","XP_013359583.1")
links <- entrez_link(dbfrom="protein", db="nuccore", id=Prot_ids, by_id = TRUE)

And then, I used this command to "match" protein IDs with CDS IDs:

lapply(links, function(x) x$links$protein_nuccore_mrna)

[[1]]
[1] "820968283"

[[2]]
[1] "861491027"

[[3]]
[1] "918634580"

However, as you can see, the argument 'by_id=TRUE' just makes a list of three elink objects, and I have now lost the protein IDs.

I would want something like:

Protein ID   XP_012370245.1   XP_004866438.1   XP_013359583.1
CDS ID       XM_012514791.1   XM_004866381.2   XM_013504129.1

Any suggestion is very welcome, thanks!!


Solution

  • library(rentrez)
    Prot_ids <- c("XP_012370245.1","XP_004866438.1","XP_013359583.1")
    # by_id = TRUE returns one elink object per input ID, in the same order as Prot_ids
    links <- entrez_link(dbfrom="protein", db="nuccore", id=Prot_ids, by_id = TRUE)
    # Collect the linked mRNA GIs per protein (a protein may have zero or several links)
    link_list <- lapply(links, function(x) x$links$protein_nuccore_mrna)
    linkids <- unlist(link_list)
    ## Get the summary for each GI record
    linkNuc <- entrez_summary(id = linkids, db = "nuccore")
    
    # Repeat each protein ID once per link it returned, so the rows stay aligned.
    # (Indexing with Prot_ids[rep(...)] would repeat only the first protein ID.)
    df <- data.frame(ProtIDs = rep(Prot_ids, times = lengths(link_list)),
                     linkids,
                     NucID=sapply(strsplit(sapply(linkNuc, "[[", "extra"), split = "\\|"), "[", 4))
    
    #                 ProtIDs   linkids          NucID
    #820968283 XP_012370245.1 820968283 XM_012514791.1
    #861491027 XP_004866438.1 861491027 XM_004866381.2
    #918634580 XP_013359583.1 918634580 XM_013504129.1
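
The pairing step can also be pulled out into a small helper and checked without touching NCBI. The sketch below is only an illustration: `link_table` is a hypothetical name, not part of rentrez, and `mock_links` is a hand-made stand-in that imitates the `x$links$protein_nuccore_mrna` shape used above, with made-up IDs. It assumes the list of links is in the same order as the query IDs.

```r
# Hypothetical helper (not a rentrez function): pair each query ID with every
# linked ID it returned. `links` is assumed to be a list in the same order as
# `ids`, where each element has a `links` list holding `link_name`.
link_table <- function(ids, links, link_name = "protein_nuccore_mrna") {
  # One character vector of linked IDs per query; character(0) if no link
  per_query <- lapply(links, function(x) as.character(x$links[[link_name]]))
  n <- lengths(per_query)
  data.frame(
    ProtID = rep(ids, times = n),  # repeat each query ID once per link
    linkid = as.character(unlist(per_query, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

# Made-up data imitating the structure of by_id = TRUE output (runs offline)
mock_ids <- c("P1", "P2", "P3")
mock_links <- list(
  list(links = list(protein_nuccore_mrna = "A1")),           # one link
  list(links = list(protein_nuccore_mrna = c("B1", "B2"))),  # two links
  list(links = list())                                       # no mRNA link
)

res <- link_table(mock_ids, mock_links)
print(res)
```

With real data the call would presumably be `link_table(Prot_ids, links)`, after which the accessions from `entrez_summary` can be added as a further column. Queries with no link simply contribute no rows, so the table never shifts out of alignment.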