rbioconductorncbirentrez

How to keep Protein ID when retrieving coding sequences with rentrez


I have a bunch of protein IDs and I need to retrieve the corresponding coding sequences (CDSs). I have managed to retrieve the CDSs but the names of each sequence change from XP* to XM*, and I need to retain the XP* header for each sequence.

Basically it looks like this:

library(renters)
search1 <- entrez_search(db="protein", term="XP_012370245[Accn]")
links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])

And the output looks like this:

> rec
[1] ">XM_012514791.1 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
\nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

Is there a way to keep the protein id (XP_012370245) instead of the nucleotide id (XM_012514791.1)? Something like:

> rec
[1] ">XP_012370245
 \nATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG\nCCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG (...)

I have done this with BioMart R package (biomaRt) but with rentrez seems to be more difficult.

Any suggestion is very welcome, thanks!


Solution

  • This is an answer from David Winter from the rentrez github repository:

    I don't think there is any way to do this on the NCBI end (nucleotide records have nucleotide IDs).

    You can probably cook something up to replace the IDs through stringr or the base string manipulation functions. This works for your example (though you probably want to add some checks to make sure you are actually findings CDS sequences etc):

    fetch_cds <- function(prot_acc){
        search1 <- entrez_search(db="protein", term=paste0(prot_acc, "[Accn]"))
        links <- entrez_link(dbfrom="protein", db="nuccore", id=search1$ids)
        rec <- entrez_fetch(db="nuccore", rettype="fasta", id=links$links$protein_nuccore_mrna[1])
        sub("XM_\\d+\\.\\d", prot_acc, rec)    
    }
    
    cat(substr(fetch_cds("XP_012370245"), 1, 500), "\n")
    
    >XP_012370245 PREDICTED: Octodon degus Hermansky-Pudlak syndrome 6 (Hps6), mRNA
    ATGCAGAGGAAAAACTTTATGTCATTCTTCACAGGCTTCCTGGAGAAGCGGGCCTGGCCGGAGGCCCGCG
    CCGCGCCGCTGGACGCCTTCTTCCTGGCGTGGCCGGCGCAGCCCGCGGTGGCGCTGGTGTGGGAGAGCGG
    CCTGGCGGAGCTCTGGGGTGCCGACCTGGGGTCCGCCTGGAGGCGGCTTCACGCCACCGAACTGTGTCCG
    CGCGGCGCAGTCCGCGTGGTGGCAGCGGTGGCGCCGCGGGGCCGCCTGGTGTGGTGCGAGGAGCGCCCGG
    GCGCGGGCGGACGCCGCGTGTGCGTCCGCACCCTAGAGCCTGGCGGCGAGACTGGTGCCCGCCTGGGCCG
    CACGCACGTCCTGCTGCACCACTGCCCGCCCTTCCACCTGCTGGCCTCGCGCAAGGACGTCTTCC