rbioinformaticsdna-sequencerentrez

Rentrez is pulling the wrong data from NCBI in R?


I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following R code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:

search <- entrez_search(db = "biosample", 
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)

Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:

search <- entrez_search(db = "biosample", 
                        term = "SAMN30954130[ACCN]",
                        retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide",
                           id = search$ids,
                           rettype = "xml")
fetch_list <- xmlToList(fetch_test)

Solution

  • The issue here is that you are using a Biosample UID to query the Nucleotide database. However, the UID is then interpreted as a Nucleotide UID, so you get a sequence record unrelated to your original Biosample query.

    What you need to use in this situation is entrez_link, which uses a UID to link records between two databases.

    For example, your Biosample accession SAMN30954130 has the Biosample UID 30954130. You link that to Nucleotide like this:

    nuc_links <- entrez_link(dbfrom='biosample', id=30954130, db='nuccore')
    

    And you can get the corresponding Nucleotide UID(s) like this:

    nuc_links$links$biosample_nuccore
    
    [1] "2307876014"
    

    And then:

    fetch_test <- entrez_fetch(db = "nucleotide",
                               id = 2307876014,
                               rettype = "xml")
    

    This is covered in the section "Finding cross-references" of the rentrez tutorial.