r ncbi rentrez

.attrs and repetitive entries in an R list


I am trying to fetch some info from NCBI using this R script:

require(rentrez)
require(magrittr)
require(XML)  # provides xmlParse() and xmlToList(), which are used below
rs = "rs16891982"
rss = c("rs16891982", "rs12203592", "rs1408799", "rs10756819", "rs35264875", "rs1393350", "rs12821256", "rs17128291", "rs1800407", "rs12913832", "rs1805008", "rs4911414")
# given a rs number, return chr, bp, allele and gene name
annotateGeneName = function(rs) {
    anno = rentrez::entrez_search(db = "snp", term = rs) %>%
           "[["("ids")                                   %>%
           rentrez::entrez_summary(db = "snp", id = .)
           if(length(anno) < 1) {
               warning(sprintf("%s not found in dbSNP!", rs))
               return(invisible(NULL))
           }
           # there might be multiple entries
           # if "snp_id" is not in the list, then
           # it means multiple SNPs have been returned for this search
           # just take the first hit
           if(! "snp_id" %in% names(anno)) {
               anno = anno[[1]]
           }
    chrpos = anno[["chrpos"]]
    EA     = anno$allele_origin %>% gsub("\\(.*", "", .)
    fEA    = anno$global_maf %>% gsub("/.*", "", .) %>% gsub("^.*=", "", .)
    genes  = dplyr::first(anno$genes, default = NA)
    res = data.frame(snp = rs, chrpos = chrpos, EA = EA, fEA = fEA, genes = genes)
    res
}
annotateGeneNames = function(rss) {
    do.call(rbind, lapply(rss, annotateGeneName))
}
ids = rentrez::entrez_search(db = "snp", term = rs) %>% "[["("ids")
x = rentrez::entrez_fetch(db = "snp", id = ids[1], rettype="xml")
snp1xml = xmlParse(x)
snp1list = xmlToList(snp1xml)
print(snp1list)

When you print the result out, you can see things like:

...
$Rs$Sequence$.attrs
     exemplarSs ancestralAllele 
    "285153617"   "C,C,C,C,C,C" 


$Rs$Ss$.attrs
        ssId       handle      batchId     locSnpId  subSnpClass       orient 
  "23456916"   "PERLEGEN"      "12309" "afd3693051"        "snp"    "forward" 
      strand      molType      buildId  methodClass    validated 
    "bottom"    "genomic"        "123"  "hybridize" "by-cluster" 


$Rs$Ss$.attrs
                          ssId                         handle 
                    "28510204"              "MGC_GENOME_DIFF" 
                       batchId                       locSnpId 
                       "12314" "BC064405x37550355-C16403799G" 
                   subSnpClass                         orient 
                         "snp"                      "forward" 
                        strand                        molType 
                      "bottom"                         "cDNA" 
                       buildId                    methodClass 
                         "126"                     "computed" 


$Rs$Ss
$Rs$Ss$Sequence
$Rs$Ss$Sequence$Seq5
[1] "TTCCCTTTCATTTTCCAGAGAAACTTGATCAGGAACCCACTGATTCCAAGAGCAAAGTAATCAGTGAGGAAATGACACCTAGAATTCATGATGAAAAAAGGATGCTTTATATGGTCCTTTTTAAGGTGATAGTTTTTCCTGACGTCCATAGATTTATTAAGAATCTGGTATTTTAAACAGTAGGAAATACACATAGAAATATCAAATCCAAGTTGTGCTAGACCAGAAACTTTTAGAAGACATCCTTAGGAGAGAGAAAGACTTACAAGAATAAAGTGAGGAAAACACGGAGTTGATGCA"

$Rs$Ss$.attrs

$Rs$Ss$Sequence
$Rs$Ss$Sequence$Seq5
[1] "AAGACATCCTTAGGAGAGAGAAAGACTTACAAGAATAAAGTGAGGAAAACACGGAGTTGATGCA"

$Rs$Assembly$Component$MapLoc$FxnSet
      geneId       symbol      mrnaAcc      mrnaVer      protAcc      protVer 
     "51151"    "SLC45A2"  "NM_016180"          "4"  "NP_057264"          "3" 
    fxnClass readingFrame       allele      residue   aaPosition 
 "reference"          "3"          "C"          "F"        "373" 

$Rs$Assembly$Component$MapLoc$FxnSet
                geneId                 symbol                mrnaAcc 
               "51151"              "SLC45A2"            "NM_016180" 
               mrnaVer                protAcc                protVer 
                   "4"            "NP_057264"                    "3" 
              fxnClass           readingFrame                 allele 
            "missense"                    "3"                    "G" 
               residue             aaPosition                 soTerm 
                   "L"                  "373" "non_synonymous_codon" 

There are a lot of .attrs entries in this list, and they are often repetitive. There are also other repetitive entries, such as:

$Rs$Ss$Sequence$Seq5
$Rs$Assembly$Component$MapLoc$FxnSet

etc.

What does .attrs mean, and how do I make sense of this data? I was not aware that a list could have two entries with the same name.


Solution

  • In R, `attributes` and `attr` are functions that assign or extract attributes, but as far as I can tell `.attrs` is just a list element name. Its meaning is whatever the package authors decided it should mean; it is created when your code parses the XML and converts it to an R list (here, with xmlToList). It's not part of the definition of R, so read the documentation for that function.

    I now see that you are bothered by list items having identical names. That is possible in R. When given a name, "[" and "[[" retrieve only the first item that matches it. To reach the other items, access needs to be numeric, or mediated by lapply or sapply, which traverse the top level of the list and so avoid the ambiguity.

    > mylist=vector("list", length=2)
    > mylist
    [[1]]
    NULL
    
    [[2]]
    NULL
    
    > names(mylist) <- c("a","a")
    > mylist
    $a
    NULL
    
    $a
    NULL
    
    > mylist[['a']]
    NULL
    > mylist['a']
    $a
    NULL
    
    > lapply( mylist , "[[", "a")
    $a
    NULL
    
    $a
    NULL
    

    (I also do not see that either of those function definitions, annotateGeneName or annotateGeneNames, is called anywhere in the code that fetches and parses the data.)
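As far as I know, the XML package's xmlToList stores each XML element's attributes as a named character vector under `.attrs`, which would explain why every repeated `Ss` element carries its own copy. The sketch below does not call NCBI. It builds a small hypothetical stand-in for `snp1list$Rs` (attribute values copied from the printed output in the question, sequences replaced by short placeholders) and collects the `.attrs` of all `Ss` entries into one data frame. The names `rs_node`, `ss_attrs` and `ss_table` are illustrative only.

```r
# Hypothetical stand-in for snp1list$Rs: two "Ss" children, each with
# its XML attributes stored in a ".attrs" named character vector
rs_node <- list(
  Ss = list(
    Sequence = list(Seq5 = "TTCC"),  # placeholder sequence
    .attrs = c(ssId = "23456916", handle = "PERLEGEN", validated = "by-cluster")
  ),
  Ss = list(
    Sequence = list(Seq5 = "AAGA"),  # placeholder sequence
    .attrs = c(ssId = "28510204", handle = "MGC_GENOME_DIFF")
  )
)

# Keep every child called "Ss", then pull out each one's .attrs vector
ss_nodes <- rs_node[names(rs_node) == "Ss"]
ss_attrs <- lapply(unname(ss_nodes), function(node) node[[".attrs"]])

# Attribute sets can differ between entries, so align on the union of names;
# attributes missing from an entry become NA
all_fields <- unique(unlist(lapply(ss_attrs, names)))
ss_table <- do.call(rbind, lapply(ss_attrs, function(a) {
  row <- setNames(a[all_fields], all_fields)
  as.data.frame(as.list(row), stringsAsFactors = FALSE)
}))
rownames(ss_table) <- NULL

print(ss_table)
```

Each row then corresponds to one `Ss` entry and each column to one XML attribute, which is usually easier to read than the nested printout.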