I am trying to fetch some info from NCBI using this R script:
require(rentrez)
require(magrittr)
require(XML)  # provides xmlParse() and xmlToList(), used further down
rs = "rs16891982"
rss = c("rs16891982", "rs12203592", "rs1408799", "rs10756819", "rs35264875", "rs1393350", "rs12821256", "rs17128291", "rs1800407", "rs12913832", "rs1805008", "rs4911414")
# given a rs number, return chr, bp, allele and gene name
annotateGeneName = function(rs) {
  anno = rentrez::entrez_search(db = "snp", term = rs) %>%
    "[["("ids") %>%
    rentrez::entrez_summary(db = "snp", id = .)
  if(length(anno) < 1) {
    warning(sprintf("%s not found in dbSNP!", rs))
    return(invisible(NULL))
  }
  # there might be multiple entries
  # if "snp_id" is not in the list, then
  # it means multiple SNPs have been returned for this search
  # just take the first hit
  if(! "snp_id" %in% names(anno)) {
    anno = anno[[1]]
  }
  chrpos = anno[["chrpos"]]
  EA = anno$allele_origin %>% gsub("\\(.*", "", .)
  fEA = anno$global_maf %>% gsub("/.*", "", .) %>% gsub("^.*=", "", .)
  genes = dplyr::first(anno$genes, default = NA)
  res = data.frame(snp = rs, chrpos = chrpos, EA = EA, fEA = fEA, genes = genes)
  res
}
annotateGeneNames = function(rss) {
  do.call(rbind, lapply(rss, annotateGeneName))
}
ids = rentrez::entrez_search(db = "snp", term = rs) %>% "[["("ids")
x = rentrez::entrez_fetch(db = "snp", id = ids[1], rettype="xml")
snp1xml = xmlParse(x)
snp1list = xmlToList(snp1xml)
print(snp1list)
When you print the result out, you can see things like:
...
$Rs$Sequence$.attrs
exemplarSs ancestralAllele
"285153617" "C,C,C,C,C,C"
$Rs$Ss$.attrs
ssId handle batchId locSnpId subSnpClass orient
"23456916" "PERLEGEN" "12309" "afd3693051" "snp" "forward"
strand molType buildId methodClass validated
"bottom" "genomic" "123" "hybridize" "by-cluster"
$Rs$Ss$.attrs
ssId handle
"28510204" "MGC_GENOME_DIFF"
batchId locSnpId
"12314" "BC064405x37550355-C16403799G"
subSnpClass orient
"snp" "forward"
strand molType
"bottom" "cDNA"
buildId methodClass
"126" "computed"
$Rs$Ss
$Rs$Ss$Sequence
$Rs$Ss$Sequence$Seq5
[1] "TTCCCTTTCATTTTCCAGAGAAACTTGATCAGGAACCCACTGATTCCAAGAGCAAAGTAATCAGTGAGGAAATGACACCTAGAATTCATGATGAAAAAAGGATGCTTTATATGGTCCTTTTTAAGGTGATAGTTTTTCCTGACGTCCATAGATTTATTAAGAATCTGGTATTTTAAACAGTAGGAAATACACATAGAAATATCAAATCCAAGTTGTGCTAGACCAGAAACTTTTAGAAGACATCCTTAGGAGAGAGAAAGACTTACAAGAATAAAGTGAGGAAAACACGGAGTTGATGCA"
$Rs$Ss$.attrs
$Rs$Ss$Sequence
$Rs$Ss$Sequence$Seq5
[1] "AAGACATCCTTAGGAGAGAGAAAGACTTACAAGAATAAAGTGAGGAAAACACGGAGTTGATGCA"
$Rs$Assembly$Component$MapLoc$FxnSet
geneId symbol mrnaAcc mrnaVer protAcc protVer
"51151" "SLC45A2" "NM_016180" "4" "NP_057264" "3"
fxnClass readingFrame allele residue aaPosition
"reference" "3" "C" "F" "373"
$Rs$Assembly$Component$MapLoc$FxnSet
geneId symbol mrnaAcc
"51151" "SLC45A2" "NM_016180"
mrnaVer protAcc protVer
"4" "NP_057264" "3"
fxnClass readingFrame allele
"missense" "3" "G"
residue aaPosition soTerm
"L" "373" "non_synonymous_codon"
There are a lot of .attrs entries in this list, and they are often repetitive. There are also other repetitive entries, such as:
$Rs$Ss$Sequence$Seq5
$Rs$Assembly$Component$MapLoc$FxnSet
etc.
What does .attrs mean, and how do I make sense of this data? I did not know that one list could hold two entries with the same name.
In R, `attributes` and `attr` are functions that assign or extract attributes, but as far as I can tell `.attrs` is just a list element name. Its meaning is essentially whatever the package authors decided it should mean once their code had finished parsing the XML and converting it to an R list. It is not part of the definition of R, so read the package documentation.
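As far as I can tell from the XML package, xmlToList() collects the attributes of each XML element into an element named `.attrs`, which is an ordinary named character vector. Here is a minimal base R sketch of reading one; the list `ss` is a hand-built stand-in (an assumption, not real parser output) with values copied from the printout in the question.

```r
# Hand-built stand-in for one $Rs$Ss entry; NOT produced by xmlToList() here.
ss <- list(
  .attrs = c(ssId = "23456916", handle = "PERLEGEN", molType = "genomic")
)

# ".attrs" is just an element name, so the usual list accessors work.
attrs <- ss[[".attrs"]]
is_named_chr <- is.character(attrs) && !is.null(names(attrs))

# Pull a single attribute out by name.
handle <- attrs[["handle"]]

# Turn the whole attribute vector into a one-row data frame.
attrs_df <- as.data.frame(as.list(attrs))

print(handle)
print(attrs_df)
```

With the real parsed object, the same accessors should apply to any node that carries a `.attrs` element.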
I now see that you are bothered by list items having identical names. That is possible in R. The "[" and "[[" operators will retrieve only the first item in the tree that matches a name. To reach the others, access needs to be numeric or mediated by lapply or sapply, functions that traverse the upper level of the tree and so avoid the ambiguity.
> mylist=vector("list", length=2)
> mylist
[[1]]
NULL
[[2]]
NULL
> names(mylist) <- c("a","a")
> mylist
$a
NULL
$a
NULL
> mylist[['a']]
NULL
> mylist['a']
$a
NULL
> lapply( mylist , "[[", "a")
$a
NULL
$a
NULL
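To make the same point with non-NULL values, here is a small base R sketch that collects every element sharing a name instead of only the first. The list `rs_node` is a hand-built stand-in for part of the parsed result (an assumption for illustration), with values copied from the printout in the question.

```r
# Hand-built stand-in with two elements that share the name "Ss".
rs_node <- list(
  Ss = list(.attrs = c(ssId = "23456916", handle = "PERLEGEN")),
  Ss = list(.attrs = c(ssId = "28510204", handle = "MGC_GENOME_DIFF"))
)

# Access by name reaches only the first match.
first_handle <- rs_node[["Ss"]]$.attrs[["handle"]]

# Numeric access reaches a specific duplicate.
second_handle <- rs_node[[2]]$.attrs[["handle"]]

# A logical index on names() selects all duplicates at once.
all_ss <- rs_node[names(rs_node) == "Ss"]
handles <- unname(vapply(all_ss, function(ss) ss$.attrs[["handle"]], character(1)))

print(handles)
```

Selecting with `names(x) == "Ss"` keeps the code independent of where the duplicates sit in the list, which helps when the number of repeated entries varies from one SNP to the next.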
(I also do not see that either of those function definitions, annotateGeneName or annotateGeneNames, gets used in the process of extracting and processing that data.)