R: how to extract a PubMed abstract using rentrez


library(rentrez)
record <- entrez_fetch(db = "pubmed", id = 37757828, rettype = "xml", parsed = TRUE)

Given a PubMed ID, I would like to extract the abstract from the fetched record. Specifically, I want the text between the following tags:

<AbstractText>
...
</AbstractText>

I tried to parse it first using the following but got an error:

> library(XML)
> xmlParse(record)
Error in as.vector(x, "character") : 
  cannot coerce type 'externalptr' to vector of type 'character'

Solution

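  • With parsed = TRUE, entrez_fetch() already returns a parsed XML document (an XMLInternalDocument), which is why calling xmlParse() on it again fails. Skip that step and query the record directly with XPath: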

    library(XML)
    library(rentrez)
    
    record <- entrez_fetch(db = "pubmed", id = 37757828, rettype = "xml", parsed = TRUE)
    
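    # xmlValue returns the text content of each matching <AbstractText> node
    abstract_sections <- xpathSApply(record, "//AbstractText", xmlValue)
    
    if (length(abstract_sections) > 0) {
      # Take the first section; structured abstracts may contain several
      abstract_text <- abstract_sections[[1]]
      print(abstract_text)
    } else {
      print("No abstract found.")
    }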
    
    
    [1] "Autozygosity is associated with rare Mendelian disorders and clinically relevant quantitative traits. We investigated associations between the fraction of the genome in runs of homozygosity (FROH) and common diseases in Genes & Health (n = 23,978 British South Asians), UK Biobank (n = 397,184), and 23andMe. We show that restricting analysis to offspring of first cousins is an effective way of reducing confounding due to social/environmental correlates of FROH. Within this group in G&H+UK Biobank, we found experiment-wide significant associations between FROH and twelve common diseases. We replicated associations with type 2 diabetes (T2D) and post-traumatic stress disorder via within-sibling analysis in 23andMe (median n = 480,282). We estimated that autozygosity due to consanguinity accounts for 5%-18% of T2D cases among British Pakistanis. Our work highlights the possibility of widespread non-additive genetic effects on common diseases and has important implications for global populations with high rates of consanguinity."
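
The abstract above is a single block, so taking the first match is enough. Some PubMed records have structured abstracts, with several AbstractText elements that each carry a Label attribute. The following is a minimal sketch of how all sections might be collected and joined. The XML string and its contents are made up for illustration so the example runs without network access; with a real record, the object returned by entrez_fetch(..., parsed = TRUE) should be usable in place of doc.

```r
library(XML)

# Hypothetical structured abstract, inlined for illustration (not a real record)
xml_text <- '<PubmedArticle><Abstract>
  <AbstractText Label="BACKGROUND">First section.</AbstractText>
  <AbstractText Label="RESULTS">Second section.</AbstractText>
</Abstract></PubmedArticle>'
doc <- xmlParse(xml_text, asText = TRUE)

# Collect every AbstractText node rather than only the first one
nodes <- getNodeSet(doc, "//AbstractText")
sections <- vapply(nodes, xmlValue, character(1))

# Read the Label attribute of each node; NA when the attribute is absent
labels <- vapply(
  nodes,
  function(n) xmlGetAttr(n, "Label", default = NA_character_),
  character(1)
)

# Prefix each section with its label when one is present, then join them
pieces <- ifelse(is.na(labels), sections, paste0(labels, ": ", sections))
abstract_text <- paste(pieces, collapse = "\n")
cat(abstract_text, "\n")
```

For an unstructured abstract this yields the same text as the first-match approach, since there is only one section and no label to prepend.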