I am trying to extract PubMed abstracts and their titles and place them in a data frame. With the help of Stack Overflow members, I was able to write the code below, which works. The issue now is that the abstracts variable has more rows than pmid or title, so I am unable to merge them correctly. Looking at the structure of the XML file, it appears that some abstracts have more than one AbstractText node, which is why they get extracted into more than one row. Any suggestions on how to overcome that and get each abstract in a single row, so I can merge the variables?
Here is my code:
library(XML)
library(httr)
library(glue)
library(dplyr)
####
query = 'asthma[mesh]+AND+eosinophils[mesh]+AND+2009[pdat]'
reqq = glue ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&RetMax=50&term={query}')
op = GET(reqq)
content(op)
df_op <- op %>% xml2::read_xml() %>% xml2::as_list()
pmids <- df_op$eSearchResult$IdList %>% unlist(use.names = FALSE)
reqq1 = glue("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={paste0(pmids, collapse = ',')}&rettype=abstract&retmode=xml")
op1 = GET(reqq1)
a = xmlParse(content(op1))
pmidd = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID', xmlValue))
title = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/ArticleTitle', xmlValue))
abstract = as.data.frame(xpathSApply(a, '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Abstract/AbstractText', xmlValue))
nrow(pmidd)
nrow(abstract)
Some articles have the abstract split into several sections (Objective, Methods, ...), some have just one entry, and some have no abstract at all. You'll have to handle all of these scenarios.
XML::xmlToList() can be used to extract a list from the XML data. We can then use purrr's map*() functions to flatten the data.
library(purrr)
b <- xmlToList(a)
res <- map_dfr(b, \(x) {
  abstract_l <- x$MedlineCitation$Article$Abstract
  if (is.null(abstract_l))
    abstract_l <- ""
  tibble(
    pmid = x$MedlineCitation$PMID$text,
    title = x$MedlineCitation$Article$ArticleTitle,
    abstract = ifelse(
      length(abstract_l) > 1,
      map_chr(abstract_l, \(y) y[[1]]) |> paste(collapse = "\n"),
      unlist(abstract_l)
    )
  )
})
res$abstract
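As an alternative sketch (not the approach used above), the same three scenarios can be handled by staying with XPath and querying the AbstractText nodes relative to each PubmedArticle node, so that every article yields exactly one row. The example below uses xml2 and a small made-up XML string in place of the real efetch response; the names xml_txt and res2 and the sample content are assumptions for illustration only.

# Hypothetical miniature of an efetch response (illustration only)
library(xml2)

xml_txt <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID Version="1">1</PMID><Article>
    <ArticleTitle>Sectioned abstract</ArticleTitle>
    <Abstract>
      <AbstractText Label="OBJECTIVE">Aim.</AbstractText>
      <AbstractText Label="METHODS">Method.</AbstractText>
    </Abstract>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">2</PMID><Article>
    <ArticleTitle>Single abstract</ArticleTitle>
    <Abstract><AbstractText>Only one entry.</AbstractText></Abstract>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">3</PMID><Article>
    <ArticleTitle>No abstract</ArticleTitle>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- read_xml(xml_txt)
articles <- xml_find_all(doc, "/PubmedArticleSet/PubmedArticle")

# One value per article: collapse all AbstractText nodes found under that
# article; an article without any yields an empty string.
abstracts <- vapply(seq_along(articles), function(i) {
  parts <- xml_find_all(articles[[i]], "./MedlineCitation/Article/Abstract/AbstractText")
  paste(xml_text(parts), collapse = "\n")
}, character(1))

res2 <- data.frame(
  pmid = xml_text(xml_find_first(articles, "./MedlineCitation/PMID")),
  title = xml_text(xml_find_first(articles, "./MedlineCitation/Article/ArticleTitle")),
  abstract = abstracts,
  stringsAsFactors = FALSE
)

Because every column is computed from the same set of article nodes, pmid, title and abstract always have the same length and no merge is needed afterwards.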