html r xml dataframe pubmed

Extracting affiliation information from PubMed search string in R


I need some help extracting affiliation information from the results of a PubMed search string in R. I have already extracted affiliation information from the XML of a single PubMed ID, but I now have a search string with multiple terms and want to extract the affiliation information for every article it returns. The goal is a data frame with columns such as PMID, author, country, state, etc.

This is my code so far:

library(easyPubMed)

my_query <- "<PubMed search string>"  # placeholder; the real query is omitted
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")

The PubMed search string is very long, so I haven't included it here. The main aim is to produce a data frame from this search that clearly shows affiliation and other general information for the PubMed articles.

Any help would be greatly appreciated!


Solution

  • Have you tried the pubmedR package? https://cran.rstudio.com/web/packages/pubmedR/index.html

    library(pubmedR)
    library(purrr)
    library(tidyr)
    
    my_query <- '(((("diabetes mellitus"[MeSH Major Topic]) AND ("english"[Language])) AND (("2020/01/01"[Date - Create] : "3000"[Date - Create]))) AND ("coronavirus"[MeSH Major Topic])'
    
    my_request <- pmApiRequest(query = my_query,
                                limit = 5)
    

    You can use the built-in function my_pm_df <- pmApi2df(my_request), but this will not provide affiliations for all authors.

    You can use a combination of pluck() and map() from purrr to extract what you need into a tibble.

    auth <- pluck(my_request, "data") %>% {
      tibble(
        pmid = map_chr(., pluck, "MedlineCitation", "PMID", "text"),
        author_list = map(., pluck, "MedlineCitation", "Article", "AuthorList")
      )
      }
    

    All author data is contained in that nested list, in the Author$AffiliationInfo list (note it is a list because one author can have multiple affiliations).
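
    As a rough sketch of pulling every affiliation out of that nested structure, the example below uses a small hand-made list in place of one article's AuthorList. The mock data and its exact shape are assumptions for illustration (the real list returned by pmApiRequest may be organised a little differently), so treat it as a pattern rather than a drop-in solution:

    ```r
    library(purrr)
    library(tibble)
    library(dplyr)

    # Hand-made stand-in for one article's AuthorList (made-up names and
    # affiliations; the real pubmedR structure may differ in detail)
    mock_author_list <- list(
      Author = list(
        LastName = "Smith",
        ForeName = "Jane",
        AffiliationInfo = list(Affiliation = "Dept. of Medicine, Example University, Boston, MA, USA.")
      ),
      Author = list(
        LastName = "Doe",
        ForeName = "John",
        AffiliationInfo = list(Affiliation = "Example Institute, London, UK."),
        AffiliationInfo = list(Affiliation = "Example Hospital, London, UK.")
      )
    )

    # One row per author; multiple affiliations are collapsed with "; "
    authors <- bind_rows(map(unname(mock_author_list), function(a) {
      # An author can carry several AffiliationInfo entries, so select by name
      affils <- a[names(a) == "AffiliationInfo"]
      tibble(
        lastname  = pluck(a, "LastName", .default = NA_character_),
        firstname = pluck(a, "ForeName", .default = NA_character_),
        affil     = if (length(affils) == 0) {
          NA_character_
        } else {
          paste(map_chr(affils, pluck, "Affiliation", .default = NA_character_),
                collapse = "; ")
        }
      )
    }))
    ```

    Collapsing into one string keeps one row per author; if you would rather have one row per affiliation, store the affiliations as a list column and unnest() it instead.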

    ================================================= EDIT based on comments:

    First construct your request URLs. Make sure you replace &email with your email address:

    library(httr)
    library(xml2)
    library(dplyr)  # for bind_cols() further down
    
    mypmids <- c("32946812", "32921748", "32921727", "32921708", "32911500", 
                 "32894970", "32883566", "32880294", "32873658", "32856805",
                 "32856803", "32820143", "32810084", "32809963", "32798472")
    
    my_query <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=",
                       mypmids,
                       "&retmode=xml&email=MYEMAIL@MYDOMAIN.COM")
    

    I like to wrap my API requests in safely() to catch any errors, then use map() to loop through the my_query vector. Note that we Sys.sleep() for 5 seconds after each request to stay within PubMed's rate limit. You can probably cut this down to a second or less (NCBI documents a limit of 3 requests per second without an API key); check the E-utilities documentation for the current limits.

    get_safely <- safely(GET)
    
    my_req <- map(my_query, function(z) {
      print(z)
      req <- get_safely(url = z)
      Sys.sleep(5)
      return(req)
    })
    

    Next we extract the body of each response with content() and parse it with read_xml(). Note that we are parsing z$result, because safely() returns a list with result and error elements:

    my_resp <- map(my_req, function(z) {
      read_xml(content(z$result,
                       as = "text",
                       encoding = "UTF-8"))
    })
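
    Because each request was wrapped in safely(), every element of my_req is a list with a result and an error, and one of the two is NULL. If a request failed, z$result is NULL and the parsing step above will stop with an error, so it may be worth dropping failures first. A minimal sketch, using a made-up fake_get() in place of GET so that it runs without a network call:

    ```r
    library(purrr)

    # Stand-in for GET that fails on one input, so both outcomes are visible
    # (hypothetical helper, used here only to avoid a network call)
    fake_get <- function(url) {
      if (grepl("bad", url)) stop("request failed")
      paste("response for", url)
    }
    fake_get_safely <- safely(fake_get)

    reqs <- map(c("good-1", "bad-2", "good-3"), fake_get_safely)

    # Each element is list(result = ..., error = ...); flag the failures
    failed <- map_lgl(reqs, function(z) !is.null(z$error))

    # Keep only the successful results before parsing
    ok_results <- map(reqs[!failed], "result")
    ```

    In the real pipeline the same filter would be applied to my_req before calling content(). Note also that GET() does not throw an error for HTTP error statuses, so checking httr::http_error() on each result may be worthwhile too.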
    

    This can probably be cleaned up some, but it works. Coerce the AuthorList to a list and use a combination of map(), pluck() and unnest(). Note that a given author might have more than one affiliation, but I am only plucking the first one.

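    my_pm_list <- map(my_resp, function (z) {
      # PubmedArticleSet -> first PubmedArticle -> MedlineCitation
      my_xml <- xml_child(xml_child(z, 1), 1)
      # ".//" keeps the search relative to my_xml instead of the whole document
      pmid <- xml_text(xml_find_first(my_xml, ".//PMID"))
      authinfo <- as_list(xml_find_all(my_xml, ".//AuthorList"))
      return(list(pmid, authinfo))
    })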
    
    # Pull the first AuthorList out of each list(pmid, authinfo) pair
    myauthinfo <- map(my_pm_list, function(z) {
      z[[2]][[1]]
    })
    
    mytibble <- myauthinfo %>% {
      tibble(
        lastname = map_depth(., 2, pluck, "LastName", 1, .default = NA_character_),
        firstname = map_depth(., 2, pluck, "ForeName", 1, .default = NA_character_),
        affil = map_depth(., 2, pluck, "AffiliationInfo", "Affiliation", 1, .default = NA_character_)
      )
    }
    
    my_unnested_tibble <- mytibble %>%
      bind_cols(pmid = map_chr(my_pm_list, pluck, 1)) %>%
      unnest(c(lastname, firstname, affil))
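
    If you want every affiliation rather than just the first, one option is to query the XML directly with xml2 instead of going through as_list(). The sketch below is only an illustration: it runs on a small hand-written record shaped like an efetch response (made-up names and affiliations, not real data), and produces one row per author/affiliation pair:

    ```r
    library(xml2)
    library(tibble)
    library(dplyr)
    library(tidyr)

    # Minimal hand-written sample in the shape of an efetch PubMed record
    sample_xml <- read_xml('<PubmedArticleSet>
      <PubmedArticle>
        <MedlineCitation>
          <PMID Version="1">00000000</PMID>
          <Article>
            <AuthorList>
              <Author>
                <LastName>Smith</LastName>
                <ForeName>Jane</ForeName>
                <AffiliationInfo><Affiliation>Example University, Boston, MA, USA.</Affiliation></AffiliationInfo>
                <AffiliationInfo><Affiliation>Example Hospital, Boston, MA, USA.</Affiliation></AffiliationInfo>
              </Author>
              <Author>
                <LastName>Doe</LastName>
                <ForeName>John</ForeName>
              </Author>
            </AuthorList>
          </Article>
        </MedlineCitation>
      </PubmedArticle>
    </PubmedArticleSet>')

    pmid <- xml_text(xml_find_first(sample_xml, ".//PMID"))
    authors <- xml_find_all(sample_xml, ".//Author")

    all_affils <- lapply(authors, function(a) {
      tibble(
        pmid      = pmid,
        lastname  = xml_text(xml_find_first(a, "./LastName")),
        firstname = xml_text(xml_find_first(a, "./ForeName")),
        # list column: zero, one or many affiliations per author
        affil     = list(xml_text(xml_find_all(a, "./AffiliationInfo/Affiliation")))
      )
    }) %>%
      bind_rows() %>%
      # keep_empty = TRUE keeps authors without an affiliation (affil = NA)
      unnest(affil, keep_empty = TRUE)
    ```

    To use this on the real data, you would wrap the extraction in a function and map it over my_resp, then bind the per-article tibbles together.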