Tags: r, tidyverse, data-manipulation, stringr, wikifacts

How to extract birth and death year from string in R?


I have the first paragraph of Wikipedia articles from the wikifacts package (only for people). I would like to extract the birth year and the year of death.

library(wikifacts)
library(tidyverse)

politicians <- data.frame(
  Name = c("Barack Obama", "Angela Merkel", "Nelson Mandela", "Margaret Thatcher", "Mahatma Gandhi"),
  stringsAsFactors = FALSE
)

politicians <- politicians %>% 
  mutate(First_Paragraph = substr(wiki_define(Name), 1, 200)) 

head(politicians)


> head(politicians)
               Name
1      Barack Obama
2     Angela Merkel
3    Nelson Mandela
4 Margaret Thatcher
5    Mahatma Gandhi
                                                                                                                                                                                           First_Paragraph
1 Barack Hussein Obama II (born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. As a member of the Democratic Party, he was the first A
2  Angela Dorothea Merkel (German: [aŋˈɡɪːla doʁoˈteːa ˈmɛʁkl̩] ; née Kasner; born 17 July 1954) is a retired German politician who served as Chancellor of Germany from 2005 to 2021 and was the first wom
3  Nelson Rolihlahla Mandela ( man-DEH-lə; Xhosa: [xolíɬaɬa mandɛ̂ːla]; born Rolihlahla Mandela; 18 July 1918 – 5 December 2013) was a South African anti-apartheid activist, politician, and statesman who
4 Margaret Hilda Thatcher, Baroness Thatcher,  (née Roberts; 13 October 1925 – 8 April 2013) was a British stateswoman and Conservative politician who was Prime Minister of the United Kingdom from 1979 
5 Mohandas Karamchand Gandhi (ISO: Mōhanadāsa Karamacaṁda Gāṁdhī; 2 October 1869 – 30 January 1948) was an Indian lawyer, anti-colonial nationalist and political ethicist who employed nonviolent resista

I would like to extract the birth year and, if available, the year of death. Usually these are the first two four-digit numbers that appear, or the first two four-digit numbers inside the first pair of parentheses. I tried several regular-expression approaches to string extraction. What would be a nice and easy way, preferably in tidyverse style, to get the birth and death years?


Solution

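  • Rather than parsing the wikifacts text, an easier route is to fetch each Wikipedia page with rvest and read the <span> with the class bday. It holds the birth date in YYYY-MM-DD format, so it converts directly to a Date. The date of death is read from the hidden <span> elements in the infobox, when the page has one. The year can then be taken from either column, for example with lubridate::year():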

    library(tidyverse)
    library(rvest)
    
    politicians %>%
      rowwise() %>%
      # Download and parse each person's Wikipedia page
      mutate(doc = list("https://en.wikipedia.org/wiki/" %>% 
          paste0(Name %>% stringr::str_replace_all(' ', '_')) %>%
          read_html())) %>%
      ungroup() %>%
      # DOB: text of the first element with class "bday" (YYYY-MM-DD)
      mutate(DOB = map(doc, ~ .x %>%
                    html_element('.bday') %>%
                    html_text()) %>% unlist() %>% as.Date(),
             # DOD: hidden infobox spans; if there are exactly two,
             # characters 2-11 of the second are used as the date
             DOD = map(doc, function(x) {
                   vals <- x %>%
                    html_elements(xpath = paste0("//td[@class='infobox-data']/",
                                                 "span[@style='display:none']")) %>%
                    html_text()
                   if(length(vals) == 2) substr(vals[2], 2, 11) else NA_character_
                  }) %>% unlist() %>% as.Date()) %>%
      # Drop the parsed documents once the dates are extracted
      select(-doc)
    
    #> # A tibble: 5 x 3
    #>   Name              DOB        DOD       
    #>   <chr>             <date>     <date>    
    #> 1 Barack Obama      1961-08-04 NA        
    #> 2 Angela Merkel     1954-07-17 NA        
    #> 3 Nelson Mandela    1918-07-18 2013-12-05
    #> 4 Margaret Thatcher 1925-10-13 2013-04-08
    #> 5 Mahatma Gandhi    1869-10-02 1948-01-30
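
If you would rather stay with the text that wikifacts already returns, the rule from the question (the first two four-digit numbers inside the first pair of parentheses) can be sketched with stringr. This is only a minimal sketch, not a robust parser. The sample paragraphs are abbreviated from the question's output, with the pronunciation guides dropped, and the column names `paren`, `years`, `Birth_Year` and `Death_Year` are made up for illustration. It assumes the first parenthesised group contains no nested parentheses and lists the birth year before the death year.

```r
library(dplyr)
library(stringr)
library(purrr)

# Abbreviated sample paragraphs (assumption: shortened from the question's output)
paragraphs <- tibble(
  Name = c("Barack Obama", "Angela Merkel", "Nelson Mandela",
           "Margaret Thatcher", "Mahatma Gandhi"),
  First_Paragraph = c(
    "Barack Hussein Obama II (born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017.",
    "Angela Dorothea Merkel (n\u00e9e Kasner; born 17 July 1954) is a retired German politician who served as Chancellor of Germany from 2005 to 2021.",
    "Nelson Rolihlahla Mandela (born Rolihlahla Mandela; 18 July 1918 \u2013 5 December 2013) was a South African anti-apartheid activist.",
    "Margaret Hilda Thatcher, Baroness Thatcher, (n\u00e9e Roberts; 13 October 1925 \u2013 8 April 2013) was a British stateswoman.",
    "Mohandas Karamchand Gandhi (2 October 1869 \u2013 30 January 1948) was an Indian lawyer."
  )
)

result <- paragraphs %>%
  mutate(
    # First "(...)" group that has no nested parentheses
    paren = str_extract(First_Paragraph, "\\([^()]*\\)"),
    # Every standalone four-digit number inside that group
    years = str_extract_all(paren, "\\b\\d{4}\\b"),
    # First match is treated as the birth year, second (if any) as the death year;
    # indexing past the end of a character vector yields NA
    Birth_Year = as.integer(map_chr(years, ~ .x[1])),
    Death_Year = as.integer(map_chr(years, ~ .x[2]))
  ) %>%
  select(Name, Birth_Year, Death_Year)

result
```

Restricting the search to the first parenthesised group matters for entries such as Barack Obama, where the first two four-digit numbers in the whole paragraph would be 1961 and 2009 (the start of his presidency) rather than a birth and death year. Paragraphs whose first parentheses hold no years, or hold other four-digit numbers, will still need manual checking.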