I have the first paragraph of Wikipedia articles (people only), retrieved with the wikifacts
package. I would like to extract the year of birth and the year of death.
library(wikifacts)
library(tidyverse)
politicians <- data.frame(
  Name = c("Barack Obama", "Angela Merkel", "Nelson Mandela", "Margaret Thatcher", "Mahatma Gandhi"),
  stringsAsFactors = FALSE
)
politicians <- politicians %>%
  mutate(First_Paragraph = substr(wiki_define(Name), 1, 200))
head(politicians)
Name
1 Barack Obama
2 Angela Merkel
3 Nelson Mandela
4 Margaret Thatcher
5 Mahatma Gandhi
First_Paragraph
1 Barack Hussein Obama II (born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. As a member of the Democratic Party, he was the first A
2 Angela Dorothea Merkel (German: [aŋˈɡɪːla doʁoˈteːa ˈmɛʁkl̩] ; née Kasner; born 17 July 1954) is a retired German politician who served as Chancellor of Germany from 2005 to 2021 and was the first wom
3 Nelson Rolihlahla Mandela ( man-DEH-lə; Xhosa: [xolíɬaɬa mandɛ̂ːla]; born Rolihlahla Mandela; 18 July 1918 – 5 December 2013) was a South African anti-apartheid activist, politician, and statesman who
4 Margaret Hilda Thatcher, Baroness Thatcher, (née Roberts; 13 October 1925 – 8 April 2013) was a British stateswoman and Conservative politician who was Prime Minister of the United Kingdom from 1979
5 Mohandas Karamchand Gandhi (ISO: Mōhanadāsa Karamacaṁda Gāṁdhī; 2 October 1869 – 30 January 1948) was an Indian lawyer, anti-colonial nationalist and political ethicist who employed nonviolent resista
I would like to extract the birth year and, if available, the year of death. Usually these are the first two four-digit numbers that appear, or the first two four-digit numbers within the first pair of parentheses. I have tried several regular expressions for string extraction. What would be a nice and easy way, preferably in tidyverse logic, to get the birth and death year?
An easy way to do this, rather than using wikifacts, is to use rvest to fetch each Wikipedia page and look up the <span> with the class bday. This contains a YYYY-MM-DD formatted date, so it's easy to convert to a date object:
library(tidyverse)
library(rvest)
politicians %>%
  rowwise() %>%
  # Download and parse each person's Wikipedia page
  mutate(doc = list("https://en.wikipedia.org/wiki/" %>%
                      paste0(Name %>% stringr::str_replace_all(' ', '_')) %>%
                      read_html())) %>%
  ungroup() %>%
  # Date of birth: text of the first element with class "bday"
  mutate(DOB = map(doc, ~ .x %>%
                     html_element('.bday') %>%
                     html_text()) %>% unlist() %>% as.Date(),
         # Date of death: taken from the second hidden infobox span, if there is one
         DOD = map(doc, function(x) {
           vals <- x %>%
             html_elements(xpath = paste0("//td[@class='infobox-data']/",
                                          "span[@style='display:none']")) %>%
             html_text()
           # Characters 2 to 11 of the hidden text are used as the YYYY-MM-DD date
           if(length(vals) == 2) substr(vals[2], 2, 11) else NA_character_
         }) %>% unlist() %>% as.Date()) %>%
  select(-doc)
#> # A tibble: 5 x 3
#> Name DOB DOD
#> <chr> <date> <date>
#> 1 Barack Obama 1961-08-04 NA
#> 2 Angela Merkel 1954-07-17 NA
#> 3 Nelson Mandela 1918-07-18 2013-12-05
#> 4 Margaret Thatcher 1925-10-13 2013-04-08
#> 5 Mahatma Gandhi 1869-10-02 1948-01-30
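If you would rather stay with the text you already have, the rule described in the question (the first two four-digit numbers inside the first pair of parentheses) can be sketched with stringr. This is only a rough sketch: the two strings are shortened copies of the paragraphs shown in the question, the column names in_parens, found, birth_year and death_year are made up for illustration, and it assumes the first parenthetical contains no nested parentheses and no other four-digit number ahead of the birth year.

```r
library(dplyr)
library(stringr)

# Shortened copies of two First_Paragraph values from the question
paragraphs <- tibble(
  Name = c("Barack Obama", "Nelson Mandela"),
  First_Paragraph = c(
    "Barack Hussein Obama II (born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017.",
    "Nelson Rolihlahla Mandela (born Rolihlahla Mandela; 18 July 1918 – 5 December 2013) was a South African anti-apartheid activist"
  )
)

years <- paragraphs %>%
  mutate(
    # Text of the first "(...)" group; assumes no nested parentheses
    in_parens = str_extract(First_Paragraph, "\\([^)]*\\)"),
    # Every standalone four-digit number inside that group, one list entry per row
    found = str_extract_all(in_parens, "\\b\\d{4}\\b"),
    # First match is treated as the birth year, second (if any) as the death year;
    # indexing past the end of a vector yields NA, so living people get NA
    birth_year = as.integer(sapply(found, `[`, 1)),
    death_year = as.integer(sapply(found, `[`, 2))
  ) %>%
  select(Name, birth_year, death_year)

print(years)
```

Restricting the search to the parenthetical is what keeps years such as 2009 and 2017 in Obama's paragraph from being picked up as a death year. It is more fragile than reading the bday span, since it depends on how each opening sentence happens to be written.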