I want to use R to scrape Google Scholar search results for authors who do not have a public profile.
One challenge is that each request returns at most 10 results, so an author with many publications requires several near-identical lines of code. I'd like to iterate where possible.
Here's the 'manual' version:
packages <- c("rvest", "xml2", "curl", "data.table")
lapply(packages, library, character.only = TRUE)
#set up searches, with a "start" counter for the successive pages of results:
url_name0 <- 'http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998'
url_name1 <- 'http://scholar.google.com/scholar?start=10&q=author:"pi+campbell"&as_ylo=1998'
url_name2 <- 'http://scholar.google.com/scholar?start=20&q=author:"pi+campbell"&as_ylo=1998'
url_name3 <- 'http://scholar.google.com/scholar?start=30&q=author:"pi+campbell"&as_ylo=1998'
#scrape
wp0 <- xml2::read_html(url_name0)
wp1 <- xml2::read_html(url_name1)
wp2 <- xml2::read_html(url_name2)
wp3 <- xml2::read_html(url_name3)
# Extract raw data (titles)
titles0 <- rvest::html_text(rvest::html_nodes(wp0, '.gs_rt'))
titles1 <- rvest::html_text(rvest::html_nodes(wp1, '.gs_rt'))
titles2 <- rvest::html_text(rvest::html_nodes(wp2, '.gs_rt'))
titles3 <- rvest::html_text(rvest::html_nodes(wp3, '.gs_rt'))
Now, for the latter section, it should be possible to iterate. I've tried a for-loop:
counter <- 0:3
titles <- vector("list", length(counter))
for (i in seq_along(counter)) {
  titles[[i]] <- rvest::html_text(rvest::html_nodes(wp[[i]], '.gs_rt'))
}
But this yields an error:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "NULL"
(At an earlier stage I was getting a different error:)
Error in html_elements(...) : object 'wp' not found
There are subsequent operations that have a similar structure -- so if I can figure out iteration for this one, I should be able to write some more efficient code...
You are currently storing the html_document objects returned by read_html() in individual objects (wp0, wp1, ...), but in your loop you attempt to access a wp list that you have not set up or filled yet.
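A minimal sketch of that fix, assuming the same four search URLs as in the question: build the URLs from the start offsets, read them into a list, then iterate over that list. The helper names extract_titles() and scrape_titles() are made up for illustration, and the actual request is left commented out because Google Scholar may throttle or block automated requests.

```r
library(rvest)

# Build the search URLs from the "start" offsets rather than typing each one
starts <- seq(0L, 30L, by = 10L)
urls <- sprintf(
  'http://scholar.google.com/scholar?start=%d&q=author:"pi+campbell"&as_ylo=1998',
  starts
)

# Hypothetical helper: title text from one parsed results page
extract_titles <- function(page) {
  html_text(html_elements(page, ".gs_rt"))
}

# Hypothetical helper: read every URL into a list, then extract from each element
scrape_titles <- function(urls) {
  wp <- lapply(urls, read_html)  # wp is a filled list here, so wp[[i]] is valid
  lapply(wp, extract_titles)
}

# titles <- scrape_titles(urls)  # uncomment to perform the actual requests
```

With wp created by lapply(), the original for-loop over wp[[i]] would also work; the second lapply() simply replaces it.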
Perhaps try something a bit less manual, e.g. navigating through the pages with rvest::session():
library(rvest)
s <- session('http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998')
# Extract approx. number of results (including citations)
about_n_results <-
  html_elements(s, "#gs_ab_md .gs_ab_mdw") |>
  html_text() |>
  stringr::str_extract("(?<=About )\\d+") |>
  as.integer()
# Storage list
titles <- vector(mode = "list", length = ceiling(about_n_results / 10))
# Extract titles, store in a list, try to follow Next link;
# repeat while there is a Next link,
# break when number of returned titles is < 10 (presumably only citations will follow)
while (!is.null(s)) {
  current_page_n <-
    html_elements(s, "span.gs_ico_nav_current ~ b") |>
    html_text(trim = TRUE) |>
    as.integer()
  titles[[current_page_n]] <- html_elements(s, '.gs_rt a') |> html_text(trim = TRUE)
  if (length(titles[[current_page_n]]) < 10) break
  # Follow Next link, return NULL when there isn't one
  s <-
    tryCatch(
      error = function(cnd) NULL,
      session_follow_link(s, xpath = "//b[text() = 'Next']/../../a")
    )
}
#> Navigating to
#> </scholar?start=10&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
#> Navigating to
#> </scholar?start=20&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
#> Navigating to
#> </scholar?start=30&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
unlist(titles)
#> [1] "'Black Lives Matter:'sport, race and ethnicity in challenging times"
#> [2] "'Pray (ing) the person marking your work isn't racist': racialised inequities in HE assessment practice"
#> [3] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup"
#> [4] "Sport, race and ethnicity in the wake of black lives matter: Introduction to the special issue"
#> [5] "A ('black') historian using sociology to write a history of 'black'sport: A critical reflection"
#> [6] "Education, Retirement and Career Transitions for'Black'Ex-Professional Footballers: From Being Idolised to Stacking Shelves"
#> [7] "Black British Students' Experiences of Assessment"
#> [8] "'That black boy's different class!': a historical sociology of the black middle-classes, boundary-work and local football in the British East-Midlands c. 1970− 2010"
#> [9] "White British Students' Experiences of Assessment"
#> [10] "'Race', politics and local football–continuity and change in the life of a British African-Caribbean local football club"
#> [11] "THE EFFECTS OF RACIALLY INCLUSIVE ASSESSMENT ON THE RACE AWARD GAP AND ON STUDENTS'LIVED EXPERIENCES OF ASSESSMENT"
#> [12] "Cavaliers Made Us 'United': Local Football, Identity Politics and Second-generation African-Caribbean Youth in the East Midlands c.1970–9"
#> [13] "Electrolytes and pH changes in pre-eclamptic rats"
#> [14] "Racially Inclusive Assessment and Academic Teaching Staff"
#> [15] "Race and Assessment in Higher Education: From Conceptualising Barriers to Making Measurable Change"
#> [16] "Conceptualising Inter-and Intra-Race-Based Barriers in Assessment"
#> [17] "Afterword: 12 Years a Black Race Inclusion Academic – Some Reflections on Working in a 'Postracism' Space"
#> [18] "'White digital footballers can't jump': (re)constructions of race in FIFA 20"
#> [19] "British South Asian Students' Experiences of Assessment"
#> [20] "'Policy Shorts': Mapping and 'Tackling'Racial Inequities in HE Assessment–Summarising the Case Study"
#> [21] "Evaluating the Racially Inclusive Curricula Toolkit in HE': Empirically Measuring the Efficacy and Impact of Making Curriculum-content Racially Inclusive on the …"
#> [22] "Discussion and Concluding Comments"
#> [23] "Football professionals, qualifications and post-playing career preparations"
#> [24] "Author interview: Q and A with Dr Paul Ian Campbell, author of education, retirement and career transitions for 'black'ex-professional footballers"
#> [25] "“SHEA BUTTER” FROM B. PARKII STUDIES ON EXTRACTION METHODS"
#> [26] "'Black Lives Matter:'sport, race and ethnicity in challenging times"
#> [27] "Internet of Things (IoT) Model for the Detection of an Infectious Disease (COVID-19)"
#> [28] "Working-class Soccer Schoolboys, Race and Education in England"
#> [29] "Retirement, Training and Transitions into Non-sport Work"
#> [30] "Ethnicity, Community and 'Local'Football"
#> [31] "ERGO: an integrated, user-friendly model for computing energy and greenhouse gas budgets of bioenergy systems"
#> [32] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup."
#> [33] "Intracellular Threshhold for Ionized Mg^ sup 2+^ in Rat Platelets"
#> [34] "Sport, Race and Ethnicity at a time of multiple global crises."
Created on 2024-10-23 with reprex v2.1.1
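One caveat about the result-count step above: the pattern "(?<=About )\\d+" only captures digits that directly follow "About ", so it would return NA if the header omits "About" and would stop at a thousands separator. A slightly more tolerant sketch in base R is below; parse_n_results() is a hypothetical helper, and the sample header strings are assumptions about what the page might show, not captured output.

```r
# Hypothetical helper: parse the approximate result count from the header text
parse_n_results <- function(txt) {
  # First run of digits/commas that is directly followed by " result(s)"
  m <- regmatches(txt, regexpr("[0-9][0-9,]*(?= results?)", txt, perl = TRUE))
  if (length(m) == 0) return(NA_integer_)  # no count found
  as.integer(gsub(",", "", m, fixed = TRUE))
}

parse_n_results("About 34 results (0.05 sec)")
#> [1] 34
parse_n_results("About 1,230 results (0.04 sec)")
#> [1] 1230
parse_n_results("7 results (0.02 sec)")
#> [1] 7
```

If the count comes back as NA, the storage list can simply start empty (titles <- list()), since assigning to titles[[current_page_n]] grows it as needed.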