Iteration in web scraping of Google Scholar


I want to use R to scrape Google Scholar for authors who do not have a public profile.

One challenge is that each page returns at most 10 results, so for someone with a lot of publications the same lines of code have to be repeated for every page. I'd like to iterate where possible.

Here's the 'manual' version:

packages <- c("rvest", "xml2", "curl", "data.table")
invisible(lapply(packages, library, character.only = TRUE))

#set up searches, with a "start" counter for the successive pages of results:
url_name0 <- 'http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998'
url_name1 <- 'http://scholar.google.com/scholar?start=10&q=author:"pi+campbell"&as_ylo=1998'
url_name2 <- 'http://scholar.google.com/scholar?start=20&q=author:"pi+campbell"&as_ylo=1998'
url_name3 <- 'http://scholar.google.com/scholar?start=30&q=author:"pi+campbell"&as_ylo=1998'

#scrape
wp0 <- xml2::read_html(url_name0)
wp1 <- xml2::read_html(url_name1)
wp2 <- xml2::read_html(url_name2)
wp3 <- xml2::read_html(url_name3)

# Extract raw data (titles)
titles0 <- rvest::html_text(rvest::html_nodes(wp0, '.gs_rt'))
titles1 <- rvest::html_text(rvest::html_nodes(wp1, '.gs_rt'))
titles2 <- rvest::html_text(rvest::html_nodes(wp2, '.gs_rt'))
titles3 <- rvest::html_text(rvest::html_nodes(wp3, '.gs_rt'))

Now, for the latter section, it should be possible to iterate. I've tried a for-loop:

counter <- 0:3

titles <- vector("list", length(counter))

for (i in seq_along(counter)) {
    titles[[i]] <- rvest::html_text(rvest::html_nodes(wp[[i]], '.gs_rt'))
}

But this yields an error:

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "NULL"

(At an earlier stage I was getting a different error:)

Error in html_elements(...) : object 'wp' not found

There are subsequent operations that have a similar structure -- so if I can figure out iteration for this one, I should be able to write some more efficient code...


Solution

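  • You are currently storing the html_document objects returned by read_html() in individual objects (wp0, wp1, ...), but your loop indexes a list named wp that was never set up or filled. That would explain both errors: "object 'wp' not found" when wp does not exist at all, and the xml_find_all error on NULL when wp exists but wp[[i]] is still empty.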

    Perhaps something a bit less manual, e.g. navigating through the result pages with rvest::session():

    library(rvest)
    
    s <- session('http://scholar.google.com/scholar?start=0&q=author:"pi+campbell"&as_ylo=1998') 
    
    # Extract approx. number of results (including citations)
    about_n_results <- 
      html_elements(s, "#gs_ab_md .gs_ab_mdw") |> 
      html_text() |> 
      stringr::str_extract("(?<=About )\\d+") |> 
      as.integer()
    
    # Storage list
    titles <- vector(mode = "list", length = ceiling(about_n_results / 10))
    
    # Extract titles, store in a list, try to follow Next link;
    # repeat while there is a Next link, 
    # break when number of returned titles is < 10 (presumably only citations will follow)
    while (!is.null(s)) {
      current_page_n <- 
        html_elements(s, "span.gs_ico_nav_current ~ b") |> 
        html_text(trim = TRUE) |> 
        as.integer()
      
      titles[[current_page_n]] <- html_elements(s, '.gs_rt a') |> html_text(trim = TRUE)
      if (length(titles[[current_page_n]]) < 10) break
      
      # Follow Next link, return NULL when there isn't one
      s <- 
        tryCatch(
          error = function(cnd) NULL,
          session_follow_link(s, xpath = "//b[text() = 'Next']/../../a")
        )
    }
    #> Navigating to
    #> </scholar?start=10&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
    #> Navigating to
    #> </scholar?start=20&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
    #> Navigating to
    #> </scholar?start=30&q=author:%22pi+campbell%22&hl=en&oe=ASCII&as_sdt=0,5&as_ylo=1998>.
    
    unlist(titles)
    #>  [1] "'Black Lives Matter:'sport, race and ethnicity in challenging times"                                                                                                 
    #>  [2] "'Pray (ing) the person marking your work isn't racist': racialised inequities in HE assessment practice"                                                             
    #>  [3] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup"                                     
    #>  [4] "Sport, race and ethnicity in the wake of black lives matter: Introduction to the special issue"                                                                      
    #>  [5] "A ('black') historian using sociology to write a history of 'black'sport: A critical reflection"                                                                     
    #>  [6] "Education, Retirement and Career Transitions for'Black'Ex-Professional Footballers: From Being Idolised to Stacking Shelves"                                         
    #>  [7] "Black British Students' Experiences of Assessment"                                                                                                                   
    #>  [8] "'That black boy's different class!': a historical sociology of the black middle-classes, boundary-work and local football in the British East-Midlands c. 1970− 2010"
    #>  [9] "White British Students' Experiences of Assessment"                                                                                                                   
    #> [10] "'Race', politics and local football–continuity and change in the life of a British African-Caribbean local football club"                                            
    #> [11] "THE EFFECTS OF RACIALLY INCLUSIVE ASSESSMENT ON THE RACE AWARD GAP AND ON STUDENTS'LIVED EXPERIENCES OF ASSESSMENT"                                                  
    #> [12] "Cavaliers Made Us 'United': Local Football, Identity Politics and Second-generation African-Caribbean Youth in the East Midlands c.1970–9"                           
    #> [13] "Electrolytes and pH changes in pre-eclamptic rats"                                                                                                                   
    #> [14] "Racially Inclusive Assessment and Academic Teaching Staff"                                                                                                           
    #> [15] "Race and Assessment in Higher Education: From Conceptualising Barriers to Making Measurable Change"                                                                  
    #> [16] "Conceptualising Inter-and Intra-Race-Based Barriers in Assessment"                                                                                                   
    #> [17] "Afterword: 12 Years a Black Race Inclusion Academic – Some Reflections on Working in a 'Postracism' Space"                                                           
    #> [18] "'White digital footballers can't jump': (re)constructions of race in FIFA 20"                                                                                        
    #> [19] "British South Asian Students' Experiences of Assessment"                                                                                                             
    #> [20] "'Policy Shorts': Mapping and 'Tackling'Racial Inequities in HE Assessment–Summarising the Case Study"                                                                
    #> [21] "Evaluating the Racially Inclusive Curricula Toolkit in HE': Empirically Measuring the Efficacy and Impact of Making Curriculum-content Racially Inclusive on the …"  
    #> [22] "Discussion and Concluding Comments"                                                                                                                                  
    #> [23] "Football professionals, qualifications and post-playing career preparations"                                                                                         
    #> [24] "Author interview: Q and A with Dr Paul Ian Campbell, author of education, retirement and career transitions for 'black'ex-professional footballers"                  
    #> [25] "“SHEA BUTTER” FROM B. PARKII STUDIES ON EXTRACTION METHODS"                                                                                                          
    #> [26] "'Black Lives Matter:'sport, race and ethnicity in challenging times"                                                                                                 
    #> [27] "Internet of Things (IoT) Model for the Detection of an Infectious Disease (COVID-19)"                                                                                
    #> [28] "Working-class Soccer Schoolboys, Race and Education in England"                                                                                                      
    #> [29] "Retirement, Training and Transitions into Non-sport Work"                                                                                                            
    #> [30] "Ethnicity, Community and 'Local'Football"                                                                                                                            
    #> [31] "ERGO: an integrated, user-friendly model for computing energy and greenhouse gas budgets of bioenergy systems"                                                       
    #> [32] "'He is like a Gazelle (when he runs)'(re) constructing race and nation in match-day commentary at the men's 2018 FIFA World Cup."                                    
    #> [33] "Intracellular Threshhold for Ionized Mg^ sup 2+^ in Rat Platelets"                                                                                                   
    #> [34] "Sport, Race and Ethnicity at a time of multiple global crises."
    

    Created on 2024-10-23 with reprex v2.1.1
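
If you would rather keep the original for-loop approach, a smaller change may be enough. The following is only a minimal sketch, assuming the same query and four result pages as in the question: the URLs are built with sprintf() from the "start" offsets, and the parsed pages are kept in one list so that wp[[i]] actually exists. The helper names read_pages() and extract_text() are hypothetical, and the 2-second pause between requests is an arbitrary assumption.

```r
# Assumption: same query as in the question; only the "start" offset changes
url_template <- 'http://scholar.google.com/scholar?start=%d&q=author:"pi+campbell"&as_ylo=1998'
starts <- seq(0, 30, by = 10)          # 0, 10, 20, 30
urls <- sprintf(url_template, starts)  # one URL per results page

# Hypothetical helper: read every URL into ONE list,
# so that pages[[i]] exists when you loop over it later
read_pages <- function(urls, pause = 2) {
  lapply(urls, function(u) {
    Sys.sleep(pause)  # assumed delay between requests; adjust as needed
    xml2::read_html(u)
  })
}

# Hypothetical helper: the same extraction step for any CSS selector,
# reusable for the later operations that have a similar structure
extract_text <- function(pages, css) {
  lapply(pages, function(p) rvest::html_text(rvest::html_nodes(p, css)))
}

# Usage (performs network requests, so it is left commented out here):
# wp     <- read_pages(urls)
# titles <- extract_text(wp, ".gs_rt")
# unlist(titles)
```

Because extract_text() takes the selector as an argument, other fields can be pulled from the same list of pages without reading them again. Unlike the session() approach, the number of pages is fixed in advance here, so it has to be chosen by hand.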