r, web-scraping, httr

Scrape university names from the QS World University Rankings website with R


I want to scrape the 2021 university ranking data from the QS World University Rankings website (https://www.topuniversities.com/university-rankings/world-university-rankings/2021) with R.

I found code that used to work (ref: https://www.reddit.com/r/webscraping/comments/ns1l8m/web_scraping_qs_data_with_rvest/):

library(httr)

response <- GET("https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402")

data_json <- content(response, encoding = "UTF-8")

data <- jsonlite::fromJSON(data_json)

df <- data.frame(data)

But when I run this code today, I get the following error:

Error: Argument 'txt' must be a JSON string, URL or file.

How can I extract the data at https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402 into a data frame?


Solution

  • As Mark pointed out, the site uses Cloudflare protection, which rules out static scrapers and HTTP clients such as httr or rvest. The user agreement is also explicit about automated access and scraping. What makes this a bit controversial is the data copyright (https://www.topuniversities.com/data-copyright): CC BY-NC-ND, so copying is fine but derivatives are not. Retrieval from third parties for personal use therefore seems fine, and the site owner has presumably granted access to some crawlers: at some point there was a Cloudflare banner saying that the site was offline and that content was being served from a Wayback Machine snapshot.

    That particular dataset is also mirrored by archive.org; the latest snapshot is from 2023-03-28.
    For easy retrieval we can use the archiveRetriever package:

    library(archiveRetriever)
    url_ <- "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402"
    # list the archive.org snapshots of that URL saved since the start of 2023
    ranking_mementos <- retrieve_urls(homepage  = url_,
                                      startDate = "2023-01-01",
                                      endDate   = format(Sys.Date()))
    ranking_mementos
    #> [1] "http://web.archive.org/web/20230328102439/https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402"
    # take the last snapshot in the list and parse its JSON payload
    latest <- ranking_mementos[length(ranking_mementos)]
    jsonlite::fromJSON(latest)$data |> tibble::as_tibble()
    #> # A tibble: 1,184 × 13
    #>    core_id country city  guide nid   title logo  score rank_display region stars
    #>    <chr>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr>        <chr>  <chr>
    #>  1 410     United… Camb… ""    2948… "<di… /sit… 100   1            North… ""   
    #>  2 573     United… Stan… ""    2972… "<di… /sit… 98.4  2            North… ""   
    #>  3 253     United… Camb… ""    2942… "<di… /sit… 97.9  3            North… ""   
    #>  4 94      United… Pasa… ""    2945… "<di… /sit… 97    4            North… ""   
    #>  5 478     United… Oxfo… ""    2946… "<di… /sit… 96.7  5            Europe ""   
    #>  6 201     Switze… Züri… ""    2944… "<di… /sit… 95    6            Europe ""   
    #>  7 95      United… Camb… ""    2945… "<di… /sit… 94.3  7            Europe ""   
    #>  8 356     United… Lond… ""    2940… "<di… /sit… 93.6  8            Europe ""   
    #>  9 120     United… Chic… ""    2945… "<di… /sit… 93.1  9            North… ""   
    #> 10 365     United… Lond… ""    2940… "<di… /sit… 92.9  10           Europe ""   
    #> # ℹ 1,174 more rows
    #> # ℹ 2 more variables: recm <chr>, dagger <lgl>
    

    Created on 2023-06-28 with reprex v2.0.2
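
    The title column in the output above starts with "<di…", so the university name appears to be wrapped in HTML. Below is a minimal sketch of one way to strip the tags with base R. The helper name strip_html and the sample markup are hypothetical (the truncated output does not show the real markup), and a regex like this does not decode HTML entities such as &amp;.

    ```r
    # Hypothetical helper (not from any package): remove HTML tags,
    # collapse runs of whitespace and trim the ends.
    strip_html <- function(x) {
      no_tags <- gsub("<[^>]+>", "", x)
      trimws(gsub("\\s+", " ", no_tags))
    }

    # Made-up markup standing in for values of the `title` column.
    sample_titles <- c(
      '<div class="td-wrap"><a href="/universities/example-a">Example University A </a></div>',
      '<div class="td-wrap"><a href="/universities/example-b">Example University B</a></div>'
    )

    strip_html(sample_titles)
    #> [1] "Example University A" "Example University B"
    ```

    With the tibble from above stored in a variable, say rankings, the same helper could be applied as rankings$title <- strip_html(rankings$title).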

    At the time of writing, the current online version is identical to this archived one. Note, however, that the site no longer seems to use it when serving regular user-initiated web requests.
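
    As for the error in the question: a plausible explanation (an assumption, not verified against the live site) is that the blocked request returns an HTML page instead of JSON, so content() hands fromJSON() something that is not a JSON string. A small guard makes that failure easier to read. The helper name parse_json_body is hypothetical; jsonlite::validate(), httr::status_code(), httr::http_type() and content(..., as = "text") are existing functions.

    ```r
    library(jsonlite)

    # Hypothetical helper: parse a response body as JSON only when it really
    # is a single JSON string, otherwise stop with a readable message.
    parse_json_body <- function(body_text) {
      if (!is.character(body_text) || length(body_text) != 1 ||
          !jsonlite::validate(body_text)) {
        stop("Body is not valid JSON; the server may have returned an HTML page.",
             call. = FALSE)
      }
      jsonlite::fromJSON(body_text)
    }

    # Usage against a live response (needs network access, so left commented out):
    # response  <- httr::GET(url_)
    # httr::status_code(response)  # an error status would suggest a blocked request
    # httr::http_type(response)    # "text/html" would mean the body is not JSON
    # body_text <- httr::content(response, as = "text", encoding = "UTF-8")
    # data      <- parse_json_body(body_text)
    ```

    Requesting the body with as = "text" keeps it a plain string, so the check runs on what the server actually sent rather than on an automatically parsed object.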