I want to scrape 2021 University Ranking data from the QS World University Rankings website (https://www.topuniversities.com/university-rankings/world-university-rankings/2021) with R.
I found code that worked until recently (ref: https://www.reddit.com/r/webscraping/comments/ns1l8m/web_scraping_qs_data_with_rvest/):
library(httr)
response <- GET("https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402")
data_json <- content(response, encoding = "UTF-8")
data <- jsonlite::fromJSON(data_json)
df <- data.frame(data)
But when I run this code today, I get the following error:
Error: Argument 'txt' must be a JSON string, URL or file.
How can I extract the data at https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402 into a data frame?
As Mark pointed out, the site uses CloudFlare protection, which rules out static scrapers and HTTP clients like httr or rvest. The user agreement is also explicit regarding any automated access and scraping. What makes it a bit controversial is the data copyright (https://www.topuniversities.com/data-copyright), which is CC BY-NC-ND (copying is fine; derivatives are not). So retrieval from third parties for personal use seems fine, and the site holder has presumably granted access to some crawlers: at some point there was a CloudFlare banner saying the site was offline and that content was being served from a Wayback Machine snapshot.
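As a side note on the error in the question: `fromJSON()` raises it when it is handed something that is not a character string, and `httr::content()` without `as = "text"` may return a parsed or raw object rather than text. The sketch below is only an illustration of that check. `parse_json_body()` is a hypothetical helper (not part of httr or jsonlite), and the two bodies are made-up stand-ins for real responses; whether the live site returns an HTML challenge page is an assumption.

```r
library(jsonlite)

# Hypothetical helper: parse a response body only if it is a single
# string that contains valid JSON; otherwise fail with a clearer message.
parse_json_body <- function(body) {
  if (!is.character(body) || length(body) != 1L) {
    stop("Body is not a single string; got class: ",
         paste(class(body), collapse = "/"))
  }
  if (!jsonlite::validate(body)) {
    stop("Body is text but not valid JSON (possibly an HTML challenge page).")
  }
  jsonlite::fromJSON(body)
}

# Made-up bodies standing in for real responses (assumptions, not site data)
ok_body   <- '{"data":[{"rank_display":"1","score":"100"}]}'
html_body <- "<html><body>Just a moment...</body></html>"

parsed <- parse_json_body(ok_body)

# Capture the error messages instead of stopping the script
html_msg <- tryCatch(parse_json_body(html_body),
                     error = function(e) conditionMessage(e))
list_msg <- tryCatch(parse_json_body(list(a = 1)),
                     error = function(e) conditionMessage(e))
```

In a real session the body would come from `content(response, as = "text", encoding = "UTF-8")`, so that a string, rather than an automatically parsed object, reaches the helper.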
That particular dataset is also mirrored by archive.org; the latest snapshot is from 2023-03-28.
For easy retrieval we could use the archiveRetriever package:
library(archiveRetriever)
url_ <- "https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402"
ranking_mementos <- retrieve_urls(homepage = url_,
                                  startDate = "2023-01-01",
                                  endDate = format(Sys.Date()))
ranking_mementos
#> [1] "http://web.archive.org/web/20230328102439/https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/2057712.txt?1622792449?v=1622946607402"
latest <- ranking_mementos[length(ranking_mementos)]
jsonlite::fromJSON(latest)$data |> tibble::as_tibble()
#> # A tibble: 1,184 × 13
#> core_id country city guide nid title logo score rank_display region stars
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 410 United… Camb… "" 2948… "<di… /sit… 100 1 North… ""
#> 2 573 United… Stan… "" 2972… "<di… /sit… 98.4 2 North… ""
#> 3 253 United… Camb… "" 2942… "<di… /sit… 97.9 3 North… ""
#> 4 94 United… Pasa… "" 2945… "<di… /sit… 97 4 North… ""
#> 5 478 United… Oxfo… "" 2946… "<di… /sit… 96.7 5 Europe ""
#> 6 201 Switze… Züri… "" 2944… "<di… /sit… 95 6 Europe ""
#> 7 95 United… Camb… "" 2945… "<di… /sit… 94.3 7 Europe ""
#> 8 356 United… Lond… "" 2940… "<di… /sit… 93.6 8 Europe ""
#> 9 120 United… Chic… "" 2945… "<di… /sit… 93.1 9 North… ""
#> 10 365 United… Lond… "" 2940… "<di… /sit… 92.9 10 Europe ""
#> # ℹ 1,174 more rows
#> # ℹ 2 more variables: recm <chr>, dagger <lgl>
Created on 2023-06-28 with reprex v2.0.2
At the time of writing, the current online version is identical to this archived one. Note, however, that the site no longer seems to use this file when serving regular user-initiated web requests.
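The tibble above keeps every column as character, and `title` holds an HTML fragment rather than a plain name. A minimal base R sketch of one possible clean-up follows. The toy data frame and the shape of the HTML in `title` are assumptions made for illustration, not values copied from the real file, and `strip_html()` is a hypothetical helper.

```r
# Toy stand-in for a few columns of the parsed ranking data.
# The HTML in `title` is an assumed shape, not taken from the real file.
rankings <- data.frame(
  rank_display = c("1", "2", "3"),
  title = c(
    '<div class="td-wrap"><a href="/universities/a">University A</a></div>',
    '<div class="td-wrap"><a href="/universities/b">University B</a></div>',
    '<div class="td-wrap"><a href="/universities/c">University C</a></div>'
  ),
  score = c("100", "98.4", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop anything that looks like a tag, then trim.
# A regex is enough for simple fragments; an HTML parser would be more robust.
strip_html <- function(x) trimws(gsub("<[^>]+>", "", x))

rankings$title <- strip_html(rankings$title)

# Empty or non-numeric scores become NA
rankings$score <- suppressWarnings(as.numeric(rankings$score))
```

The same two assignments could be applied to the data frame returned by `jsonlite::fromJSON(latest)$data`, provided its `title` values really are simple HTML fragments.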