I'm trying to scrape IMDb with rvest, and I often run into a problem with the language of the output, probably because I'm located in Japan.
For example, when trying to scrape the movie titles from this page:
https://www.imdb.com/chart/top/?ref_=nv_mv_250
with the following code:
library(rvest)
library(tidyverse)
url <- "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
read_html(url) %>%
html_nodes(".titleColumn a") %>%
html_text() %>%
tibble(title = .) %>%
head()
The result is a mix of English titles and romanized Japanese titles:
title
<chr>
1 Shôshanku no sora ni
2 Goddofâzâ
3 The Godfather: Part II
4 Dâku naito
5 12 Angry Men
6 Schindler's List
This happens even though the text on my screen, and the elements I inspect with Chrome's developer tools, are all in English.
I guess the issue is similar to the one posted on SO here with reference to scraping using PHP.
Is there a way to request English output, preferably in a tidyverse friendly pipe chain?
Try sending explicit request headers that ask for English content:
library(rvest)
library(tidyverse)
library(httr)
# HTTP header names use hyphens ('User-Agent', 'Accept-Language'), not underscores
GET(url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250',
    add_headers(.headers = c('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
                             'Accept-Language' = 'en-US,en;q=0.9'))) %>%
read_html() %>%
html_nodes(".titleColumn a") %>%
html_text() %>%
tibble(title = .) %>%
head()
# A tibble: 6 x 1
title
<chr>
1 The Shawshank Redemption
2 The Godfather
3 The Godfather: Part II
4 The Dark Knight
5 12 Angry Men
6 Schindler's List
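If you need this for more than one page, the request can be wrapped in a small helper. The following is only a minimal sketch: `read_html_en` and its `lang` argument are names made up for illustration, and it assumes the server honours the `Accept-Language` header. The `.titleColumn a` selector in the usage comment is the one from the question and may not match IMDb's current markup.

```r
library(httr)
library(rvest)

# Hypothetical helper (not part of httr or rvest): fetch a page while
# asking the server for English content via the Accept-Language header.
read_html_en <- function(url, lang = "en-US,en;q=0.9") {
  resp <- GET(url, add_headers(`Accept-Language` = lang))
  stop_for_status(resp)  # stop with an error on 4xx/5xx responses
  read_html(resp)        # parse the response body as HTML
}

# Usage (needs network access):
# read_html_en("https://www.imdb.com/chart/top/?ref_=nv_mv_250") %>%
#   html_nodes(".titleColumn a") %>%
#   html_text() %>%
#   head()
```

Since the helper returns a parsed document, it drops into the same pipe chain as `read_html(url)` did in the question.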