r tidyverse rvest imdb

Choose the language of text results when scraping with rvest (IMDB example)


I'm trying to scrape IMDb with rvest, and I often run into a problem with the language of the output, probably because I am located in Japan.

For example, when trying to scrape the movie titles from this page:

https://www.imdb.com/chart/top/?ref_=nv_mv_250

with the following code:

library(rvest)
library(tidyverse)    
url <- "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

read_html(url) %>% 
    html_nodes(".titleColumn a") %>% 
    html_text() %>% 
    tibble(title = .) %>% 
    head()

The result is a mixture of English titles and romanized Japanese titles:

  title                 
  <chr>                 
1 Shôshanku no sora ni  
2 Goddofâzâ             
3 The Godfather: Part II
4 Dâku naito            
5 12 Angry Men          
6 Schindler's List 

This happens even though the titles are all in English both on my screen and when I inspect the elements with Chrome's developer tools.

I suspect the issue is similar to one posted on SO about scraping with PHP.

Is there a way to request English output, preferably in a tidyverse friendly pipe chain?


Solution

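  • IMDb appears to pick the language from your location unless the request says otherwise, so send an explicit Accept-Language header with the request, for example with httr:

        library(rvest)
        library(tidyverse)
        library(httr)

        # HTTP header names use hyphens: "User-Agent" and "Accept-Language".
        # With underscores the server would not recognise them as those headers.
        GET(url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250',
            add_headers(.headers = c('User-Agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
                                     'Accept-Language' = 'en-US,en;q=0.9'))) %>%
              read_html() %>%
              html_nodes(".titleColumn a") %>%
              html_text() %>%
              tibble(title = .) %>%
              head()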
        # A tibble: 6 x 1
          title                   
          <chr>                   
        1 The Shawshank Redemption
        2 The Godfather           
        3 The Godfather: Part II  
        4 The Dark Knight         
        5 12 Angry Men            
        6 Schindler's List
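
Since the question asks for something that fits into a pipe chain, the request above could be wrapped in a small helper. This is only a sketch: `lang_headers()` and `read_html_lang()` are hypothetical names made up for illustration, not part of rvest or httr, and it assumes the site honours the Accept-Language header.

```r
library(httr)
library(rvest)

# Hypothetical helper: build an httr config carrying the Accept-Language header.
lang_headers <- function(lang = "en-US,en;q=0.9") {
  add_headers("Accept-Language" = lang)
}

# Hypothetical helper: fetch a page in the requested language and parse it.
# Assumes the server respects Accept-Language; some sites ignore it.
read_html_lang <- function(url, lang = "en-US,en;q=0.9") {
  response <- GET(url, lang_headers(lang))
  stop_for_status(response)  # fail early on HTTP errors
  read_html(response)
}
```

It can then replace `read_html()` at the start of the chain, e.g. `url %>% read_html_lang() %>% html_nodes(".titleColumn a") %>% html_text()`. Keeping the header construction in its own function makes it possible to check what will be sent without touching the network.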