r, web-scraping, rvest, stringr, imdb

Why is my R web-scraping code for extracting all cast members and directors from the IMDb website not working?


I want to scrape data from multiple pages of the IMDb website to get information on the top Nigerian movies by popularity. I have successfully extracted the title, year, synopsis, genre, and certificate. However, I am having trouble doing the same for the cast members and directors.

This is the main IMDb link: https://www.imdb.com/search/title/?country_of_origin=NG&start=1&ref_=adv_prv

From there, I want to go into the page of each individual movie and pull out the full list of cast members and main directors.

For example, the first movie on the list is "The Trade". I want to go into this page: https://www.imdb.com/title/tt8803398/fullcredits/?ref_=tt_cl_sm and extract the full names of all the cast members and directors.

This is what I did to get the title, year, synopsis, genre, and certificate:

library(rvest)
library(tidyverse)

movies6 = data.frame()

for(page_result in seq(from = 1, to = 201, by = 50)){
  
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  
  page <- read_html(link)

  df <- page %>% 
  html_nodes(".mode-advanced") %>% 
  map_df(~list(title = html_nodes(.x, '.lister-item-header a') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               year = html_nodes(.x, '.text-muted.unbold') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               genre = html_nodes(.x, '.genre') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               certificate = html_nodes(.x, '.certificate') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               rating = html_nodes(.x, '.ratings-imdb-rating strong') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .},
               synopsis = html_nodes(.x, '.ratings-bar+ .text-muted') %>% 
                     html_text() %>% 
                     {if(length(.) == 0) NA else .}))
              

  movies6 = rbind(movies6, df)
  print(paste("Page:", page_result))

}

It worked well, and this was the result:

(https://i.sstatic.net/xEUpo.jpg)

This is what I then attempted in order to get the complete cast list for each movie:

library(rvest)
library(tidyverse)
library(stringr)


get_cast = function(movie_link) {
  movie_page = read_html(movie_link)
  movie_cast = movie_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(movie_cast)
}

movies5 = data.frame()

for(page_result in seq(from = 1, to = 151, by = 50)){
  
  link = paste0("https://www.imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
  
  page <- read_html(link)
  
  movie_links = page %>% html_nodes(".lister-item-header a") %>%
    html_attr("href") %>%
    str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
    paste("http://www.imdb.com", ., sep="")
  

  cast = sapply(movie_links, FUN = get_cast, USE.NAMES = FALSE)

  movies5 = rbind(movies5, data.frame(cast = ifelse(length(cast)==0,NA,cast)))


print(paste("Page:", page_result))

}

But this is the result I am getting: only the cast of the first movie on each page is populated, and the cast of the remaining 49 movies on each page is missing. I also modified the code to get the complete list of directors, but oddly it returns the cast instead, with the same issue as before.
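One likely contributor to the "first movie only" symptom is the `ifelse(length(cast)==0,NA,cast)` call: `ifelse()` returns a value shaped like its test, and `length(cast)==0` is a single `TRUE`/`FALSE`, so only the first element of `cast` survives. The sketch below reproduces this with base R only; the three strings are made-up stand-ins for the values `sapply()` would return for one page.

```r
# Hypothetical stand-in for the cast strings returned by sapply() for one page
cast <- c("Actor A,Actor B", "Actor C", "Actor D,Actor E")

# ifelse() is vectorised over its test; a length-1 test yields a length-1 result
kept <- ifelse(length(cast) == 0, NA, cast)
length(kept)  # 1: only the first movie's cast is kept

# A plain if/else returns the whole vector when it is non-empty
cast_all <- if (length(cast) == 0) NA_character_ else cast
page_df <- data.frame(cast = cast_all)
nrow(page_df)  # 3: one row per movie
```

Using `if (...) ... else ...` for a scalar condition is the usual choice here, since `ifelse()` is intended for element-wise tests.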

(https://i.sstatic.net/XCMaJ.jpg)

I would really appreciate it if someone could assist me on what exactly to do regarding scraping data on the cast and directors. I have tried so many things that didn't work.


Solution

  • I was able to get it working with the following code:

    library(rvest)    # read_html(), html_nodes(), html_attr(), html_text()
    library(stringr)  # str_replace(), fixed()
    library(dplyr)    # bind_rows()

    # Return a one-row data frame with the cast and the directors of one movie,
    # each collapsed into a single comma-separated string
    get_cast = function(movie_link) {
      movie_page = read_html(movie_link)
      cast = movie_page %>%
        html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>%
        html_text() %>%
        paste(collapse = ",")
      directors = movie_page %>%
        html_nodes("h4:contains('Directed by') + table a") %>%
        html_text() %>%
        paste(collapse = ",")
      return(data.frame(cast = cast, directors = directors))
    }

    movies2 = data.frame()

    for(page_result in seq(from = 1, to = 951, by = 50)){
      link = paste0("https://imdb.com/search/title/?country_of_origin=NG&start=", page_result, "&ref_=adv_nxt")
      page <- read_html(link)
      # Turn each title link into the URL of that movie's full-credits page
      movie_links = page %>%
        html_nodes(".lister-item-header a") %>%
        html_attr("href") %>%
        str_replace(pattern = fixed("?ref_=adv_li_tt"), replacement = fixed("fullcredits/?ref_=tt_cl_sm")) %>%
        paste("http://www.imdb.com", ., sep="")
      # One data frame per movie, stacked into one data frame per page
      movie_data = lapply(movie_links, get_cast)
      df = bind_rows(movie_data)
      movies2 = rbind(movies2, df)

      print(paste("Page:", page_result))
    }
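To see what the two selectors in `get_cast` pick up without hitting the site, they can be tried on a small inline fragment. The HTML below is a hypothetical, heavily simplified imitation of a full-credits page (a "Directed by" heading followed by a table, then a `cast_list` table whose first row is a label row); the names and exact markup are assumptions for illustration, not a copy of the real page.

```r
library(rvest)

# Hypothetical fragment loosely modelled on a full-credits page
credits <- read_html('
  <h4>Directed by</h4>
  <table><tr><td><a href="/name/nm0000001/">Director One</a></td></tr></table>
  <h4>Cast</h4>
  <table class="cast_list">
    <tr><td colspan="4">label row</td></tr>
    <tr><td class="primary_photo"><a href="/name/nm0000002/">photo</a></td>
        <td><a href="/name/nm0000002/">Actor A</a></td></tr>
    <tr><td class="primary_photo"><a href="/name/nm0000003/">photo</a></td>
        <td><a href="/name/nm0000003/">Actor B</a></td></tr>
  </table>')

# Second cell of every cast row except the first (label) row
cast <- credits %>%
  html_nodes(".cast_list tr:not(:first-child) td:nth-child(2) a") %>%
  html_text(trim = TRUE)

# Links in the table that immediately follows the "Directed by" heading
directors <- credits %>%
  html_nodes("h4:contains('Directed by') + table a") %>%
  html_text(trim = TRUE)

print(cast)
print(directors)
```

In this fragment only the cast rows contain a `primary_photo` cell, so a selector anchored on `.primary_photo` can only ever reach cast names; anchoring on the "Directed by" heading is what lets the second selector reach the directors table instead.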