python web-scraping scrapy href rules

Scrapy - How do I extract info from nested links?


I'm starting to learn how to use Scrapy (www.scrapy.org).

My problem is that I'm trying to extract information from a link inside another link.

The flow is like this:

We enter www.imdb.com, then in the menu click Watchlist > IMDb Top 250, which takes us to http://www.imdb.com/chart/top, where there is a list of movies.

I'm trying to enter each movie, which has a link like this: www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=1EX7BT4EGCE6HVGF919H&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1. From the movie page I then want to follow the full cast link, which looks like this: www.imdb.com/title/tt0111161/fullcredits?ref_=tt_cl_sm#cast, and extract all the actors from that last page. I know how to extract the info; what I'm struggling with is navigating the links. This is the code I have right now:

    # -*- coding: utf-8 -*-
    from scrapy import item
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class ActorsSpider(CrawlSpider):
        name = "actors"
        allowed_domains = ["www.imdb.com"]
        start_urls = ['http://www.imdb.com/chart/top',
                      'http://www.imdb.com/title/']

        def parse(self, response):
            rules = {
                Rule(LinkExtractor(allow=r'/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0BP5GZ1CWDNT2NFAWKDN&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1')),
                Rule(LinkExtractor(allow=r'fullcredits?ref_=tt_cl_sm#cast'), callback='parse_actor'),
            }

        def parse_actor(self, response):
            item['title'] = response.css('title').extract()[0]
            return item

I'm aware that this is supposed to be done recursively, but first I'm just trying to make the links work. Both links I'm trying to follow share the path /title/tt0111161/, at least for the first movie.

Also, for now I'm only extracting the title, to check that I am where I want to be.

Thanks in advance for any help.

Removed some links because I don't have 10 reputation YET.


Solution

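  • Your allowed_domains is wrong. "www.imdb.com" covers only that host and its subdomains, so requests to imdb.com without the www prefix (which the code below builds) would be filtered as offsite. It must be: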

    allowed_domains = ["imdb.com"]
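
To see why the bare domain matters, here is a rough sketch of the offsite check using only the standard library. `is_onsite` is a hypothetical helper, not part of Scrapy; it only approximates the rule that a host passes if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlsplit


def is_onsite(url, allowed_domains):
    """Approximate the offsite rule: the host must equal an allowed
    domain or be a subdomain of one (hypothetical helper, not Scrapy API)."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# A request built on the bare domain, as in the parse() method of this answer.
request_url = "http://imdb.com/title/tt0111161/fullcredits"

print(is_onsite(request_url, ["www.imdb.com"]))  # filtered with the original setting
print(is_onsite(request_url, ["imdb.com"]))      # allowed with the corrected setting
```

With "imdb.com" both imdb.com and www.imdb.com pass, which is why the shorter form is the safer choice here.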
    

    Start with the top rated movies:

    start_urls = ['http://www.imdb.com/chart/top/']
    

    Parse each movie and prepare the URL for the actors list:

    def parse(self, response):
        # Needs `import scrapy` at the top of the module.
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            # url[:17] keeps '/title/tt0111161/' and drops the query string.
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)
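
The fixed-length slice works only while every title path is exactly 17 characters long. A minimal alternative sketch, assuming hrefs shaped like the ones in the question, drops the query string with `urllib.parse` instead. `credits_url` is a hypothetical helper name, and leaving out the `ref_` parameter and the `#cast` fragment is my assumption (the fragment is never sent to the server anyway).

```python
from urllib.parse import urljoin, urlsplit


def credits_url(href, base="http://www.imdb.com/chart/top"):
    """Build the full-credits URL for a movie href found on the chart page."""
    # Resolve the (possibly relative) href against the page it came from.
    absolute = urljoin(base, href)
    parts = urlsplit(absolute)
    # Keep scheme, host and path only; the tracking query string is dropped.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return f"{parts.scheme}://{parts.netloc}{path}fullcredits"


href = "/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&ref_=chttp_tt_1"
print(credits_url(href))
```

Inside the spider, the result could be passed to `scrapy.Request` in place of the concatenated string.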
    

    Then find all actors:

    def parse_actor(self, response):
        # ImdbItem is assumed to be a scrapy.Item with 'title' and 'actors' fields.
        item = ImdbItem()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract_first()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()
        return item
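
As a side note on the original attempt: the `allow` argument of LinkExtractor is a regular expression, so the unescaped `?` in those patterns is a quantifier rather than a literal question mark. Separately, a CrawlSpider expects `rules` as a class attribute and should not have its `parse` method overridden. The sketch below uses only `re` to show the matching problem. The two replacement patterns are my own suggestion, not taken from the question, and the URLs are shortened versions of the ones above.

```python
import re

movie_page = ("http://www.imdb.com/title/tt0111161/"
              "?pf_rd_m=A2FGELUUNOQJNL&ref_=chttp_tt_1")
cast_page = "http://www.imdb.com/title/tt0111161/fullcredits?ref_=tt_cl_sm"

# Patterns in the style of the question: "/?" means "an optional slash"
# and "s?" means "an optional s", so neither matches a literal "?".
unescaped_movie = r"/title/tt0111161/?pf_rd_m="
unescaped_cast = r"fullcredits?ref_=tt_cl_sm"

# Hypothetical replacements: accept any title id, escape the literal "?".
movie_pattern = r"/title/tt\d+/\?"
cast_pattern = r"/title/tt\d+/fullcredits"


def matches(pattern, url):
    """Return True if the regex is found anywhere in the URL."""
    return re.search(pattern, url) is not None


for name, pattern, url in [
    ("unescaped movie", unescaped_movie, movie_page),
    ("unescaped cast", unescaped_cast, cast_page),
    ("escaped movie", movie_pattern, movie_page),
    ("escaped cast", cast_pattern, cast_page),
]:
    print(name, matches(pattern, url))
```

Patterns like the last two could then go into `Rule(LinkExtractor(allow=...))` entries if you prefer the CrawlSpider approach over building the requests by hand.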