I'm starting to learn how to use Scrapy (www.scrapy.org).
My problem is that I'm trying to extract information from a link inside another link.
The flow is like this:
Starting from www.imdb.com, I click Watchlist > IMDb Top 250 in the menu, which leads to http://www.imdb.com/chart/top, a page with a list of movies.
From there I want to follow each movie link, which looks like this: www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=1EX7BT4EGCE6HVGF919H&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1
and then follow the movie's full cast link, which looks like this: www.imdb.com/title/tt0111161/fullcredits?ref_=tt_cl_sm#cast
Finally, I want to extract all the actors from that last page. I know how to extract the information, but I'm struggling with navigating the links. This is the code I have right now:
    # -*- coding: utf-8 -*-
    from scrapy import item
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ActorsSpider(CrawlSpider):
        name = "actors"
        allowed_domains = ["www.imdb.com"]
        start_urls = ['http://www.imdb.com/chart/top',
                      'http://www.imdb.com/title/']

        def parse(self, response):
            rules = {
                Rule(LinkExtractor(allow=r'/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2398042102&pf_rd_r=0BP5GZ1CWDNT2NFAWKDN&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1')),
                Rule(LinkExtractor(allow=r'fullcredits?ref_=tt_cl_sm#cast'), callback='parse_actor'),
            }

        def parse_actor(self, response):
            item['title'] = response.css('title').extract()[0]
            return item
I'm aware that this is supposed to be done recursively, but first I'm just trying to get the links to work. Both links I'm trying to follow share the /title/tt0111161/ prefix, at least for the first link.
Also, I'm just extracting title, for now, to know if I am where I want to be.
Thanks in advance for any help.
Removed some links because I don't have 10 reputation YET.
Your allowed_domains is wrong. It should be the bare domain, which also covers subdomains such as www.imdb.com:

    allowed_domains = ["imdb.com"]

Start with the top-rated movies:

    start_urls = ['http://www.imdb.com/chart/top/']
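Parse each movie and prepare the URL for the actors list:

    # Requires `import scrapy` at the top of the module.
    def parse(self, response):
        for film in response.css('.titleColumn'):
            # Relative link such as /title/tt0111161/?pf_rd_m=...
            url = film.css('a::attr(href)').extract_first()
            # The first 17 characters are the '/title/tt0111161/' part.
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)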
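The `url[:17]` slice above depends on every title ID having exactly the same length. As a rough sketch of a less brittle alternative, the helper below pulls the `/title/tt.../` prefix out with a regular expression and joins it to a base URL using only the standard library. The function name `fullcredits_url`, the `TITLE_PATH` pattern, and the default base URL are illustrative assumptions, not part of Scrapy or of the original answer. It also drops the `ref_` query and the `#cast` fragment, on the assumption that the page loads without them (a fragment is never sent to the server anyway).

```python
import re
from urllib.parse import urljoin

# Assumption: IMDb title paths look like /title/tt<digits>/
TITLE_PATH = re.compile(r"^/title/tt\d+/")


def fullcredits_url(href, base="http://www.imdb.com"):
    """Return the full-credits URL for a movie href, or None if href is not a title link."""
    match = TITLE_PATH.match(href)
    if match is None:
        return None
    # match.group(0) is the '/title/tt.../' prefix, whatever its length.
    return urljoin(base, match.group(0) + "fullcredits")


if __name__ == "__main__":
    href = "/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&ref_=chttp_tt_1"
    print(fullcredits_url(href))
```

Inside `parse`, the result could then be passed to `scrapy.Request` in place of the hand-built `actors_url`, skipping any href for which the helper returns None.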
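Then find all actors:

    # ImdbItem is assumed to be a scrapy.Item subclass with `title` and
    # `actors` fields, defined in your project's items.py.
    def parse_actor(self, response):
        item = ImdbItem()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract_first()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()
        return item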
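As a side note on the rules in the question: the `allow` argument of `LinkExtractor` is treated as a regular expression, so an unescaped `?` does not match a literal question mark; it makes the preceding character optional. The small check below uses only the standard `re` module to show the effect on the full cast URL from the question. The variable names and the two alternative patterns are illustrative assumptions, not something taken from the Scrapy documentation.

```python
import re

url = "http://www.imdb.com/title/tt0111161/fullcredits?ref_=tt_cl_sm#cast"

patterns = {
    # Pattern from the question: '?' makes the final 's' optional.
    "original": r"fullcredits?ref_=tt_cl_sm#cast",
    # Same idea with the question mark escaped.
    "escaped": r"fullcredits\?ref_=tt_cl_sm",
    # Generic pattern that matches the full cast page of any title.
    "general": r"/title/tt\d+/fullcredits",
}


def matches(pattern, target):
    """Return True if the regular expression is found anywhere in target."""
    return re.search(pattern, target) is not None


results = {name: matches(pattern, url) for name, pattern in patterns.items()}

if __name__ == "__main__":
    for name, found in results.items():
        print(name, found)
```

If you do stay with CrawlSpider, keep in mind that `rules` needs to be a class attribute and that `parse` should not be overridden, because CrawlSpider uses `parse` internally to apply the rules.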