I'm trying to get the 'Rules' section of a Scrapy CrawlSpider working properly. I've found the XPath that returns the links I want to follow:

//*[@class="course_detail"]//td[4]/a/@href

It returns about 2700 URLs in total. Basically, I'm trying to tell the spider to follow everything that matches that XPath, but I can't get the following code to work properly:
rules = (
    Rule(SgmlLinkExtractor(allow=[r'.*'],
                           restrict_xpaths='//*[@class="course_detail"]//td[4]/a/@href'),
         callback='parse_item'),
)
I don't get any errors, but the spider doesn't seem to get past the page I defined in start_urls.
EDIT: Figured it out! I just had to drop the @href. Hayden's code helped too, so I'm awarding him the answer.
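In case it helps anyone else, the rule that ended up working combines both fixes: the XPath selects the <a> elements themselves (the link extractor pulls the hrefs out on its own), and the arguments are passed as tuples:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'.*',),
                           # select the <a> elements, not the @href attribute
                           restrict_xpaths=('//*[@class="course_detail"]//td[4]/a',)),
         callback='parse_item'),
)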
I think allow and restrict_xpaths ought to be of the same type (i.e. either both lists or both strings) when passed to SgmlLinkExtractor. Most examples use tuples:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'.*',),
                           restrict_xpaths=('//*[@class="course_detail"]//td[4]/a/@href',)),
         callback='parse_item'),
)
As an aside, I like to use Egyptian brackets to try and keep track of where my arguments are.
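To put the rule in context, here's a minimal self-contained sketch of a CrawlSpider using it. The spider name, domain, start URL, and parse_item body are placeholders to substitute with your own, and the imports target the older Scrapy API that ships SgmlLinkExtractor:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class CourseSpider(CrawlSpider):
    name = 'courses'
    # placeholder domain and start page; substitute your own
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/courses']

    rules = (
        # per the OP's edit, restrict_xpaths selects the <a> elements
        # rather than @href; the extractor finds the hrefs itself
        Rule(SgmlLinkExtractor(allow=(r'.*',),
                               restrict_xpaths=('//*[@class="course_detail"]//td[4]/a',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # placeholder callback: scrape each followed course page here
        self.log('Visited %s' % response.url)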