python regex web-scraping scrapy

Can't get Scrapy Crawlspider to follow links


I'm trying to get the 'Rules' section of a Scrapy Crawlspider working properly.

I've found the xpath that returns the links I want to follow. It's

//*[@class="course_detail"]//td[4]/a/@href

and it returns about 2700 URLs in total.

Basically, I'm trying to tell the spider to follow everything that matches that xpath, but I can't get the following code to work properly:

rules = (
    Rule(SgmlLinkExtractor(allow=[r'.*'],
                           restrict_xpaths='//*[@class="course_detail"]//td[4]/a/@href'),
         callback='parse_item'),
)

I don't get any errors, but the spider doesn't seem to get past the page I defined in start_urls.

EDIT: Figured it out! I just had to drop the @href from the restrict_xpaths expression. Hayden's code helped too, so I'm awarding him the answer.
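
For reference, here's a minimal sketch of the spider that ended up working for me (the class name, start URL, and parse_item body are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class CourseSpider(CrawlSpider):
    name = 'courses'                      # hypothetical name
    start_urls = ['http://example.com/']  # placeholder start page

    rules = (
        # restrict_xpaths should select the <a> elements themselves;
        # the extractor reads the href attribute on its own, so the
        # trailing /@href is dropped.
        Rule(SgmlLinkExtractor(allow=(r'.*',),
                               restrict_xpaths=('//*[@class="course_detail"]//td[4]/a',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # handle each followed course page here
        pass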


Solution

  • I think allow and restrict_xpaths ought to be of the same type (i.e. either both lists or both strings) when passed to SgmlLinkExtractor. Most examples use tuples:

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'.*',),
                               restrict_xpaths=('//*[@class="course_detail"]//td[4]/a/@href',)),
             callback='parse_item'),
    )
    

    As an aside, I like to use Egyptian brackets to keep track of where my arguments begin and end.
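
    For instance, the same rule laid out in that style:

        rules = (
            Rule(SgmlLinkExtractor(
                allow=(r'.*',),
                restrict_xpaths=('//*[@class="course_detail"]//td[4]/a/@href',),
            ), callback='parse_item'),
        )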