python-2.7, scrapy

Scrapy webcrawler gets caught in infinite loop, despite initially working.


Alright, so I'm working on a Scrapy-based webcrawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work and I've gotten the downloading to work; I can't get the crawling to work. I've read the documentation on the Spider class, I've read the documentation on how parse is supposed to work, and I've tried returning vs. yielding, but I'm still nowhere.

What seems to happen, from a debug script I wrote, is the following: the code will run, it will grab page 1 just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all. I don't know where the mistake in my code is or how to alter it to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.

import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
        ]
    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        # Grab the raw "Next page" <a> element and fish the href out of it
        # by splitting the tag's HTML on whitespace.
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
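
For reference, the idiomatic way to chain pages in a plain Spider is to extract the href attribute itself and yield one Request per page back into the same callback, with no shared instance state. A minimal sketch of that pattern, assuming Scrapy 1.x (for extract_first and response.urljoin) and the same "Next page" anchor:

import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def parse(self, response):
        # ... parse/download the current page here ...
        # Take the href attribute directly instead of splitting raw HTML.
        next_href = response.xpath("//a[@title='Next page']/@href").extract_first()
        if next_href:
            # urljoin resolves the relative link against the current page's URL.
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)

Each next-page URL is unique, so the duplicate filter never needs dont_filter=True here.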

Solution

  • Use Scrapy's rule engine (a CrawlSpider with a Rule and a LinkExtractor), so that you don't need to write the next-page crawling code in the parse function. Just pass the XPath for the next-page link in restrict_xpaths, and the callback will get the response of each crawled page. Note that CrawlSpider uses parse internally to drive the rules, so the callback must have a different name; parse_page is used below.

      rules = (
          Rule(LinkExtractor(restrict_xpaths='//a[contains(text(),"Next")]'),
               callback="parse_page", follow=True),
      )

      def parse_page(self, response):
          print response.url
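
For completeness, the whole spider might look like the sketch below, assuming Scrapy 1.x import paths and reusing the next-page XPath above; parse_page is just an illustrative callback name:

      from scrapy.spiders import CrawlSpider, Rule
      from scrapy.linkextractors import LinkExtractor

      class ParadiseSpider(CrawlSpider):
          name = "testcrawl2"
          start_urls = [
              "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
          ]

          # Follow every "Next" link; each downloaded page is handed to
          # parse_page. CrawlSpider uses parse() internally, so it must not
          # be overridden.
          rules = (
              Rule(LinkExtractor(restrict_xpaths='//a[contains(text(),"Next")]'),
                   callback="parse_page", follow=True),
          )

          def parse_page(self, response):
              print response.url  # parsing/downloading logic goes here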