python, web-scraping, web-crawler, scrapy, duplication

Stop Scrapy crawling the same URLs


I've written a basic Scrapy spider to crawl a website. It seems to run fine, except that it doesn't want to stop, i.e. it keeps revisiting the same URLs and returning the same content - I always end up having to stop it. I suspect it's going over the same URLs over and over again. Is there a rule that will stop this, or is there something else I have to do? Maybe middleware?

The spider is as follows:

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        join = Join()
        sel = Selector(response)
        bits = sel.xpath('//*')
        scraped_bits = []
        for bit in bits:
            scraped_bit = LsbuItem()
            scraped_bit['title'] = bit.xpath('//title/text()').extract()
            scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
            scraped_bits.append(scraped_bit)

        return scraped_bits

My settings.py file looks like this:

BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'

Any help/guidance/instruction on stopping it from running continuously would be greatly appreciated...

As I'm a newbie to this, any comments on tidying up the code would also be helpful (or links to good instruction).

Thanks...


Solution

  • The DupeFilter is enabled by default: http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class and it's based on the request URL (there's a rough sketch of the idea after the spider below).

    I tried a simplified version of your spider on a new vanilla Scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say something is wrong in your settings or in your Scrapy version. I'd suggest you upgrade to Scrapy 1.0, just to be sure :)

    $ pip install scrapy --pre
    
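    Since your spider above doesn't build any Requests by hand, this probably isn't your issue, but for reference: requests created with dont_filter=True bypass the dupefilter entirely, which would give exactly the "never stops" behaviour you describe. A minimal sketch of the default behaviour (the spider name here is just for illustration):

    import scrapy

    class ManualRequestSpider(scrapy.Spider):
        name = "manual"  # hypothetical name, for illustration only

        def start_requests(self):
            # dont_filter defaults to False, so once a URL has been
            # scheduled, later requests for the same URL are dropped
            yield scrapy.Request("http://www.lsbu.ac.uk", callback=self.parse)

        def parse(self, response):
            # re-yielding the same URL is harmless here: the dupefilter
            # silently drops it; dont_filter=True would disable that check
            yield scrapy.Request(response.url, callback=self.parse)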

    The simplified spider I tested:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy import Item, Field
    
    class LsbuItem(Item):
        title = Field()
        url = Field()
    
    class LsbuSpider(CrawlSpider):
        name = "lsbu6"
        allowed_domains = ["lsbu.ac.uk"]
    
        start_urls = [
            "http://www.lsbu.ac.uk"
        ]    
    
        rules = [
            Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
        ]    
    
        def parse_item(self, response):
            scraped_bit = LsbuItem()
            scraped_bit['url'] = response.url
            yield scraped_bit
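
    The filter mentioned at the top works roughly like this: every request URL is reduced to a fingerprint, and a request whose fingerprint has already been seen is dropped. The snippet below is only an illustrative sketch of that idea, not Scrapy's actual RFPDupeFilter code:

    import hashlib

    class SimpleDupeFilter(object):
        """Toy version of a URL-fingerprint duplicate filter."""

        def __init__(self):
            self.seen = set()

        def request_seen(self, url):
            fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
            if fingerprint in self.seen:
                return True   # duplicate -> the scheduler would drop it
            self.seen.add(fingerprint)
            return False      # first time -> the request goes through

    f = SimpleDupeFilter()
    assert f.request_seen("http://www.lsbu.ac.uk/business-and-partners/") is False
    assert f.request_seen("http://www.lsbu.ac.uk/business-and-partners/") is True

    Since you already have DUPEFILTER_DEBUG = True in your settings, Scrapy should log the duplicate requests it drops; if the crawl keeps going without any such log lines, that again points at the settings module or the Scrapy version, as described above.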