[python] [api] [scrapy] [fitbit]

Scrapy DOWNLOAD_DELAY not working for sequential requests


I am currently working with the Scrapy Python library.

First, I make a FormRequest call to Fitbit's login page (https://www.fitbit.com/login) to log myself in. Then I make close to 100 requests to Fitbit's API (https://api.fitbit.com).

To avoid stressing the API (and getting banned from it!), I wanted to set a delay between requests using DOWNLOAD_DELAY in the settings.py file. However, it is not working.

I tested it with the tutorial (http://scrapy.readthedocs.io/en/latest/intro/tutorial.html) and it worked properly there.

What do you think? Is it because I am requesting an API (which is supposed to handle this kind of traffic)?

EDIT: here is the pseudocode of my spider:

import scrapy

class FitbitSpider(scrapy.Spider):
    name = "fitbit"
    start_urls = ["https://www.fitbit.com/login"]

    def parse(self, response):
        # Log in through the login form (url and formdata elided)
        yield scrapy.FormRequest(url, formdata=formdata,
                                 callback=self.after_login)

    def after_login(self, response):
        # ~100 sequential API calls that DOWNLOAD_DELAY should throttle
        for i in range(100):
            yield scrapy.Request("https://api.fitbit.com/[...]")

EDIT 2: here is my settings.py file:

BOT_NAME = 'fitbitscraper'

SPIDER_MODULES = ['fitbitscraper.spiders']
NEWSPIDER_MODULE = 'fitbitscraper.spiders'

DOWNLOAD_DELAY = 20  # 20 seconds of delay should be pretty noticeable

Solution

  • Alright, I just found the answer to my problem.

    It came from the creation of a CrawlerProcess in the main.py file I was running: it does not load the settings from the settings.py file.

    Previously, I was doing the following:

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(fitbit_spider.FitbitSpider)
    process.start()
    

    Now, if I change the CrawlerProcess to:

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'DOWNLOAD_DELAY': 20
    })
    

    I do get the delay I wanted!
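
    As a side note (my own sketch, not part of the original fix): the delay can also be set per spider through Scrapy's custom_settings class attribute, which CrawlerProcess picks up when the spider class is passed to process.crawl():

    class FitbitSpider(scrapy.Spider):
        name = "fitbit"
        # Per-spider override, merged on top of whatever settings
        # the CrawlerProcess was created with
        custom_settings = {
            'DOWNLOAD_DELAY': 20,
        }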

    Note: using get_project_settings() to create the CrawlerProcess does not work either.
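
    For reference, here is what that attempt looks like (a sketch of the usual pattern; my guess is that get_project_settings() only picks up settings.py when the script runs from the project directory, next to scrapy.cfg, since that is how Scrapy locates the settings module):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # get_project_settings() resolves settings.py through the
    # SCRAPY_SETTINGS_MODULE environment variable, derived from scrapy.cfg
    process = CrawlerProcess(get_project_settings())
    process.crawl(fitbit_spider.FitbitSpider)
    process.start()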