I am currently working with the Scrapy Python library.
First I make a FormRequest call to Fitbit's login page (https://www.fitbit.com/login) to log myself in. Then I make close to 100 requests to Fitbit's API (https://api.fitbit.com).
To avoid stressing the API (and to avoid getting banned from it!), I wanted to set a delay between the requests using DOWNLOAD_DELAY in the settings.py file. However, the delay is not being applied.
I tested the same setting with the tutorial project (http://scrapy.readthedocs.io/en/latest/intro/tutorial.html) and it worked properly there.
What do you think? Is it because I am requesting an API (which is supposed to handle this kind of access)?
EDIT: here is the pseudo code of my spider:
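    import scrapy

    class FitbitSpider(scrapy.Spider):
        start_urls = ["https://www.fitbit.com/login"]

        def parse(self, response):
            # url and formdata are placeholders for the login form details
            yield scrapy.FormRequest(url, formdata=formdata, callback=self.after_login)

        def after_login(self, response):
            # queue ~100 API requests once the login has succeeded
            for i in range(100):
                yield scrapy.Request("https://api.fitbit.com/[...]")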
EDIT 2: here is my settings.py file:
    BOT_NAME = 'fitbitscraper'
    SPIDER_MODULES = ['fitbitscraper.spiders']
    NEWSPIDER_MODULE = 'fitbitscraper.spiders'
    DOWNLOAD_DELAY = 20  # 20 seconds of delay should be pretty noticeable
Alright, I just found the answer to my problem.
The cause was the way I created the CrawlerProcess in the main.py file I was running: it was built from its own dictionary of settings, so it never loaded the settings from the settings.py file.
Beforehand I was doing the following:
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
Now if I change the CrawlerProcess into:
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'DOWNLOAD_DELAY': 20
    })
I do get the expected delay!
Note: using get_project_settings() to create the CrawlerProcess did not work either.
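If you would rather not duplicate every value from settings.py inside main.py, one possible workaround is to read the settings module yourself and hand the resulting dictionary to CrawlerProcess. The following is only a minimal sketch: settings_from_module and run_spider are hypothetical helper names, and the module path "fitbitscraper.settings" is an assumption based on the project name shown above. It has not been tested against the original project.

```python
import importlib


def settings_from_module(module, overrides=None):
    """Collect the UPPERCASE names of a settings module into a plain dict.

    `module` may be an imported module object or a dotted module path.
    Values in `overrides` win over the values found in the module.
    """
    if isinstance(module, str):
        module = importlib.import_module(module)
    settings = {
        name: getattr(module, name)
        for name in dir(module)
        if name.isupper()  # skips dunders such as __name__ and lowercase helpers
    }
    settings.update(overrides or {})
    return settings


def run_spider(spider_cls, settings_module="fitbitscraper.settings"):
    """Start a crawl with the values from settings.py plus a custom user agent.

    The default module path is an assumption; adjust it to your project.
    """
    from scrapy.crawler import CrawlerProcess  # imported here so the helper above works without Scrapy

    process = CrawlerProcess(settings_from_module(settings_module, {
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    }))
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl is finished
```

In main.py this would be called as run_spider(FitbitSpider), which should pick up DOWNLOAD_DELAY = 20 from settings.py as long as the settings module is importable from the directory where the script is run.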