Similar to the linked question, "How can i use multiple requests and pass items in between them in scrapy python",
I am trying to chain requests from a spider as in Dave McLain's answer. Returning a request object from the parse function works fine and lets the spider continue with the next request:
    def parse(self, response):
        # Some operations
        self.url_index += 1
        if self.url_index < len(self.urls):
            return scrapy.Request(url=self.urls[self.url_index], callback=self.parse)
        return items
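The control flow described above can be sketched without Scrapy. The snippet below is a toy model, not Scrapy's engine: `FakeRequest`, `ChainSpider`, and `run` are hypothetical stand-ins, and the "response" is just the URL string. It only illustrates that each returned request leads to another call of the callback, until something other than a request comes back.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request (url + callback only)."""
    url: str
    callback: Callable


class ChainSpider:
    """Toy spider that visits its URLs one after another."""

    def __init__(self, urls: List[str]):
        self.urls = urls
        self.url_index = 0
        self.items = []

    def parse(self, response: str):
        # "Some operations": here we just record which URL was handled.
        self.items.append({"url": response})
        self.url_index += 1
        if self.url_index < len(self.urls):
            # Returning a request asks the driver to continue the chain.
            return FakeRequest(url=self.urls[self.url_index], callback=self.parse)
        return self.items


def run(spider: ChainSpider):
    """Toy driver: follow returned requests until items come back."""
    output = spider.parse(spider.urls[0])
    while isinstance(output, FakeRequest):
        # The "download" is faked: the URL itself acts as the response.
        output = output.callback(output.url)
    return output
```

Calling `run(ChainSpider(["a", "b", "c"]))` walks through all three URLs in order, one callback invocation per returned request.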
However, I am using the default spider middleware, where I do some caching and logging in process_spider_output. The request object returned from the parse function passes through the middleware first, so the middleware has to return the request object as well.
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        if hasattr(spider, 'multiple_urls'):
            if spider.url_index + 1 < len(spider.urls):
                return [result]
                # return [scrapy.Request(url=spider.urls[spider.url_index], callback=spider.parse)]
        # Some operations ...
According to the documentation, it must return an iterable of Request or item objects. However, when I return the result (which is a Request object) or construct a new request object (as in the commented-out line), the spider just terminates (emitting the spider finished signal) without making a new request.
Documentation link: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#writing-your-own-spider-middleware
I am not sure whether the issue is with the documentation or with the way I interpret it, but returning request objects from the middleware doesn't trigger a new request; instead, it terminates the flow.
The problem was simple yet frustrating to solve. The middleware is supposed to return an iterable of request objects. However, putting the request object into a list (which is an iterable) doesn't seem to work. Using yield result in the process_spider_output middleware function works instead.
Since the main issue is resolved, I'll leave this answer as a reference. Better explanations of why this is the case are appreciated.
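One possible explanation, offered as a sketch rather than a confirmed diagnosis. The documentation describes `result` as an iterable of requests and items, so `[result]` would be a list containing that iterable rather than a list of requests. In addition, if the function contains a `yield` anywhere in its body (the elided "Some operations" part may, since the generated project template ends with a loop that yields each element), Python treats the whole function as a generator, and `return [result]` then ends the generator without emitting anything. The snippet below reproduces both effects in plain Python; `FakeRequest` and the three function names are hypothetical stand-ins, and no Scrapy code is involved.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request."""
    url: str


def returns_nested_list(result):
    # No yield here, so this is a normal function. The caller receives a
    # list whose single element is the whole iterable, not a request.
    return [result]


def returns_inside_generator(result, chain=True):
    # The yield further down makes this a generator function, so the
    # value given to `return` is not delivered to whoever iterates it.
    if chain:
        return [result]
    for element in result:
        yield element


def forwards_each_element(result):
    # Forward every element of the incoming iterable one by one.
    yield from result


incoming = [FakeRequest(url="http://example.com/page2")]

nested = returns_nested_list(incoming)
empty = list(returns_inside_generator(incoming))
forwarded = list(forwards_each_element(incoming))
```

Under these assumptions, only the last variant hands individual request objects back to the caller. Whether `yield result` itself forwards a usable request depends on what `result` is at that point; iterating over it (`for r in result: yield r`, or `yield from result`) avoids that ambiguity.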