scrapy

How to retry the request n times when an item gets an empty field?


I'm trying to scrape a range of web pages, but I get holes in the results: sometimes it looks like the website fails to send the HTML response correctly. This leaves empty lines in the CSV output file. How can I retry the request and the parse n times when the XPath selector on the response comes back empty? Note that I don't get any HTTP errors.


Solution

  • You can do this with a custom retry middleware: you just need to override the process_response method of the built-in RetryMiddleware:

    from scrapy.downloadermiddlewares.retry import RetryMiddleware
    from scrapy.utils.response import response_status_message
    
    
    class CustomRetryMiddleware(RetryMiddleware):
    
        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            if response.status in self.retry_http_codes:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
    
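            # this is your check: retry a 200 response when retry_xpath matches nothing
            # (spiders that do not define retry_xpath are left alone)
            retry_xpath = getattr(spider, 'retry_xpath', None)
            if response.status == 200 and retry_xpath and not response.xpath(retry_xpath):
                return self._retry(request, 'empty result for xpath "{}"'.format(retry_xpath), spider) or response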
            return response
    

    Then enable it instead of the default RetryMiddleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
    }
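
Since the question asks for n retries: the retry count comes from Scrapy's RETRY_TIMES setting (default 2), which the custom middleware inherits from RetryMiddleware. A minimal fragment for the same settings.py, where the value 5 is only an example:

```python
# settings.py
RETRY_ENABLED = True  # already the default; the middleware is disabled if this is False
RETRY_TIMES = 5       # example value: up to 5 retries after the first attempt
```

A single request can also carry its own limit through the max_retry_times key in request.meta, which takes precedence over RETRY_TIMES.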
    

    The middleware now reads the XPath to check from the spider's retry_xpath attribute, so you can configure it per spider:

    class MySpider(Spider):
        name = "myspidername"
    
        retry_xpath = '//h2[@class="tadasdop-cat"]'
        ...
    

    This does not inspect your Item's fields directly, but you can set retry_xpath to the same XPath you use for that field to make it work.
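
To see why the middleware returns `self._retry(...) or response`, here is a simplified, stdlib-only model of the retry bookkeeping. It is a sketch and not Scrapy's actual code: the names retry_or_none and handle are hypothetical, plain dicts stand in for request.meta, and the string "response" stands in for the response object. The idea it illustrates is that the retry helper hands back a new request while the retry_times counter is within the limit and None once it is exhausted, at which point the (still empty) response is passed on to the spider.

```python
def retry_or_none(meta, max_retry_times):
    """Hypothetical stand-in for the retry helper: new meta, or None when exhausted."""
    retry_times = meta.get("retry_times", 0) + 1
    if retry_times <= max_retry_times:
        new_meta = dict(meta)              # copy, like retrying a copy of the request
        new_meta["retry_times"] = retry_times
        return new_meta
    return None                            # give up: no more retries allowed


def handle(meta, field_is_empty, max_retry_times=2):
    """Mirror of the middleware check: retry if the field is empty, else pass through."""
    if field_is_empty:
        return retry_or_none(meta, max_retry_times) or "response"
    return "response"


# Simulate a page that always comes back empty, with the default limit of 2 retries.
meta = {}
outcomes = []
for _ in range(4):
    result = handle(meta, field_is_empty=True)
    outcomes.append(result)
    if result == "response":               # retries exhausted, response goes to the spider
        break
    meta = result                          # the retried request carries the updated counter

print(outcomes)
# [{'retry_times': 1}, {'retry_times': 2}, 'response']
```

So a page that never recovers is requested three times in total (one attempt plus two retries) and then reaches your callback anyway, which means the spider should still tolerate an empty field.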