Outgoing and Incoming Bandwidth at Regular Intervals Using Scrapy


Is it possible to get stats such as the outgoing and incoming bandwidth used during a crawl with Scrapy, reported at regular intervals?


Solution

  • Yes, it is possible. =)

    The total request and response bytes are already tracked in the stats by the DownloaderStats middleware that ships with Scrapy. You can add another downloader middleware that tracks elapsed time and adds the new stats.

    Here are the steps for it:

    1. Configure a new downloader middleware in settings.py with a high order number so it runs late in the middleware chain:

      DOWNLOADER_MIDDLEWARES = {
          'testing.middlewares.InOutBandwithStats': 990,
      }

    2. Put the following code into a middlewares.py file in the same directory as settings.py:

      import time

      class InOutBandwithStats(object):

          def __init__(self, stats):
              self.stats = stats
              self.startedtime = time.time()

          @classmethod
          def from_crawler(cls, crawler):
              return cls(crawler.stats)

          def elapsed_seconds(self):
              return time.time() - self.startedtime

          def process_request(self, request, spider):
              request_bytes = self.stats.get_value('downloader/request_bytes')

              if request_bytes:
                  outgoing_bytes_per_second = request_bytes / self.elapsed_seconds()
                  self.stats.set_value('downloader/outgoing_bytes_per_second',
                                       outgoing_bytes_per_second)

          def process_response(self, request, response, spider):
              response_bytes = self.stats.get_value('downloader/response_bytes')

              if response_bytes:
                  incoming_bytes_per_second = response_bytes / self.elapsed_seconds()
                  self.stats.set_value('downloader/incoming_bytes_per_second',
                                       incoming_bytes_per_second)

              return response

    And that's it. The process_request/process_response methods will be called whenever a request/response is processed and will keep updating the stats accordingly.
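    The arithmetic the middleware performs can be exercised outside Scrapy with a stand-in stats collector. This is a minimal sketch for illustration: FakeStats is a hypothetical dict-backed stand-in, not Scrapy's real stats object, and the middleware body is the same logic as above (trimmed to process_response):

```python
import time

class FakeStats:
    """Hypothetical stand-in for Scrapy's stats collector (dict-backed)."""
    def __init__(self):
        self._values = {}

    def get_value(self, key, default=None):
        return self._values.get(key, default)

    def set_value(self, key, value):
        self._values[key] = value

class InOutBandwithStats(object):
    def __init__(self, stats):
        self.stats = stats
        self.startedtime = time.time()

    def elapsed_seconds(self):
        return time.time() - self.startedtime

    def process_response(self, request, response, spider):
        response_bytes = self.stats.get_value('downloader/response_bytes')
        if response_bytes:
            self.stats.set_value('downloader/incoming_bytes_per_second',
                                 response_bytes / self.elapsed_seconds())
        return response

stats = FakeStats()
mw = InOutBandwithStats(stats)
mw.startedtime = time.time() - 10             # pretend the crawl started 10s ago
stats.set_value('downloader/response_bytes', 10000)  # as DownloaderStats would
mw.process_response(None, None, None)

rate = stats.get_value('downloader/incoming_bytes_per_second')
print(rate)  # roughly 1000: 10000 bytes averaged over ~10 seconds
```

    Note that because the byte counters are cumulative, the figure is an average rate since the crawl started, not an instantaneous one.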

    If you want logs at regular intervals you can also call spider.log('Incoming bytes/sec: %s' % incoming_bytes_per_second) there.
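    Calling spider.log on every response can be noisy, so one dependency-free way to get output at a steady cadence is to rate-limit the call yourself. A minimal sketch, assuming you keep a timestamp on the middleware; PeriodicLog and maybe_log are illustrative names, not Scrapy API:

```python
import time

class PeriodicLog:
    """Emits a message at most once per `interval` seconds."""
    def __init__(self, interval):
        self.interval = interval
        self._last = 0.0  # epoch of the last emitted message

    def maybe_log(self, emit, message):
        now = time.time()
        if now - self._last >= self.interval:
            self._last = now
            emit(message)
            return True
        return False

# In the middleware you would call, e.g.:
#   self.periodic.maybe_log(spider.log, 'Incoming bytes/sec: %s' % rate)
# Demonstrated here with a plain list instead of spider.log:
logs = []
p = PeriodicLog(interval=60.0)
first = p.maybe_log(logs.append, 'Incoming bytes/sec: 1000')   # emitted
second = p.maybe_log(logs.append, 'Incoming bytes/sec: 1001')  # suppressed
```

    The first call always fires (the last-emitted timestamp starts at zero); subsequent calls are suppressed until the interval has passed.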
