Outgoing and Incoming Bandwidth at Regular Intervals Using Scrapy


Is it possible to get stats such as the outgoing and incoming bandwidth used during a crawl with Scrapy, reported at regular intervals?


Solution

  • Yes, it is possible. =)

    The total request and response bytes are already tracked in the stats by the DownloaderStats middleware that ships with Scrapy. You can add another downloader middleware that tracks elapsed time and adds the new stats.

    Here are the steps for it:

    1. Configure a new downloader middleware in settings.py with a high order number so it runs late in the middleware chain:

      DOWNLOADER_MIDDLEWARES = {
          'testing.middlewares.InOutBandwithStats': 990,
      }

    2. Put the following code into a middlewares.py file in the same directory as settings.py:

      import time

      class InOutBandwithStats(object):

          def __init__(self, stats):
              self.stats = stats
              self.startedtime = time.time()

          @classmethod
          def from_crawler(cls, crawler):
              return cls(crawler.stats)

          def elapsed_seconds(self):
              return time.time() - self.startedtime

          def process_request(self, request, spider):
              request_bytes = self.stats.get_value('downloader/request_bytes')

              if request_bytes:
                  outgoing_bytes_per_second = request_bytes / self.elapsed_seconds()
                  self.stats.set_value('downloader/outgoing_bytes_per_second',
                                       outgoing_bytes_per_second)

          def process_response(self, request, response, spider):
              response_bytes = self.stats.get_value('downloader/response_bytes')

              if response_bytes:
                  incoming_bytes_per_second = response_bytes / self.elapsed_seconds()
                  self.stats.set_value('downloader/incoming_bytes_per_second',
                                       incoming_bytes_per_second)

              return response

    And that's it. The process_request/process_response methods will be called whenever a request/response is processed and will keep updating the stats accordingly.
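    The arithmetic the middleware performs can be exercised outside Scrapy with a stand-in stats collector. This is a minimal sketch for illustration: FakeStats is a hypothetical dict-backed stand-in, not Scrapy's real stats object, and the middleware body is the same logic as above (trimmed to process_response):

```python
import time

class FakeStats:
    """Hypothetical stand-in for Scrapy's stats collector (dict-backed)."""
    def __init__(self):
        self._values = {}

    def get_value(self, key, default=None):
        return self._values.get(key, default)

    def set_value(self, key, value):
        self._values[key] = value

class InOutBandwithStats(object):
    def __init__(self, stats):
        self.stats = stats
        self.startedtime = time.time()

    def elapsed_seconds(self):
        return time.time() - self.startedtime

    def process_response(self, request, response, spider):
        response_bytes = self.stats.get_value('downloader/response_bytes')
        if response_bytes:
            self.stats.set_value('downloader/incoming_bytes_per_second',
                                 response_bytes / self.elapsed_seconds())
        return response

stats = FakeStats()
mw = InOutBandwithStats(stats)
mw.startedtime = time.time() - 10             # pretend the crawl started 10s ago
stats.set_value('downloader/response_bytes', 10000)  # as DownloaderStats would
mw.process_response(None, None, None)

rate = stats.get_value('downloader/incoming_bytes_per_second')
print(rate)  # roughly 1000: 10000 bytes averaged over ~10 seconds
```

    Note that because the byte counters are cumulative, the figure is an average rate since the crawl started, not an instantaneous one.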

    If you want logs at regular intervals you can also call spider.log('Incoming bytes/sec: %s' % incoming_bytes_per_second) there.
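    Calling spider.log on every response can be noisy, so one dependency-free way to get output at a steady cadence is to rate-limit the call yourself. A minimal sketch, assuming you keep a timestamp on the middleware; PeriodicLog and maybe_log are illustrative names, not Scrapy API:

```python
import time

class PeriodicLog:
    """Emits a message at most once per `interval` seconds."""
    def __init__(self, interval):
        self.interval = interval
        self._last = 0.0  # epoch of the last emitted message

    def maybe_log(self, emit, message):
        now = time.time()
        if now - self._last >= self.interval:
            self._last = now
            emit(message)
            return True
        return False

# In the middleware you would call, e.g.:
#   self.periodic.maybe_log(spider.log, 'Incoming bytes/sec: %s' % rate)
# Demonstrated here with a plain list instead of spider.log:
logs = []
p = PeriodicLog(interval=60.0)
first = p.maybe_log(logs.append, 'Incoming bytes/sec: 1000')   # emitted
second = p.maybe_log(logs.append, 'Incoming bytes/sec: 1001')  # suppressed
```

    The first call always fires (the last-emitted timestamp starts at zero); subsequent calls are suppressed until the interval has passed.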
