Tags: python, memory, scrapy, scrapinghub

Scrapy hidden memory leak


Background - TL;DR: I have a memory leak in my project

I've spent a few days going through the Scrapy memory leak docs and can't find the problem. I'm developing a medium-sized Scrapy project, ~40k requests per day.

I am hosting this using Scrapinghub's scheduled runs.

On Scrapinghub, for $9 per month, you are essentially given 1 VM with 1GB of RAM to run your crawlers.

I've developed a crawler locally and uploaded it to Scrapinghub; the only problem is that towards the end of the run, I exceed the memory limit.

Locally, setting CONCURRENT_REQUESTS=16 works fine, but on Scrapinghub it leads to exceeding the memory at around the 50% point. With CONCURRENT_REQUESTS=4, I exceed the memory at around the 95% point, so reducing to 2 should fix the problem, but then my crawler becomes too slow.
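For context, the concurrency is just set in settings.py; a minimal sketch (the values are illustrative, and the MEMUSAGE_LIMIT_MB line is my assumption of how the 1GB cap could be made explicit, not something Scrapinghub provides):

    # settings.py -- sketch, values illustrative
    CONCURRENT_REQUESTS = 4              # 16 works locally, but not in the 1GB container
    CONCURRENT_REQUESTS_PER_DOMAIN = 4
    DOWNLOAD_DELAY = 0.05

    # The MemoryUsage extension reports memusage/startup and memusage/max, and
    # stops the job with finish_reason 'memusage_exceeded' when over the limit.
    MEMUSAGE_LIMIT_MB = 950              # assumed value, for illustration only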

The alternative solution is paying for 2 VMs to increase the RAM, but I have a feeling that the way I've set up my crawler is causing memory leaks.

For this example, the project will scrape an online retailer. When run locally, my memusage/max is 2.7GB with CONCURRENT_REQUESTS=16.

I will now run through my Scrapy structure:

  1. Get the total number of pages to scrape
  2. Loop through all these pages using: www.example.com/page={page_num}
  3. On each page, gather information on 48 products
  4. For each of these products, go to their page and get some information
  5. Using that info, call an API directly, for each product
  6. Save these using an item pipeline (locally I write to csv, but not on scrapinghub)
    class Pipeline(object):
        def process_item(self, item, spider):
            item['stock_jsons'] = json.loads(item['stock_jsons'])['subProducts']
            return item
    class mainItem(scrapy.Item):
        date = scrapy.Field()
        url = scrapy.Field()
        active_col_num = scrapy.Field()
        all_col_nums = scrapy.Field()
        old_price = scrapy.Field()
        current_price = scrapy.Field()
        image_urls_full = scrapy.Field()
        stock_jsons = scrapy.Field()
    
    class URLItem(scrapy.Item):
        urls = scrapy.Field()
    class ProductSpider(scrapy.Spider):
        name = 'product'
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            page = requests.get('www.example.com', headers=headers)
            self.num_pages = ...  # gets the number of pages to search
    
        def start_requests(self):
            for page in tqdm(range(1, self.num_pages+1)):
                url = f'www.example.com/page={page}'
                yield scrapy.Request(url = url, headers=headers, callback = self.prod_url)

        def prod_url(self, response):
            urls_item = URLItem()
            extracted_urls = response.xpath(####).extract() # Gets URLs to follow
            urls_item['urls'] = [# Get a list of urls]
            for url in urls_item['urls']:
                    yield scrapy.Request(url = url, headers=headers, callback = self.parse)

        def parse(self, response):  # Parse the main product page
            item = mainItem()
            item['date'] = DATETIME_VAR
            item['url'] = response.url
            item['active_col_num'] = XXX
            item['all_col_nums'] = XXX
            item['old_price'] = XXX
            item['current_price'] = XXX
            item['image_urls_full'] = XXX

            try:
                new_url = 'www.exampleAPI.com/' + item['active_col_num']
            except TypeError:
                new_url = 'www.exampleAPI.com/{dummy_number}'
        
            yield scrapy.Request(new_url, callback=self.parse_attr, meta={'item': item})


        def parse_attr(self, response):
        ## This calls an API Step 5
            item = response.meta['item']
            item['stock_jsons'] = response.text
            yield item

What I've tried so far

Following the memory-leak docs, I checked the live object references with trackref during a local run:

HtmlResponse                       31   oldest: 3s ago
mainItem                            18   oldest: 5s ago
ProductSpider                       1   oldest: 3321s ago
Request                            43   oldest: 105s ago
Selector                           16   oldest: 3s ago
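
A minimal way to get this output, assuming the default telnet console settings:

    # While the spider is running locally:
    #   telnet localhost 6023     (the password is printed in the log: "Telnet Password: ...")
    #   >>> prefs()
    # The same information can be printed from code, e.g. from a debugging extension:
    from scrapy.utils.trackref import print_live_refs
    print_live_refs()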

QUESTIONS

Please let me know if any more information is required.

Additional Information Requested

Please let me know if the output from Scrapinghub is required. I think it should be the same, except that the finish reason message is memusage_exceeded.

1. Log lines from the start (from INFO: Scrapy xxx started to Spider opened).

2020-09-17 11:54:11 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: PLT)
2020-09-17 11:54:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.1, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-09-17 11:54:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'PLT',
 'CONCURRENT_REQUESTS': 14,
 'CONCURRENT_REQUESTS_PER_DOMAIN': 14,
 'DOWNLOAD_DELAY': 0.05,
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'PLT.spiders',
 'SPIDER_MODULES': ['PLT.spiders']}
2020-09-17 11:54:11 [scrapy.extensions.telnet] INFO: Telnet Password: # blocked
2020-09-17 11:54:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
=======
17_Sep_2020_11_54_12
=======
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled item pipelines:
['PLT.pipelines.PltPipeline']
2020-09-17 11:54:12 [scrapy.core.engine] INFO: Spider opened

2. Ending log lines (from INFO: Dumping Scrapy stats to the end).

2020-09-17 11:16:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15842233,
 'downloader/request_count': 42031,
 'downloader/request_method_count/GET': 42031,
 'downloader/response_bytes': 1108804016,
 'downloader/response_count': 42031,
 'downloader/response_status_count/200': 41999,
 'downloader/response_status_count/403': 9,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/504': 22,
 'dupefilter/filtered': 110,
 'elapsed_time_seconds': 3325.171148,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 9, 17, 10, 16, 43, 258108),
 'httperror/response_ignored_count': 10,
 'httperror/response_ignored_status_count/403': 9,
 'httperror/response_ignored_status_count/404': 1,
 'item_scraped_count': 20769,
 'log_count/INFO': 75,
 'memusage/max': 2707484672,
 'memusage/startup': 100196352,
 'request_depth_max': 2,
 'response_received_count': 42009,
 'retry/count': 22,
 'retry/reason_count/504 Gateway Time-out': 22,
 'scheduler/dequeued': 42031,
 'scheduler/dequeued/memory': 42031,
 'scheduler/enqueued': 42031,
 'scheduler/enqueued/memory': 42031,
 'start_time': datetime.datetime(2020, 9, 17, 9, 21, 18, 86960)}
2020-09-17 11:16:43 [scrapy.core.engine] INFO: Spider closed (finished)
  3. What value is used for the self.num_pages variable?

The site I am scraping has around 20k products and shows 48 per page. So it goes to the site, sees 20,103 products, then divides by 48 (and applies math.ceil) to get the number of pages.
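
A minimal sketch of that calculation (the product count is just the example number above; the real spider parses it out of the landing page):

    import math

    PRODUCTS_PER_PAGE = 48

    def num_pages_from_total(total_products: int) -> int:
        # e.g. 20103 products / 48 per page -> 419 pages
        return math.ceil(total_products / PRODUCTS_PER_PAGE)

    print(num_pages_from_total(20103))  # 419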

  4. Adding the output from Scrapinghub after updating the middleware order:
downloader/request_bytes    2945159
downloader/request_count    16518
downloader/request_method_count/GET 16518
downloader/response_bytes   3366280619
downloader/response_count   16516
downloader/response_status_count/200    16513
downloader/response_status_count/404    3
dupefilter/filtered 7
elapsed_time_seconds    4805.867308
finish_reason   memusage_exceeded
finish_time 1600567332341
httperror/response_ignored_count    3
httperror/response_ignored_status_count/404 3
item_scraped_count  8156
log_count/ERROR 1
log_count/INFO  94
memusage/limit_reached  1
memusage/max    1074937856
memusage/startup    109555712
request_depth_max   2
response_received_count 16516
retry/count 2
retry/reason_count/504 Gateway Time-out 2
scheduler/dequeued  16518
scheduler/dequeued/disk 16518
scheduler/enqueued  17280
scheduler/enqueued/disk 17280
start_time  1600562526474

Solution

  • 1. Scheduler queue / active requests

    With self.num_pages = 418, these code lines will create 418 request objects (and ask the OS to allocate memory to hold them) and put them into the scheduler queue:

    for page in tqdm(range(1, self.num_pages+1)):
        url = f'www.example.com/page={page}'
        yield scrapy.Request(url=url, headers=headers, callback=self.prod_url)
    

    each "page" request generate 48 new requests.
    each "product page" request generate 1 "api_call" request
    each "api_call" request returns item object.
    As all requests have equal priority - on the worst case application will require memory to hold ~20000 request/response objects in RAM at once.

    To avoid this, a priority parameter can be added to scrapy.Request, and you will probably need to change the spider configuration to something like this:

        def start_requests(self):
            yield scrapy.Request(url = 'www.example.com/page=1', headers=headers, callback = self.prod_url)
    
        def prod_url(self, response):
            #get number of page
            next_page_number = int(response.url.split("/page=")[-1]) + 1
            #...
            for url in urls_item['urls']:
                    yield scrapy.Request(url = url, headers=headers, callback = self.parse, priority = 1)
    
            if next_page_number <= self.num_pages:
                yield scrapy.Request(url=f"www.example.com/page={next_page_number}", headers=headers, callback=self.prod_url)
    
        def parse(self, response):  # Parse the main product page
            #....
            try:
                new_url = 'www.exampleAPI.com/' + item['active_col_num']
            except TypeError:
                new_url = 'www.exampleAPI.com/{dummy_number}'
        
            yield scrapy.Request(new_url, callback=self.parse_attr, meta={'item': item}, priority = 2)
    

    With this spider configuration, the spider will process the product pages of the next listing page only when it finishes processing the products from previous pages, so your application will not build up a long queue of requests/responses.

    2. HTTP compression

    A lot of websites compress their HTML to reduce traffic load.
    For example, Amazon compresses its product pages using gzip.
    The average size of a compressed Amazon product page is ~250KB,
    while the uncompressed HTML can exceed ~1.5MB.

    If your website uses compression and the uncompressed response sizes are similar to Amazon product pages, the app will need a lot of memory to hold both the compressed and the uncompressed response bodies. The DownloaderStats middleware that populates the downloader/response_bytes stat will not count the size of uncompressed responses, because its process_response method is called before the process_response method of HttpCompressionMiddleware.

    To check this, you need to change the order of the DownloaderStats middleware by adding this to the settings:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.stats.DownloaderStats': 50,
    }
    

    In this case the downloader/request_bytes stat will be reduced, since it will no longer count the size of some headers populated by middlewares, and the downloader/response_bytes stat will be greatly increased if the website uses compression.
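
    If you want to see both numbers side by side instead of reordering the built-in stat, a small custom downloader middleware can record decompressed body sizes in a separate stat. This is a sketch, not part of the original project; the module path assumes the PLT project layout from the logs, and the stat name is made up:

        # PLT/middlewares.py -- sketch: record decompressed response sizes
        class DecompressedSizeMiddleware:
            def process_response(self, request, response, spider):
                # With order 50 this runs after HttpCompressionMiddleware (590)
                # on the response path, so response.body is already decompressed.
                spider.crawler.stats.inc_value(
                    'custom/decompressed_response_bytes', len(response.body))
                return response

        # settings.py
        DOWNLOADER_MIDDLEWARES = {
            'PLT.middlewares.DecompressedSizeMiddleware': 50,
        }

    Comparing custom/decompressed_response_bytes with downloader/response_bytes at the end of a run shows how much extra memory decompression adds per response.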