Background - TLDR: I have a memory leak in my project
I've spent a few days looking through the Scrapy memory leak docs and can't find the problem. I'm developing a medium-sized Scrapy project, ~40k requests per day.
I am hosting this using Scrapinghub's scheduled runs.
On Scrapinghub, for $9 per month, you are essentially given 1 VM with 1 GB of RAM to run your crawlers.
I developed the crawler locally and uploaded it to Scrapinghub; the only problem is that towards the end of the run, I exceed the memory limit.
Locally, setting CONCURRENT_REQUESTS=16 works fine, but on Scrapinghub it exceeds the memory at about the 50% point. With CONCURRENT_REQUESTS=4 I exceed the memory at about the 95% point, so reducing it to 2 should fix the problem, but then my crawler becomes too slow.
The alternative solution is paying for 2 VMs to increase the RAM, but I have a feeling that the way I've set up my crawler is causing memory leaks.
For this example, the project will scrape an online retailer.
When run locally, my memusage/max is 2.7 GB with CONCURRENT_REQUESTS=16.
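For reference, this is a minimal sketch of the settings relevant to the question; using MEMUSAGE_LIMIT_MB to mimic the 1 GB cap locally is my own assumption (Scrapinghub enforces its own limit):

# settings.py (sketch, not the full project settings)
CONCURRENT_REQUESTS = 16        # works locally, exceeds memory on Scrapinghub at ~50%

# Assumption: reproduce the 1 GB Scrapinghub cap locally via Scrapy's MemoryUsage extension
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 1024
MEMUSAGE_WARNING_MB = 900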
I will now run through my Scrapy structure:
import json

class Pipeline(object):
    def process_item(self, item, spider):
        # Parse the raw JSON string returned by the API and keep only the sub-products
        item['stock_jsons'] = json.loads(item['stock_jsons'])['subProducts']
        return item
import scrapy

class mainItem(scrapy.Item):
    date = scrapy.Field()
    url = scrapy.Field()
    active_col_num = scrapy.Field()
    all_col_nums = scrapy.Field()
    old_price = scrapy.Field()
    current_price = scrapy.Field()
    image_urls_full = scrapy.Field()
    stock_jsons = scrapy.Field()

class URLItem(scrapy.Item):
    urls = scrapy.Field()
import requests
import scrapy
from tqdm import tqdm

class ProductSpider(scrapy.Spider):
    name = 'product'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Fetch the listing page once to work out how many pages to crawl
        page = requests.get('www.example.com', headers=headers)
        self.num_pages = ...  # gets the number of pages to search

    def start_requests(self):
        for page in tqdm(range(1, self.num_pages + 1)):
            url = f'www.example.com/page={page}'
            yield scrapy.Request(url=url, headers=headers, callback=self.prod_url)

    def prod_url(self, response):
        urls_item = URLItem()
        extracted_urls = response.xpath(...).extract()  # Gets URLs to follow
        urls_item['urls'] = [...]  # Get a list of urls
        for url in urls_item['urls']:
            yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):  # Parse the main product page
        item = mainItem()
        item['date'] = DATETIME_VAR
        item['url'] = response.url
        item['active_col_num'] = XXX
        item['all_col_nums'] = XXX
        item['old_price'] = XXX
        item['current_price'] = XXX
        item['image_urls_full'] = XXX
        try:
            new_url = 'www.exampleAPI.com/' + item['active_col_num']
        except TypeError:
            new_url = 'www.exampleAPI.com/{dummy_number}'
        yield scrapy.Request(new_url, callback=self.parse_attr, meta={'item': item})

    def parse_attr(self, response):
        ## This calls an API Step 5
        item = response.meta['item']
        item['stock_jsons'] = response.text
        yield item
What I've tried so far?
psutil, which hasn't helped much.
trackref.print_live_refs() returns the following at the end:
HtmlResponse 31 oldest: 3s ago
mainItem 18 oldest: 5s ago
ProductSpider 1 oldest: 3321s ago
Request 43 oldest: 105s ago
Selector 16 oldest: 3s ago
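For completeness, this is roughly how I inspect live references (a minimal sketch; the same functions are also available from the telnet console via prefs()):

from scrapy.utils.trackref import print_live_refs, get_oldest

# Print counts and the age of the oldest live object for each tracked class
print_live_refs()

# Grab the oldest live response to see which URL it belongs to
oldest_response = get_oldest('HtmlResponse')
if oldest_response is not None:
    print(oldest_response.url)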
QUESTIONS
Please let me know if there is any more information required
Additional Information Requested
Please let me know if the output from Scrapinghub is required. I think it should be the same, except that the finish reason message is memory exceeded.
1. Log lines from the start (from INFO: Scrapy xxx started to Spider opened).
2020-09-17 11:54:11 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: PLT)
2020-09-17 11:54:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-09-17 11:54:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'PLT',
'CONCURRENT_REQUESTS': 14,
'CONCURRENT_REQUESTS_PER_DOMAIN': 14,
'DOWNLOAD_DELAY': 0.05,
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'PLT.spiders',
'SPIDER_MODULES': ['PLT.spiders']}
2020-09-17 11:54:11 [scrapy.extensions.telnet] INFO: Telnet Password: # blocked
2020-09-17 11:54:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
=======
17_Sep_2020_11_54_12
=======
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled item pipelines:
['PLT.pipelines.PltPipeline']
2020-09-17 11:54:12 [scrapy.core.engine] INFO: Spider opened
2. Ending log lines (from INFO: Dumping Scrapy stats to the end).
2020-09-17 11:16:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15842233,
'downloader/request_count': 42031,
'downloader/request_method_count/GET': 42031,
'downloader/response_bytes': 1108804016,
'downloader/response_count': 42031,
'downloader/response_status_count/200': 41999,
'downloader/response_status_count/403': 9,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/504': 22,
'dupefilter/filtered': 110,
'elapsed_time_seconds': 3325.171148,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 17, 10, 16, 43, 258108),
'httperror/response_ignored_count': 10,
'httperror/response_ignored_status_count/403': 9,
'httperror/response_ignored_status_count/404': 1,
'item_scraped_count': 20769,
'log_count/INFO': 75,
'memusage/max': 2707484672,
'memusage/startup': 100196352,
'request_depth_max': 2,
'response_received_count': 42009,
'retry/count': 22,
'retry/reason_count/504 Gateway Time-out': 22,
'scheduler/dequeued': 42031,
'scheduler/dequeued/memory': 42031,
'scheduler/enqueued': 42031,
'scheduler/enqueued/memory': 42031,
'start_time': datetime.datetime(2020, 9, 17, 9, 21, 18, 86960)}
2020-09-17 11:16:43 [scrapy.core.engine] INFO: Spider closed (finished)
The site I am scraping has around 20k products and shows 48 per page. So the spider goes to the site, sees 20103 products, then divides by 48 (and applies math.ceil) to get the number of pages.
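A minimal sketch of that calculation, using the numbers above:

import math

total_products = 20103        # product count shown on the listing page
products_per_page = 48
num_pages = math.ceil(total_products / products_per_page)   # number of listing pages to request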
Stats from the Scrapinghub run (finish reason: memusage_exceeded):
downloader/request_bytes 2945159
downloader/request_count 16518
downloader/request_method_count/GET 16518
downloader/response_bytes 3366280619
downloader/response_count 16516
downloader/response_status_count/200 16513
downloader/response_status_count/404 3
dupefilter/filtered 7
elapsed_time_seconds 4805.867308
finish_reason memusage_exceeded
finish_time 1600567332341
httperror/response_ignored_count 3
httperror/response_ignored_status_count/404 3
item_scraped_count 8156
log_count/ERROR 1
log_count/INFO 94
memusage/limit_reached 1
memusage/max 1074937856
memusage/startup 109555712
request_depth_max 2
response_received_count 16516
retry/count 2
retry/reason_count/504 Gateway Time-out 2
scheduler/dequeued 16518
scheduler/dequeued/disk 16518
scheduler/enqueued 17280
scheduler/enqueued/disk 17280
start_time 1600562526474
1. Scheduler queue / active requests
With self.num_pages = 418, these code lines will create 418 request objects (which also means asking the OS to allocate memory to hold 418 objects) and put them into the scheduler queue:
for page in tqdm(range(1, self.num_pages + 1)):
    url = f'www.example.com/page={page}'
    yield scrapy.Request(url=url, headers=headers, callback=self.prod_url)
each "page" request generate 48 new requests.
each "product page" request generate 1 "api_call" request
each "api_call" request returns item object.
As all requests have equal priority - on the worst case application will require memory to hold ~20000 request/response objects in RAM at once.
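A rough estimate of that worst case, using the numbers above (this is an upper bound, not a measurement):

num_pages = 418
products_per_page = 48

product_requests = num_pages * products_per_page   # ~20,000 product-page requests
api_requests = product_requests                    # plus one API request per product
print(product_requests, api_requests)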
To avoid this, a priority parameter can be added to scrapy.Request, and you will probably need to change the spider configuration to something like this:
def start_requests(self):
    yield scrapy.Request(url='www.example.com/page=1', headers=headers, callback=self.prod_url)

def prod_url(self, response):
    # Get the number of the next listing page from the current URL
    next_page_number = int(response.url.split("/page=")[-1]) + 1
    #...
    for url in urls_item['urls']:
        yield scrapy.Request(url=url, headers=headers, callback=self.parse, priority=1)
    if next_page_number <= self.num_pages:
        yield scrapy.Request(url=f"www.example.com/page={next_page_number}",
                             headers=headers, callback=self.prod_url)

def parse(self, response):  # Parse the main product page
    #....
    try:
        new_url = 'www.exampleAPI.com/' + item['active_col_num']
    except TypeError:
        new_url = 'www.exampleAPI.com/{dummy_number}'
    yield scrapy.Request(new_url, callback=self.parse_attr, meta={'item': item}, priority=2)
With this spider configuration, the spider will only process the product pages of the next listing page once it has finished processing the products from the previous pages, so your application will not build up a long queue of requests/responses.
2. HTTP compression
A lot of websites compress their HTML to reduce traffic.
For example, Amazon compresses its product pages using gzip.
The average size of the compressed HTML of an Amazon product page is ~250 KB, while the uncompressed HTML can exceed ~1.5 MB.
If your website uses compression and the uncompressed responses are similar in size to Amazon product pages, the app will need a lot of memory to hold both the compressed and the uncompressed response bodies.
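A quick way to check the ratio for one of your own pages (a minimal sketch; the URL is a placeholder):

import gzip
import requests

# Placeholder URL - replace with one of the product pages you crawl
resp = requests.get('https://www.example.com/some-product')

uncompressed = resp.content                    # requests decompresses gzip transparently
recompressed = gzip.compress(uncompressed)     # approximates the size on the wire
print(len(recompressed), len(uncompressed))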
The DownloaderStats middleware that populates the downloader/response_bytes stat will not count the size of the uncompressed responses, because its process_response method is called before the process_response method of HttpCompressionMiddleware.
To check this, you will need to change the priority of the DownloaderStats middleware by adding this to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 50,
}
In this case:
the downloader/request_bytes stat will be reduced, as it will not count the size of some headers populated by other middlewares;
the downloader/response_bytes stat will be greatly increased if the website uses compression.