python, caching, scrapy, alpine-linux, tmpfs

How to best handle Scrapy cache at 'OSError: [Errno 28] No space left on device' failure?


What's the advised action to take should Scrapy fail with the exception:

OSError: [Errno 28] No space left on device

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
    spider=spider)
  File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 86, in process_response
    self._cache_response(spider, response, request, cachedresponse)
  File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 106, in _cache_response
    self.storage.store_response(spider, request, response)
  File "/usr/lib/python3.6/site-packages/scrapy/extensions/httpcache.py", line 317, in store_response
    f.write(to_bytes(repr(metadata)))
OSError: [Errno 28] No space left on device

In this specific case, a ramdisk/tmpfs limited to 128 MB was used as the cache disk, with the Scrapy setting HTTPCACHE_EXPIRATION_SECS = 300 on httpcache.FilesystemCacheStorage:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 300
HTTPCACHE_DIR = '/tmp/ramdisk/scrapycache' # (tmpfs on /tmp/ramdisk type tmpfs (rw,relatime,size=131072k))
HTTPCACHE_IGNORE_HTTP_CODES = [400, 401, 403, 404, 500, 504]
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

I might be wrong, but I get the impression that Scrapy's FilesystemCacheStorage might not be managing its cache size (storage limitations) all that well.

Might it be better to use LevelDB?


Solution

  • You are right. Nothing is deleted after the cache expires. The HTTPCACHE_EXPIRATION_SECS setting only decides whether to serve the cached response or re-download the page, and this applies to every HTTPCACHE_STORAGE backend.

    If your cache data is very large, you should consider using a database for storage instead of the local filesystem. Alternatively, you can extend the storage backend with a LoopingCall task that continuously deletes expired cache entries, as in the sketch below.
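
    Here is a minimal sketch of that idea, assuming the directory layout used by Scrapy's FilesystemCacheStorage (one directory per cached request containing a pickled_meta file). The class name, the 60-second pruning interval, and the use of the pickled_meta file's mtime to decide expiry are my own choices, not part of Scrapy's API; adjust to your Scrapy version.

    import os
    import shutil
    import time

    from twisted.internet import task
    from scrapy.extensions.httpcache import FilesystemCacheStorage


    class PruningFilesystemCacheStorage(FilesystemCacheStorage):
        """FilesystemCacheStorage that periodically deletes expired entries.

        Entries live in <cachedir>/<spider.name>/<hash[:2]>/<hash>/; a
        directory is treated as expired once its pickled_meta file is older
        than HTTPCACHE_EXPIRATION_SECS.
        """

        PRUNE_INTERVAL = 60  # seconds between pruning passes (arbitrary)

        def open_spider(self, spider):
            super().open_spider(spider)
            # Schedule a periodic pruning pass on the Twisted reactor.
            self._prune_task = task.LoopingCall(self._prune_expired, spider)
            self._prune_task.start(self.PRUNE_INTERVAL, now=False)

        def close_spider(self, spider):
            if self._prune_task.running:
                self._prune_task.stop()
            super().close_spider(spider)

        def _prune_expired(self, spider):
            if self.expiration_secs <= 0:  # 0 means "never expire"
                return
            now = time.time()
            spider_dir = os.path.join(self.cachedir, spider.name)
            if not os.path.isdir(spider_dir):
                return
            for bucket in os.listdir(spider_dir):
                bucket_path = os.path.join(spider_dir, bucket)
                for entry in os.listdir(bucket_path):
                    entry_path = os.path.join(bucket_path, entry)
                    meta = os.path.join(entry_path, 'pickled_meta')
                    try:
                        if now - os.path.getmtime(meta) > self.expiration_secs:
                            # Remove the whole cache entry directory.
                            shutil.rmtree(entry_path, ignore_errors=True)
                    except OSError:
                        # Entry removed or still being written; skip it.
                        pass

    To use it, point HTTPCACHE_STORAGE at the subclass instead of the stock backend, e.g. HTTPCACHE_STORAGE = 'myproject.httpcache.PruningFilesystemCacheStorage' (the module path is hypothetical). Note that each pruning pass runs in the reactor thread, so with a very large cache you may want to prune in smaller batches.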

    Why does Scrapy keep around data that's being ignored?

    I think there are two points: