python, caching, scrapy, alpine-linux, tmpfs

How to best handle Scrapy cache at 'OSError: [Errno 28] No space left on device' failure?


What's the advised action to take should Scrapy fail with the exception:

OSError: [Errno 28] No space left on device

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
    spider=spider)
  File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 86, in process_response
    self._cache_response(spider, response, request, cachedresponse)
  File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 106, in _cache_response
    self.storage.store_response(spider, request, response)
  File "/usr/lib/python3.6/site-packages/scrapy/extensions/httpcache.py", line 317, in store_response
    f.write(to_bytes(repr(metadata)))
OSError: [Errno 28] No space left on device

In this specific case, a ramdisk/tmpfs limited to 128 MB was used as the cache disk, with the Scrapy setting HTTPCACHE_EXPIRATION_SECS = 300 on httpcache.FilesystemCacheStorage:

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 300
HTTPCACHE_DIR = '/tmp/ramdisk/scrapycache' # (tmpfs on /tmp/ramdisk type tmpfs (rw,relatime,size=131072k))
HTTPCACHE_IGNORE_HTTP_CODES = [400, 401, 403, 404, 500, 504]
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

I might be wrong, but I get the impression that Scrapy's FilesystemCacheStorage might not be managing its cache size (storage limitations) all that well.

Might it be better to use LevelDB?


Solution

  • You are right. Nothing is deleted after the cache expires. The HTTPCACHE_EXPIRATION_SECS setting only decides whether to serve the cached response or re-download the page, and this applies to every HTTPCACHE_STORAGE backend.

    If your cache data is very large, you should consider using a database for storage instead of the local filesystem. Alternatively, you can extend the storage backend with a LoopingCall task that continuously deletes expired cache entries, as in the sketch below.
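
    Here is a minimal sketch of that idea, assuming the directory layout used by Scrapy's FilesystemCacheStorage (one directory per cached request containing a pickled_meta file). The class name, the 60-second pruning interval, and the use of the pickled_meta file's mtime to decide expiry are my own choices, not part of Scrapy's API; adjust to your Scrapy version.

    import os
    import shutil
    import time

    from twisted.internet import task
    from scrapy.extensions.httpcache import FilesystemCacheStorage


    class PruningFilesystemCacheStorage(FilesystemCacheStorage):
        """FilesystemCacheStorage that periodically deletes expired entries.

        Entries live in <cachedir>/<spider.name>/<hash[:2]>/<hash>/; a
        directory is treated as expired once its pickled_meta file is older
        than HTTPCACHE_EXPIRATION_SECS.
        """

        PRUNE_INTERVAL = 60  # seconds between pruning passes (arbitrary)

        def open_spider(self, spider):
            super().open_spider(spider)
            # Schedule a periodic pruning pass on the Twisted reactor.
            self._prune_task = task.LoopingCall(self._prune_expired, spider)
            self._prune_task.start(self.PRUNE_INTERVAL, now=False)

        def close_spider(self, spider):
            if self._prune_task.running:
                self._prune_task.stop()
            super().close_spider(spider)

        def _prune_expired(self, spider):
            if self.expiration_secs <= 0:  # 0 means "never expire"
                return
            now = time.time()
            spider_dir = os.path.join(self.cachedir, spider.name)
            if not os.path.isdir(spider_dir):
                return
            for bucket in os.listdir(spider_dir):
                bucket_path = os.path.join(spider_dir, bucket)
                for entry in os.listdir(bucket_path):
                    entry_path = os.path.join(bucket_path, entry)
                    meta = os.path.join(entry_path, 'pickled_meta')
                    try:
                        if now - os.path.getmtime(meta) > self.expiration_secs:
                            # Remove the whole cache entry directory.
                            shutil.rmtree(entry_path, ignore_errors=True)
                    except OSError:
                        # Entry removed or still being written; skip it.
                        pass

    To use it, point HTTPCACHE_STORAGE at the subclass instead of the stock backend, e.g. HTTPCACHE_STORAGE = 'myproject.httpcache.PruningFilesystemCacheStorage' (the module path is hypothetical). Note that each pruning pass runs in the reactor thread, so with a very large cache you may want to prune in smaller batches.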

    Why does Scrapy keep around data that's being ignored?

    I think there are two points: