What is the recommended course of action when Scrapy fails with the following exception:
OSError: [Errno 28] No space left on device
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 53, in process_response
spider=spider)
File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 86, in process_response
self._cache_response(spider, response, request, cachedresponse)
File "/usr/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpcache.py", line 106, in _cache_response
self.storage.store_response(spider, request, response)
File "/usr/lib/python3.6/site-packages/scrapy/extensions/httpcache.py", line 317, in store_response
f.write(to_bytes(repr(metadata)))
OSError: [Errno 28] No space left on device
In this specific case, a ramdisk/tmpfs limited to 128 MB was used as the cache disk, with Scrapy configured to use httpcache.FilesystemCacheStorage and HTTPCACHE_EXPIRATION_SECS = 300:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 300
HTTPCACHE_DIR = '/tmp/ramdisk/scrapycache' # (tmpfs on /tmp/ramdisk type tmpfs (rw,relatime,size=131072k))
HTTPCACHE_IGNORE_HTTP_CODES = ['400','401','403','404','500','504']
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
I might be wrong, but I get the impression that Scrapy's FilesystemCacheStorage is not managing its cache size (storage limitations) all that well.
Might it be better to use LevelDB?
You are right. Nothing is deleted after the cache expires. HTTPCACHE_EXPIRATION_SECS only decides whether to use the cached response or re-download it, and this holds for every HTTPCACHE_STORAGE backend.
If your cache data is very large, consider storing it in a database instead of the local filesystem, or extend the backend storage with a LoopingCall task that continuously deletes expired cache entries.
Why does Scrapy keep data around that is being ignored?
I think there are two points:
HTTPCACHE_EXPIRATION_SECS controls whether to use the cached response or re-download; it only guarantees that you never use an expired cache entry. Different spiders may set different expiration_secs, so deleting entries on behalf of one spider could invalidate the cache for another.
Deleting expired cache would require a LoopingCall task that continuously checks for expired entries, which would make the Scrapy extension more complex, and that is not what Scrapy aims for.