
scrapinghub: Difference between DeltaFetch and HTTPCACHE_ENABLED


I'm struggling to understand the difference between DeltaFetch and HttpCacheMiddleware. Don't both have the goal that I only scrape pages I haven't requested before?


Solution

  • They have very different purposes:

    HttpCacheMiddleware: every time a new request is made, Scrapy fetches the response and saves it locally. Every time the same request is made again, the response is served from disk (a local cache).

    This is very useful for development, when you will probably fetch the same page many times while you get your script to parse and save the data you want. With this feature you only fetch the page from the remote/origin server once.

    However, if the data changes, you will be working with a stale copy (which is usually fine for development purposes).

    HttpCacheMiddleware docs
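
    A minimal development setup could look like this in settings.py (the setting names are Scrapy's own; the values are just common development choices):

    ```python
    # settings.py -- enable Scrapy's built-in HTTP cache for development
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"      # stored under the project's .scrapy/ directory
    HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire
    HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
    ```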

    DeltaFetch keeps a fingerprint of every request that has already been fetched and turned into an Item (or dict). If the spider outputs a request seen before, it will be ignored.

    This is useful in production, when a site has multiple links to the same content; it avoids requesting items that were already scraped.
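
    Assuming the scrapy-deltafetch package (pip install scrapy-deltafetch), enabling it takes a few lines in settings.py; the middleware order value of 100 is just a common choice:

    ```python
    # settings.py -- enable scrapy-deltafetch
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True
    DELTAFETCH_RESET = False  # set to True to wipe the fingerprint DB and refetch everything
    ```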

    DeltaFetch assumes there's a 1-to-1 relation between requests/links and items. So if you're crawling multiple items from the same request this can be problematic, as the whole request will be skipped once it has produced its first item (a somewhat convoluted corner case; see the sketch after the docs link below for a way to control the key).

    DeltaFetch docs
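
    If the default keying doesn't fit (as in the corner case above), scrapy-deltafetch lets you set your own key per request through the deltafetch_key meta entry. A sketch with a made-up site, spider name and selectors:

    ```python
    import scrapy

    class ProductsSpider(scrapy.Spider):
        name = "products"  # hypothetical spider
        start_urls = ["https://example.com/catalog"]

        def parse(self, response):
            for href in response.css("a.product::attr(href)").getall():
                # Key DeltaFetch on the link itself rather than the request
                # fingerprint, so we control what "already seen" means.
                yield response.follow(
                    href,
                    callback=self.parse_product,
                    meta={"deltafetch_key": href},
                )

        def parse_product(self, response):
            yield {"url": response.url, "title": response.css("h1::text").get()}
    ```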

    By default, Scrapy will not fetch duplicate requests within a single crawl. You can customize what a "duplicate request" means; for instance, the query part of a URL could be ignored when comparing requests (see the sketch after the docs link below).

    DUPEFILTER_CLASS docs
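
    For instance, a dupefilter that ignores query strings can subclass the default RFPDupeFilter. A sketch assuming a hypothetical myproject/dupefilters.py (url_query_cleaner comes from w3lib, a Scrapy dependency):

    ```python
    # myproject/dupefilters.py -- treat URLs that differ only in their
    # query string as duplicates
    from scrapy.dupefilters import RFPDupeFilter
    from w3lib.url import url_query_cleaner

    class QueryStrippingDupeFilter(RFPDupeFilter):
        def request_fingerprint(self, request):
            # Fingerprint a copy of the request with the query removed,
            # so /page?utm_source=x and /page hash to the same value.
            stripped = request.replace(url=url_query_cleaner(request.url))
            return super().request_fingerprint(stripped)
    ```

    Then point Scrapy at it in settings.py:

    ```python
    DUPEFILTER_CLASS = "myproject.dupefilters.QueryStrippingDupeFilter"
    ```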