
Advanced use of Scrapy middleware


I want to develop several middlewares to make sure my spiders can parse every website. This is the workflow I have in mind:

I'll create a custom middleware whose process_request function tries each of these 5 connection methods in turn. What I can't find is how to save the type of connection that worked (for example, if TOR does not work but a direct connection does, I want to reuse that setting for all my other scraping of the same website). How can I save this setting?

On a related note, I have a pipeline that downloads the images of items. Is there a way to make it use this middleware as well (ideally reusing the saved settings)?

Thanks in advance for your help.


Solution

  • I think you could use the retry middleware as a starting point:

    1. You could use request.meta["proxy_method"] to keep track of which connection method you are using

    2. You could reuse request.meta["retry_times"] to track how many times you have retried a given method, and reset the value to zero when you change the proxy method.

    3. You could use request.meta["proxy"] to use the proxy server you want via the existing HTTP proxy middleware. You may want to tweak the middleware ordering so that the retry middleware runs before the proxy middleware (in Scrapy's default settings, RetryMiddleware sits at priority 550 and HttpProxyMiddleware at 750).
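The three points above can be sketched as follows. This is a minimal illustration, not a drop-in Scrapy middleware: to stay self-contained it avoids importing Scrapy and works on any request object that has a `url` string and a `meta` dict (Scrapy's `Request` satisfies both), and it returns the strings `"ok"`, `"retry"`, and `"give up"` where a real middleware would return the response or a `request.replace(dont_filter=True)` copy. The class name, the fallback order in `PROXY_METHODS`, and the proxy addresses are all illustrative assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical fallback order: try Tor (via a local HTTP proxy) first,
# then fall back to a direct connection. Adjust to your own list.
PROXY_METHODS = ["tor", "direct"]
PROXIES = {"tor": "http://127.0.0.1:8118", "direct": None}
MAX_RETRY_TIMES = 2
RETRY_HTTP_CODES = {500, 502, 503, 504, 522, 524, 408, 429}


class ProxyFallbackMiddleware:
    def __init__(self):
        # Per-domain cache of the method that last worked, so later
        # requests to the same website (including image downloads queued
        # by a pipeline) skip straight to it.
        self.working_method = {}

    def process_request(self, request, spider=None):
        domain = urlsplit(request.url).netloc
        # Point 1: remember which connection method this request uses,
        # preferring whatever already worked for this domain.
        method = request.meta.setdefault(
            "proxy_method", self.working_method.get(domain, PROXY_METHODS[0])
        )
        proxy = PROXIES[method]
        if proxy is not None:
            # Point 3: the built-in HttpProxyMiddleware reads this key.
            request.meta["proxy"] = proxy
        else:
            request.meta.pop("proxy", None)

    def process_response(self, request, status, spider=None):
        domain = urlsplit(request.url).netloc
        method = request.meta["proxy_method"]
        if status not in RETRY_HTTP_CODES:
            self.working_method[domain] = method  # remember the winner
            return "ok"
        # Point 2: count retries per method via retry_times...
        retries = request.meta.get("retry_times", 0) + 1
        if retries <= MAX_RETRY_TIMES:
            request.meta["retry_times"] = retries
            return "retry"
        # ...and reset the counter when switching to the next method.
        nxt = PROXY_METHODS.index(method) + 1
        if nxt < len(PROXY_METHODS):
            request.meta["proxy_method"] = PROXY_METHODS[nxt]
            request.meta["retry_times"] = 0
            return "retry"
        return "give up"
```

Because the per-domain cache lives on the middleware instance, any request that passes through the downloader, including those generated by an images pipeline, benefits from the saved setting without extra wiring.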