
Advanced use of Scrapy middleware


I want to develop several middlewares to make sure my spiders can parse every website. This is the workflow I have in mind:

I'll create a custom middleware whose process_request function tries each of these 5 connection methods in turn. What I can't find is how to save the type of connection that worked (for example, if TOR does not work but a direct connection does, I want to reuse that setting for all my other scraping of the same website). How can I save this setting?

On a related note, I have a pipeline that downloads the images of items. Is there a way to make it use this middleware as well (ideally reusing the saved settings)?

Thanks in advance for your help.


Solution

  • I think you could use the retry middleware as a starting point:

    1. You could use request.meta["proxy_method"] to keep track of which connection method you are using

    2. You could reuse request.meta["retry_times"] to track how many times you have retried a given method, and reset the value to zero when you change the proxy method.

    3. You could use request.meta["proxy"] to use the proxy server you want via the existing HTTP proxy middleware. You may want to tweak the middleware ordering so that the retry middleware runs before the proxy middleware (in Scrapy's default settings, RetryMiddleware sits at priority 550 and HttpProxyMiddleware at 750).
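The three points above can be sketched as follows. This is a minimal illustration, not a drop-in Scrapy middleware: to stay self-contained it avoids importing Scrapy and works on any request object that has a `url` string and a `meta` dict (Scrapy's `Request` satisfies both), and it returns the strings `"ok"`, `"retry"`, and `"give up"` where a real middleware would return the response or a `request.replace(dont_filter=True)` copy. The class name, the fallback order in `PROXY_METHODS`, and the proxy addresses are all illustrative assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical fallback order: try Tor (via a local HTTP proxy) first,
# then fall back to a direct connection. Adjust to your own list.
PROXY_METHODS = ["tor", "direct"]
PROXIES = {"tor": "http://127.0.0.1:8118", "direct": None}
MAX_RETRY_TIMES = 2
RETRY_HTTP_CODES = {500, 502, 503, 504, 522, 524, 408, 429}


class ProxyFallbackMiddleware:
    def __init__(self):
        # Per-domain cache of the method that last worked, so later
        # requests to the same website (including image downloads queued
        # by a pipeline) skip straight to it.
        self.working_method = {}

    def process_request(self, request, spider=None):
        domain = urlsplit(request.url).netloc
        # Point 1: remember which connection method this request uses,
        # preferring whatever already worked for this domain.
        method = request.meta.setdefault(
            "proxy_method", self.working_method.get(domain, PROXY_METHODS[0])
        )
        proxy = PROXIES[method]
        if proxy is not None:
            # Point 3: the built-in HttpProxyMiddleware reads this key.
            request.meta["proxy"] = proxy
        else:
            request.meta.pop("proxy", None)

    def process_response(self, request, status, spider=None):
        domain = urlsplit(request.url).netloc
        method = request.meta["proxy_method"]
        if status not in RETRY_HTTP_CODES:
            self.working_method[domain] = method  # remember the winner
            return "ok"
        # Point 2: count retries per method via retry_times...
        retries = request.meta.get("retry_times", 0) + 1
        if retries <= MAX_RETRY_TIMES:
            request.meta["retry_times"] = retries
            return "retry"
        # ...and reset the counter when switching to the next method.
        nxt = PROXY_METHODS.index(method) + 1
        if nxt < len(PROXY_METHODS):
            request.meta["proxy_method"] = PROXY_METHODS[nxt]
            request.meta["retry_times"] = 0
            return "retry"
        return "give up"
```

Because the per-domain cache lives on the middleware instance, any request that passes through the downloader, including those generated by an images pipeline, benefits from the saved setting without extra wiring.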