I am using customly configured VM to act as a proxy server (via squid) and now I try to use it for my scraper. I am using scrapy-rotating-proxies to rotate trought my ip list definition but the problem is that my proxy is treated as DEAD right on the first attempt even thought I have verified that the proxy address is alive and is working just fine (I tested it by setting a proxy in firefox and tried to browse both http
and https
web pages. The proxy server is passwordless for testing purposes
scrapy settings
DOWNLOADER_MIDDLEWARES = {
"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
"scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
"scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
"scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
"rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
"rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
ROTATING_PROXY_LIST = ["X.X.X.X:3128"]
scrapy logs
2022-12-02 13:31:22 [scrapy.core.engine] INFO: Spider opened
2022-12-02 13:31:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 13:31:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-02 13:31:22 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 1, reanimated: 0, mean backoff time: 0s)
2022-12-02 13:31:32 [rotating_proxies.expire] DEBUG: Proxy <http://X.X.X.X:3128> is DEAD
2022-12-02 13:31:32 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.johnlewis.com/header/api/config> with another proxy (failed 1 times, max retries: 5)
2022-12-02 13:31:32 [rotating_proxies.middlewares] WARNING: No proxies available; marking all proxies as unchecked
Settings I have changed for squid
http_access allow all
via off
forwarded_for delete
Please advice what can be the issue
"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
"scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
These middlewares was the issue, I cannot explain why scrapy was able to process my requests without proxies while having these middlewares enabled but after disabling them I was able to run scrapy using my proxies.