python web-scraping tor

Intentionally rotating and holding IP addresses in web scraping


I am no scraping expert. I have a small hobby Python project that scrapes data from a website that seems to be heavily guarded, using Tor, Privoxy and a custom Python class. There have been some hiccups along the way, but it works surprisingly well at the moment.

There's one thing I don't understand. Why do all libraries and snippets I've seen implement:

  1. IP rotation after n requests or after a specified time limit.
  2. Mechanisms to hold and release used IPs after n requests.

My approach is simply to pick a User-Agent, send the NEWNYM signal to Tor and scrape until the server kicks me out (403 or similar), then repeat with a new UA and IP. So far this has outperformed the techniques above by far in both speed and reliability.
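To make the approach above concrete, here is a minimal sketch of the "scrape until blocked, then rotate" loop. The names `scrape_until_blocked`, `renew_tor_identity`, `BLOCK_STATUSES` and `USER_AGENTS` are hypothetical and not taken from my actual class; `fetch` is any callable you supply that sends the request through the proxy. The `stem` calls assume the `stem` package is installed and that Tor's control port is enabled on 9051, which may not match your setup.

```python
import random
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Assumption: the status codes this particular site uses to signal a block.
BLOCK_STATUSES = {403, 429}

# Assumption: placeholder User-Agent strings, replace with real ones.
USER_AGENTS = ["example-agent/1.0", "example-agent/2.0", "example-agent/3.0"]


def renew_tor_identity(control_port: int = 9051,
                       password: Optional[str] = None) -> None:
    """Ask Tor for new circuits via NEWNYM (requires the `stem` package)."""
    # Imported lazily so the rest of the sketch runs without stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)


def scrape_until_blocked(
    urls: Iterable[str],
    fetch: Callable[[str, str], Tuple[int, str]],
    renew_identity: Callable[[], None],
    max_retries: int = 5,
) -> Iterator[Tuple[str, int, str]]:
    """Yield (url, status, body); rotate UA and identity only when blocked."""
    user_agent = random.choice(USER_AGENTS)
    for url in urls:
        for _ in range(max_retries + 1):
            status, body = fetch(url, user_agent)
            if status not in BLOCK_STATUSES:
                yield url, status, body
                break
            # Blocked: request a new identity, pick a new UA, retry this URL.
            renew_identity()
            user_agent = random.choice(USER_AGENTS)
        else:
            raise RuntimeError(f"still blocked on {url} after {max_retries} retries")
```

In real use, `fetch` would wrap whatever HTTP client goes through Privoxy, and `renew_identity` would be `renew_tor_identity`. Passing both in as callables keeps the loop testable without a running Tor instance.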

This is probably not a Tor-exclusive question, but there are reasons to care specifically with Tor: the number of exit nodes is limited, and there is no guarantee that the NEWNYM signal provides a different IP address every time.
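Since NEWNYM may hand back the same exit IP, one option is to check the address after each signal and retry a few times. This is only a sketch: `renew_until_new_ip`, `get_exit_ip` and `renew_identity` are hypothetical names, and `get_exit_ip` stands for any callable that returns the IP the outside world currently sees (for example by querying an IP-echo service through the proxy). The pause between attempts is an assumption based on Tor rate-limiting NEWNYM requests; tune it for your setup.

```python
import time
from typing import Callable, Optional


def renew_until_new_ip(
    get_exit_ip: Callable[[], str],
    renew_identity: Callable[[], None],
    max_attempts: int = 5,
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Signal for a new identity until the exit IP changes.

    Returns the new IP, or None if it never changed within max_attempts.
    """
    old_ip = get_exit_ip()
    for _ in range(max_attempts):
        renew_identity()
        # Give Tor time to build new circuits before checking again.
        sleep(wait_seconds)
        new_ip = get_exit_ip()
        if new_ip != old_ip:
            return new_ip
    return None
```

A `None` result means the caller has to decide what to do next, such as waiting longer or carrying on with the same address.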

I've had mixed success with free proxies; that's a topic I've yet to explore in depth.

What am I missing?


Solution

  • Doing 1 or 2 seems like a flawed approach if a website has no scraping detection and lets you make a trillion requests from the same IP without consequence. Every case is different, so I'd just do whatever works or is required for your situation.

    Changing IPs after a set number of requests or a set amount of time might help avoid detection if a website blocks an IP after, say, it has viewed 1,000 different URLs or more than 60 pages per minute. By implementing those features, the library authors are making assumptions about how a website might treat a crawler, and also about how you are using their code (fast crawler vs slow crawler, many pages vs few pages, etc.).

    I wouldn't worry too much about it and would stick with whatever is working for your use case. If it hasn't been necessary to change IPs frequently, then don't. If they start blocking you, then you'll need to change tactics again to something that works.
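For comparison, the proactive "rotate after n requests or after a time limit" rule from the question can be sketched as a small policy object. `RotationPolicy` and its default thresholds are hypothetical, loosely modelled on the 1,000-URL example above, and not taken from any particular library; they encode exactly the kind of guess about the target site that may or may not hold.

```python
import time
from typing import Callable


class RotationPolicy:
    """Decide when to rotate: after max_requests or max_seconds, whichever first."""

    def __init__(
        self,
        max_requests: int = 1000,
        max_seconds: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.max_seconds = max_seconds
        self._clock = clock
        self.reset()

    def reset(self) -> None:
        """Call right after switching to a new IP."""
        self._count = 0
        self._started = self._clock()

    def record_request(self) -> None:
        """Call once per request sent from the current IP."""
        self._count += 1

    def should_rotate(self) -> bool:
        elapsed = self._clock() - self._started
        return self._count >= self.max_requests or elapsed >= self.max_seconds
```

A crawler would call `record_request()` after each fetch and rotate whenever `should_rotate()` returns true. If the site never blocks on those thresholds, the rotations are pure overhead, which is consistent with the reactive approach performing better in the question.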