python, proxy, web-scraping, scrapy, web-crawler

How to make Scrapy crawl through a proxy IP address


I use Tor to crawl web pages. I started the Tor and Polipo services and added:

    class ProxyMiddleware(object):
        # Overwrite process_request so every request goes through the proxy
        def process_request(self, request, spider):
            # Polipo's default port; Scrapy expects a full proxy URL with scheme
            request.meta['proxy'] = "http://127.0.0.1:8123"
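
For this middleware to take effect it also has to be enabled in settings.py. A minimal sketch, assuming the class above lives in middlewares.py inside a project package named myproject (both names are placeholders):

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.ProxyMiddleware': 100,
    }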

Now, how can I make sure that Scrapy uses a different IP address for its requests?


Solution

  • You can yield a first request to check your public IP and compare it to the IP you see when you visit http://checkip.dyndns.org/ without Tor/VPN. If the two differ, Scrapy is clearly using a different IP.

    # at the top of the spider module
    from scrapy import Request

    def start_requests(self):
        yield Request('http://checkip.dyndns.org/', callback=self.check_ip)
        # yield other requests from start_urls here if needed

    def check_ip(self, response):
        # Extract the first IPv4 address from the response body
        pub_ip = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]
        print("My public IP is: " + pub_ip)
        # yield other requests here if needed
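
As a quick sanity check outside Scrapy, you can also fetch the same page directly and through the proxy and compare the two results. A minimal sketch using the requests library, assuming Polipo is listening on 127.0.0.1:8123:

    import requests

    URL = 'http://checkip.dyndns.org/'

    # IP as seen without the proxy
    direct = requests.get(URL).text
    # IP as seen through Polipo, which forwards traffic to Tor
    via_tor = requests.get(URL, proxies={'http': 'http://127.0.0.1:8123'}).text

    print(direct)
    print(via_tor)  # should report a different (Tor exit node) IP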