Tags: scrapy, tor, scrapy-splash, splash-js-render, polipo

Scrapy-Splash with Tor


I have succeeded in running Scrapy with Tor by following this guide: http://pkmishra.github.io/blog/2013/03/18/how-to-run-scrapy-with-TOR-and-multiple-browser-agents-part-1-mac/

But I couldn't run Splash with Tor.

In Scrapy's settings.py I pointed HTTP_PROXY at Polipo (8123 is the Polipo port):

HTTP_PROXY = 'http://127.0.0.1:8123'

In polipo.config, I pointed Polipo at Tor (9150 is the Tor port):

socksParentProxy = localhost:9150

diskCacheRoot=""

This works perfectly for Scrapy, but not for Splash. I need to tell Splash (or Docker) to use Polipo as its HTTP proxy, the way settings.py does for Scrapy: Docker should somehow use Polipo, and Polipo will forward to Tor. How can I do that?

I run splash with:

sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

and in /etc/default/docker I tried to point Docker at Polipo with this:

export http_proxy='http://127.0.0.1:8123'
Environment="http_proxy=http://127.0.0.1:8123"

But it didn't work. What am I doing wrong? Thanks :)


Solution

  • You need to

    1. make Tor accessible from Splash Docker container;
    2. tell Splash to use this Tor proxy.

    For (2) you can either use Splash proxy profiles or set the proxy directly, either in the proxy argument or by calling request:set_proxy in a splash:on_request callback of a Lua script. For example, if Tor can be accessed from the Splash Docker container as tor:8123, you can make a request like this:

    http://<splash-url>:8050/render.html?url=...&proxy=socks5://tor:8123
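
The other route mentioned above, request:set_proxy inside a splash:on_request callback, might look roughly like the sketch below. It only builds the JSON body you would POST to Splash's /execute endpoint; the host "tor" and port 8123 are the same placeholders as in the URL above, and build_execute_payload is a hypothetical helper name, not part of Splash.

```python
import json

# Lua script for Splash's /execute endpoint. "tor" and 8123 are the
# placeholder host and port from the example above, not Splash defaults.
LUA_SOURCE = """
function main(splash)
    -- Send every outgoing request of this render through the proxy.
    splash:on_request(function(request)
        request:set_proxy{host = "tor", port = 8123, type = "SOCKS5"}
    end)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""


def build_execute_payload(target_url):
    """Return the JSON body for a POST to http://<splash-url>:8050/execute."""
    return json.dumps({"lua_source": LUA_SOURCE, "url": target_url})


payload = build_execute_payload("http://example.com")
```

Send payload with Content-Type: application/json. The script reads the target from splash.args.url, so the same Lua source can be reused for any page.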
    

    Also, take a look at https://github.com/TeamHG-Memex/aquarium, which sets all of this up: it creates a 'tor' proxy profile, starts Tor in another Docker container, and links the containers. To access a remote website through Tor in a Splash instance deployed via Aquarium, you can just add a proxy=tor GET argument to the request:

    http://<splash-url>:8050/render.html?url=...&proxy=tor
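
If you build these URLs from Python, a small helper along these lines keeps the escaping right for both forms of the proxy argument (a full proxy URL or a profile name). This is only a sketch: splash_render_url is a hypothetical name, and http://localhost:8050 assumes Splash is published on port 8050 of the local machine, as in the docker run command from the question.

```python
from urllib.parse import urlencode


def splash_render_url(splash_base, target_url, proxy):
    """Build a render.html URL that asks Splash to fetch target_url via proxy.

    proxy is either a proxy URL such as "socks5://tor:8123" or the name of
    a proxy profile such as "tor".
    """
    query = urlencode({"url": target_url, "proxy": proxy})
    return f"{splash_base}/render.html?{query}"


# Proxy given directly as a URL (placeholder host and port from the answer).
direct = splash_render_url(
    "http://localhost:8050", "http://example.com", "socks5://tor:8123"
)

# Proxy given as a profile name, as in an Aquarium deployment.
profile = splash_render_url("http://localhost:8050", "http://example.com", "tor")
```

Percent-encoding the url and proxy values matters as soon as the target URL has its own query string; otherwise its & and = characters would be read as extra Splash arguments.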