perl, web-scraping, proxy, tor, torsocks

Scrape from .onion site using Web::Scraper


Problem: Scrape from tor .onion site using Web::Scraper

I would like to modify my code to connect to a .onion site. I believe I need to connect through the SOCKS5 proxy, but I am unsure how to do that with Web::Scraper.

Existing code:

use Web::Scraper;
my $piratelink=$PIRATEBAYSERVER.'/search/' . $srstring . '%20'. 's'.$sval[1].'e'.$epinum.'/0/7/0';
my $purlToScrape = $piratelink;
my $ns = scraper {
    process "td>a", 'mag[]' => '@href';
    process "td>div>a", 'tor[]' => '@href';
    process "td font.detDesc", 'sizerow[]' => 'TEXT';
};
my $mres = $ns->scrape(URI->new($purlToScrape));

Solution

  • Web::Scraper uses LWP if you pass a URI to scrape.

    You can either fetch the HTML yourself with another HTTP library that supports SOCKS, or configure an LWP::UserAgent to use SOCKS and assign it to Web::Scraper's shared UserAgent variable so that scrape uses it as the agent.

    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;
    use Web::Scraper;
    
    # set up an LWP::UserAgent object that goes through the Tor SOCKS address
    my $ua = LWP::UserAgent->new(
        agent => q{Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; YPC 3.2.0; .NET CLR 1.1.4322)},
    );
    $ua->proxy([qw/ http https /] => 'socks://localhost:9050'); # Tor proxy
    $ua->cookie_jar({});
    
    my $PIRATEBAYSERVER = 'http://uj3wazyk5u4hnvtk.onion';
    my $srstring = 'photoshop';
    
    
    my $piratelink=$PIRATEBAYSERVER.'/search/' . $srstring; # . '%20'. 's'.$sval[1].'e'.$epinum.'/0/7/0';
    
    my $purlToScrape = $piratelink;
    my $ns = scraper {      
        process "td>a", 'mag[]' => '@href';
        process "td>div>a", 'tor[]' => '@href';
        process "td font.detDesc", 'sizerow[]' => 'TEXT';
    };
    
    # override Scraper's UserAgent with our SOCKS LWP object
    $Web::Scraper::UserAgent = $ua;
    
    my $mres = $ns->scrape(URI->new($purlToScrape));
    
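    # $mres is a hash reference, so dump its contents instead of printing the reference itself
    use Data::Dumper;
    print Dumper($mres);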
    

    Note: you will also need to install the CPAN module LWP::Protocol::socks, which provides the socks:// proxy scheme used above.
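
The other option mentioned above, fetching the HTML yourself and handing it to the scraper, could look roughly like the sketch below. It assumes the page has already been downloaded by some SOCKS-capable client; the `$html` string here is a made-up stand-in for that response (not real site output), so the example runs without Tor or network access.

```perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

# Hypothetical HTML, standing in for a page fetched through a SOCKS-capable client.
my $html = <<'HTML';
<table>
  <tr>
    <td><a href="magnet:?xt=urn:btih:EXAMPLE">Magnet</a></td>
    <td><div><a href="/torrent/1/example">Example</a></div></td>
    <td><font class="detDesc">Size 1.2 GiB</font></td>
  </tr>
</table>
HTML

# Same selectors as in the question.
my $ns = scraper {
    process "td>a", 'mag[]' => '@href';
    process "td>div>a", 'tor[]' => '@href';
    process "td font.detDesc", 'sizerow[]' => 'TEXT';
};

# Passing a string instead of a URI means Web::Scraper does no HTTP request itself.
my $res = $ns->scrape($html);

print Dumper($res);
```

With this approach Web::Scraper never touches the network, so the proxy setup lives entirely in whatever client fetched the page. When scraping a string, scrape also accepts the page URL as an optional second argument, which should let relative links such as `/torrent/1/example` be resolved against it.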