multithreading perl request lwp-useragent

How to convert a simple Perl request script to multithreaded requests


What I'm trying to do:

  1. Use an array of URLs.
  2. Check the content of each page.
  3. Show which page has the content I need.

How can I convert this script to use multithreading:

#!/usr/bin/perl
use LWP::UserAgent;
use HTTP::Request;
use LWP::Simple;

print "=> Verify \n\n";

@array = ("http://example.com","http://example.com/index.php","http://example.com/news.php","http://example.com/categories.php");

$lastPosition = $#array;

for ($i = 0; $i <= $lastPosition; $i++) {

    $url = $array[$i];

    my $req = HTTP::Request->new(GET => $url);
    my $ua  = LWP::UserAgent->new();
    $ua->timeout(10);
    my $resposta = $ua->request($req);

    if ($resposta->content =~ /Exist/) {
        print "Content Exist !";
    }
}


Solution

  • For this kind of problem, forking makes much more sense than using threads. Perl's support for threads is complicated, while using child processes via fork is much easier.
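
    To see why, here is a rough sketch of what managing child processes by hand with fork and waitpid looks like; the URLs are only placeholders, and the point is simply to show the bookkeeping involved.

    #!perl
    use strict;
    use warnings;

    # Hand-rolled forking (placeholder URLs): fork one child per URL,
    # then wait for all of the children in the parent.
    my @urls = qw( http://example.com http://example.org );

    my @pids;
    for my $url (@urls) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {
            # Child process: do the real work here, then exit.
            print "child $$ would fetch $url\n";
            exit 0;
        }

        # Parent process: remember the child's pid.
        push @pids, $pid;
    }

    # Reap every child so none is left behind as a zombie.
    waitpid $_, 0 for @pids;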

    You can use the convenient Parallel::ForkManager to handle all of these details for you. In fact, its documentation contains an example of fetching websites right at the top. However, that example downloads the pages to files, so we will look at some other code below.
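
    For reference, a download-to-file version along those lines might look roughly like this; the hash keys, file names and URLs are made-up placeholders for illustration, not taken from the module's documentation.

    #!perl
    use strict;
    use warnings;

    use LWP::Simple qw(getstore);
    use Parallel::ForkManager;

    # Fetch each URL in its own child process and save the page to a
    # file named after the hash key.
    my %urls = (
        example => 'http://example.com',
        w3c     => 'https://www.w3.org/',
    );

    my $pm = Parallel::ForkManager->new(5);

    for my $name (keys %urls) {
        $pm->start and next;    # fork; the parent moves on to the next URL

        # Child process: download the page to "<name>.html".
        getstore( $urls{$name}, "$name.html" );

        $pm->finish;            # child exits here
    }

    $pm->wait_all_children;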

    It looks like you have copied and pasted bits of a couple of different scripts from various resources to get what you have so far. There is a bunch of unnecessary stuff in your code, so I'll show you a much easier way.

    #!perl
    use strict;
    use warnings;
    
    use WWW::Mechanize;
    use Parallel::ForkManager;
    
    # You could set defaults such as headers or the user agent string here if you want
    my $mech = WWW::Mechanize->new;
    
    # You can set how many processes you want to run in parallel here
    my $pm = Parallel::ForkManager->new(10);
    
    my @urls = qw( http://example.com http://example.org https://www.w3.org/ );
    DATA_LOOP:
    foreach my $url (@urls) {
    
        # Forks and returns the pid for the child:
        my $pid = $pm->start and next DATA_LOOP;
    
        #####
        # This code is run in each worker...
    
        # Mechanize handles the HTTP request and response for us
        $mech->get($url);
        if ($mech->content =~ m/Example/) {
            print "$url: Content exists!\n";
        }
    
        # Terminates the child process
        $pm->finish;
    
        # ... end of the worker
        #####
    
    }
    
    # Wait until every child process has finished
    $pm->wait_all_children;
    

    This uses WWW::Mechanize instead of LWP::UserAgent. The former is a popular subclass of the latter, and it brings with it a bunch of convenient methods that make it much easier to get at the content of the pages. You do not need to load anything else, such as HTTP::Request or LWP::Simple (which you didn't actually use).
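
    As a rough illustration of those convenience methods (the URL is a placeholder), WWW::Mechanize lets you get at the status, title and content directly, without building request objects yourself:

    #!perl
    use strict;
    use warnings;

    use WWW::Mechanize;

    # A few of the convenience methods WWW::Mechanize adds on top of
    # LWP::UserAgent.
    my $mech = WWW::Mechanize->new(
        timeout   => 10,
        autocheck => 0,    # do not die on HTTP errors
    );

    $mech->get('http://example.com');

    print 'Status: ', $mech->status, "\n";    # HTTP status code of the response
    print 'Title:  ', $mech->title,  "\n";    # contents of the <title> tag
    print "Content matches!\n" if $mech->content =~ /Example/;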

    All we have to do is loop over your list of URLs, start a new Parallel::ForkManager worker for each one, and then write a few lines of code to get the page and check its content.

    You can tell Parallel::ForkManager how many workers you want to run in parallel. This number should reflect how many cores your machine has and how much bandwidth your network connection offers. Because this code is very network-heavy, you can probably run a lot more workers than you have CPU cores, as they will mostly be waiting for the other end of the internet to respond. It also depends on how many URLs there are. If you set the number too high, the workers might slow each other down and it will in fact take longer.
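
    A rough tuning sketch, with purely illustrative numbers: keep the limit adjustable, never start more workers than there are URLs, and experiment a little to find out what your machine and connection can handle.

    #!perl
    use strict;
    use warnings;

    use Parallel::ForkManager;

    my @urls = qw( http://example.com http://example.org https://www.w3.org/ );

    # Network-bound work, so the limit can exceed the CPU core count,
    # but there is no point in starting more workers than there are URLs.
    my $max_workers = 20;
    $max_workers = scalar @urls if @urls < $max_workers;

    my $pm = Parallel::ForkManager->new($max_workers);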