perl web-scraping www-mechanize www-mechanize-firefox

WWW::Mechanize::Firefox looping through links


I am using a foreach to loop through links. Do I need a $mech->back(); to continue the loop, or is that implicit?

Furthermore, do I need a separate $mech2 object for nested foreach loops?

The code I currently have gets stuck (it does not complete) and ends on the first page where td#tabcolor3 is not found.

foreach my $sector ($mech->selector('a.link2'))
{
    $mech->follow_link($sector);

    foreach my $place ($mech->selector('td#tabcolor3'))
    {
        if (($mech->selector('td#tabcolor3', all=>1)) >= 1)
        {
            $mech->follow_link($place);
            print $_->{innerHTML}, '\n'
                for $mech->selector('td.dataCell');
            $mech->back();
        }
        else
        {
            $mech->back();
        }
    }
}

Solution

  • You cannot access information from a page once it is no longer on display. However, foreach builds its complete list before it starts iterating, so the loops you have written should be fine.

    There is no need for the call to back, as the links are absolute. If you had used click, there would have to be a link on the current page to click on, but with follow_link all you are doing is going to a new URL.

    There is also no need to check the number of links to follow, as a for loop over an empty list will simply not be executed.
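
    The two points about loop behaviour can be checked without a browser. The following is only a minimal sketch in plain Perl: fake_selector is a hypothetical stand-in for $mech->selector that counts its own calls, and the file names are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical stand-in for $mech->selector: returns whatever it is
    # given and counts how many times it has been called.
    my $calls = 0;
    sub fake_selector {
        my @found = @_;
        $calls++;
        return @found;
    }

    # The list expression is evaluated once, before the first iteration,
    # so the loop works from a fixed list of items.
    my @visited;
    for my $link (fake_selector('a.html', 'b.html', 'c.html')) {
        push @visited, $link;
    }
    my $calls_after_first = $calls;

    # An empty list means the body never runs, so there is no need to
    # count the matches before looping.
    my $empty_runs = 0;
    for my $link (fake_selector()) {
        $empty_runs++;
    }

    print "selector calls during first loop: $calls_after_first\n";
    print "visited: @visited\n";
    print "iterations over empty list: $empty_runs\n";
    ```

    Run as written, this reports a single selector call for the first loop and zero iterations for the empty list.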

    To make things clearer I suggest that you assign the results of selector to an array before the loop.

    Like this

    my @sectors = $mech->selector('a.link2');
    for my $sector (@sectors) {
    
        $mech->follow_link($sector);
    
        my @places = $mech->selector('td#tabcolor3');
        for my $place (@places) {
    
            $mech->follow_link($place);
    
            print $_->{innerHTML}, "\n" for $mech->selector('td.dataCell');
        }
    }
    

    Update

    My apologies. It seems that follow_link is finicky and needs to follow a link on the current page.

    I suggest that you extract the href attribute from each link and use get instead of follow_link.

    my @sector_urls = map $_->{href}, $mech->selector('a.link2');
    for my $sector_url (@sector_urls) {

        $mech->get($sector_url);
    
        my @places = map $_->{href}, $mech->selector('td#tabcolor3');
        for my $place (@places) {
    
            $mech->get($place);
    
            print $_->{innerHTML}, "\n" for $mech->selector('td.dataCell');
        }
    }
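
    One thing to watch when collecting href values this way: an element that is not itself a link (a td, for instance, rather than the a inside it) may not have an href at all, and passing undef to get will not work. The following is a rough sketch in plain Perl of filtering those out. The hash references and example.com URLs are hypothetical stand-ins for the nodes that selector returns, not real output from the module.

    ```perl
    use strict;
    use warnings;

    # Hypothetical stand-ins for the nodes returned by $mech->selector(...):
    # plain hash refs, one of which (the TD) carries no href.
    my @nodes = (
        { tagName => 'A',  href => 'http://example.com/sector/1' },
        { tagName => 'TD' },
        { tagName => 'A',  href => 'http://example.com/sector/2' },
    );

    # Keep only the entries that actually have an href to visit.
    my @urls = grep { defined($_) } map { $_->{href} } @nodes;

    print "$_\n" for @urls;
    ```

    With real nodes the same grep can be placed in front of either map in the loops above.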
    

    Please let me know whether this works on the site you are connecting to.