php, web-scraping

PHP isset() check still passes when the variable shouldn't be set


Hopefully this has a very simple solution; I'm new to PHP, so I'm probably missing something obvious. I'm building a scraper with ScraperWiki (although the problem is with my PHP and has little to do with ScraperWiki). The code is as follows:

<?php
require 'scraperwiki/simple_html_dom.php';

$allLinks = array();

function nextPage($nextUrl, $y)
{
    getLinks($nextUrl, $y);    
}

function getLinks($url) // gets links from product list page   
{
    global $allLinks;
    $html_content = scraperwiki::scrape($url);
    $html         = str_get_html($html_content);

    if (isset($y)) {
        $x = $y;
    } else {
        $x = 0;
    }

    foreach ($html->find("div.views-row a.imagecache-product_list") as $el) {
        $url          = $el->href . "\n";
        $allLinks[$x] = 'http://www.foo.com';
        $allLinks[$x] .= $url;
        $x++;
    }

    $next = $html->find("li.pager-next a", 0)->href . "\n";
    print_r("Printing $next:");
    print_r($next);

    if (isset($next)) {
        $nextUrl = 'http://www.foo.com';
        $nextUrl .= $next;
        print_r($nextUrl);
        $y = $x;
        print_r("Printing X:");
        print_r($x);
        print_r("Printing Y:");
        print_r($y);

        nextPage($nextUrl, $y);
    } else {
        return;
    }

}

getLinks("http://www.foo.com/department/accessories");

print_r($allLinks);

?>

EXPECTED OUTPUT: The script should scrape all of the links from the first page, find the "next page" button, scrape links from its URL, find the "next page" from that URL and so on and so forth. It should stop when there are no more "next page" links left.

CURRENT OUTPUT: The code is running fine, but it doesn't stop when it should. Here is the key line:

$next = $html->find("li.pager-next a", 0)->href . "\n";
if (isset($next)) { }

I ONLY want the nextPage() function to run if an "li.pager-next a" element exists on the page. Here is the output from the console:

    http://www.foo.com/department/accessories?page=1
    http://www.foo.com/department/accessories?page=2
    http://www.foo.com/department/accessories?page=3
    http://www.foo.com/department/accessories?page=4
    http://www.foo.com/department/accessories?page=5
    http://www.foo.com/department/accessories?page=6
    http://www.foo.com/department/accessories?page=7
    http://www.foo.com/department/accessories?page=8
    http://www.foo.com/department/accessories?page=9
    http://www.foo.com/department/accessories?page=10

    PHP Notice:  Trying to get property of non-object in /home/scriptrunner/script.php on line 31
    // THE LOOP SHOULD BREAK HERE BUT DOESN'T

    http://www.foo.com
    http://www.foo.com/home?page=1
    http://www.foo.com/home?page=2
    http://www.foo.com/home?page=3
    http://www.foo.com/home?page=4
    http://www.foo.com/home?page=5
    http://www.foo.com/home?page=6
    http://www.foo.com/home?page=7

Solution

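  • In your version $next is always set: even when the pager link is missing, appending "\n" to the (null) href produces the string "\n", so isset($next) is true. That is also why the scraper then requests the bare http://www.foo.com and starts paging through /home. What about checking the element itself first and reading ->href only inside the if: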

    $next = $html->find("li.pager-next a", 0);
    
    if (isset($next)) {
        $nextUrl = 'http://www.foo.com';
        $nextUrl .= $next->href; // move ->href here
        print_r($nextUrl . "\n"); // put \n here since we don't actually want that char in the url
        $y = $x;
        print_r("Printing X:");
        print_r($x);
        print_r("Printing Y:");
        print_r($y);
    
        nextPage($nextUrl, $y);
    } else {
        return;
    }
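
To see the difference in isolation, here is a minimal sketch that runs without ScraperWiki or simple_html_dom. The nextPageUrl() helper is hypothetical (it is not part of either library), stdClass objects stand in for DOM elements, and the sketch assumes that find() with an index returns null when nothing matches, which is what the fix above relies on.

```php
<?php
// Hypothetical helper (not part of simple_html_dom or ScraperWiki).
// $nextLink stands in for the result of $html->find("li.pager-next a", 0),
// assumed to be null when no element matches the selector.
function nextPageUrl($baseUrl, $nextLink)
{
    if (!isset($nextLink)) {
        return null; // no pager link on this page: stop paginating
    }
    return $baseUrl . $nextLink->href;
}

// Why the original check always passes: on the last page the missing
// element's href evaluates to null, and null . "\n" is the string "\n".
$hrefOnLastPage = null;
$next = $hrefOnLastPage . "\n";
$originalCheckPasses = isset($next);
var_dump($originalCheckPasses); // bool(true)

// A stdClass object stands in for a DOM element in this sketch.
$link = new stdClass();
$link->href = '/department/accessories?page=1';

var_dump(nextPageUrl('http://www.foo.com', $link)); // the absolute next-page URL
var_dump(nextPageUrl('http://www.foo.com', null));  // NULL
```

In the scraper itself, the same idea means calling nextPage() only when the element returned by find() is not null, as in the snippet above.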