I am trying to scrape product data, section by section, from a Zen Cart store using Simple HTML DOM. I can scrape data from the first page fine, but when I try to load the 'next' page of products the site returns the index.php landing page instead.
If I call the function directly with *http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36&sort=20a&page=2*, it scrapes the product information from page 2 fine.
The same thing occurs if I use cURL (a rough sketch of that attempt is included after the function below).
require_once 'simple_html_dom.php';

getPrices('http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36');
function getPrices($sectionURL) {
    $opts = array('http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n" .
                    "Cookie: zenid=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\r\n"
    ));
    $context = stream_context_create($opts);
    $html = file_get_contents($sectionURL, false, $context);

    $dom = new simple_html_dom();
    $dom->load($html);

    // Do cool stuff here with information from page: product name, image, price and more info URL

    if ($nextPage = $dom->find('a[title= Next Page ]', 0)) {
        $nextPageURL = $nextPage->href;
        echo $nextPageURL;
        $dom->clear();
        unset($dom);
        getPrices($nextPageURL);
    } else {
        echo "\nNo more pages to scrape!!";
        $dom->clear();
        unset($dom);
    }
}
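For completeness, the cURL attempt was roughly along these lines (the helper name is just for the sketch, and the zenid value is a placeholder); it sent the same headers and cookie and behaved identically:

// Rough cURL equivalent of the file_get_contents request above
function fetchSectionPage($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Language: en'));
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6');
    curl_setopt($ch, CURLOPT_COOKIE, 'zenid=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx');
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // page 2 and onwards still comes back as the landing page
}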
Any ideas on how to fix this problem?
It turned out the next-page URLs being passed back into the function on each loop contained &amp; instead of &, and file_get_contents didn't like it. This sorted it:

$sectionURL = str_replace("&amp;", "&", urldecode(trim($sectionURL)));
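For anyone hitting the same thing, that line just needs to run before file_get_contents does, e.g. at the top of getPrices():

function getPrices($sectionURL) {
    // Hrefs pulled out of the page come back with &amp; entities;
    // decode them before fetching or Zen Cart serves the landing page.
    $sectionURL = str_replace("&amp;", "&", urldecode(trim($sectionURL)));

    // ... rest of the function unchanged ...
}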