
Web Scrape Symfony2 - Impossible Challenge - Crawler Parsing


(Edit: I still haven't found a way to solve this problem. The $crawler object seems ridiculous to work with; I just want to parse it for a specific <td> text, how hard is that? I also cannot serialize() the entire crawler object to turn the page's source code into a string, or else I could just parse that string the hard way. Please help. I feel I've described the problem well below.)

Below I'm using Symfony, Goutte, and DomCrawler to scrape a web page. I've been trying to figure it out through other questions with no success, so now I'm just going to post all my code to make this as straightforward as possible.

I am able to get the page and the first bit of data I'm looking for. That first bit is a URL that is printed from JavaScript and sits inside the onclick attribute of an <a> tag. The attribute is a long string, so I use preg_match to sift through it and get exactly what I need.

The next bit of data I need is some text within a <td> tag. The thing is, this web page has 10-20 different <table> tags, and none of them have id="" or class="" attributes, so it's hard to isolate the right one. So what I'm trying to do is search for the words "Event Title", then go to the next sibling <td> tag and extract its inner HTML, which will be the actual title.

The problem is that for the second part I can't seem to parse through the $crawler object properly. I don't understand why: the preg_match on a serialize()d piece of the $crawler object worked for the first part, but the same approach fails for the bottom half.

$crawler = $client->request('GET', 'https://movies.randomjunk.com/events/EventServlet?ab=mov&eventId=154367');



$aurl = 'http://movies.randomjunk.com/r.htm?e=154367'; // event url beginning string
$gas = $overview->filter('a[onclick*="' . $aurl . '"]');

$string1 = serialize($gas->filter('a')->attr('onclick')); //TEST
$string1M = preg_match("/(?<=\')(.*?)(?=\')/", $string1, $finalURL); 
$aString = $finalURL[0];
echo "<br><br>" . $aString . "<br><br>";
// IT WORKS UP TO HERE


// $title = $crawler->filterXPath('//td[. = "Event Title"]/following-sibling::td[1]')->each(funtion (Crawler $crawler, $i) {
//     return $node->text();
// }); // No clue why, but this doesn't work. 

$html = $overview->getNode(0)->ownerDocument->saveHTML();


$re = "/>Event\sTitle.*?<\\/td>.*?<td>\\K.*?(?=<\\/td>)/s";
$str = serialize($html);
print_r($str);
preg_match_all($re, $str, $matches);
$gas2 = $matches[0];


echo "<pre>";
    print_r($gas2);
echo "</pre>";

My preg_match_all just returns an empty array. I think it's a problem with searching the $crawler object, since it's made up of many nodes. I've been trying to convert it all to HTML and then run the regex on that, but it just refuses to work. I've added a few print_r statements, and they just print the whole web page.

Here's an example of some of the HTML inside the crawler object:

{lots of other html and tables}
<table> 
    <tr>
        <td>Title</td>
        <td>The Harsh Face of Mother Nature</td>
        <td>The Harsh Face of Mother Nature</td>
    </tr>
    .
    .
</table>
{lots of other html and tables} 

And the goal is to parse through the entire page/$crawler object and get the title "The Harsh Face of Mother Nature".

I know this must be possible, but the only answer anyone wants to provide is a link to the domcrawler page which I've read about a thousand times at this point. Please help.


Solution

  • Given the html fragment above I was able to come up with the XPath of:

    //table/tr/td[.='Title']/following-sibling::td[1]
    

    You can test the XPath against your provided HTML fragment in an online XPath tester. With DomCrawler it looks like this:

    $html = '<table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table>';
    $crawler = new Symfony\Component\DomCrawler\Crawler($html);
    
    $query = "//table/tr/td[.='Event Title']/following-sibling::td[1]";
    $crawler->filterXPath($query)->each(function ($node, $i) {
        echo $node->text() . PHP_EOL;
    });

    Which outputs:

    The Harsh Face of Mother Nature
    The Harsh Face of Mother Nature
    The Harsh Face of Mother Nature
    

    Update: Tested successfully with:

    $html = '<html><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table></html>';
    

    Update: After being provided with sample html from the website I was able to get things to parse with the following XPath:

    //td[normalize-space(text()) = 'Event Title']/following-sibling::td[1]
    

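    The real issue was the leading and trailing whitespace around "Event Title" inside the cell. It makes an exact comparison such as .='Event Title' fail, while normalize-space() strips that whitespace before comparing.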
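
The whitespace effect can be reproduced in isolation. The following is only a minimal sketch: it uses PHP's built-in DOMDocument and DOMXPath rather than DomCrawler so that it runs without Composer, and the HTML is a made-up fragment that assumes the real page pads the label cell with newlines and spaces. The same XPath expressions can be passed to filterXPath().

```php
<?php
// Hypothetical fragment: the label cell is padded with whitespace,
// as assumed for the real page.
$html = "<html><body><table><tr>
    <td>
        Event Title
    </td>
    <td>The Harsh Face of Mother Nature</td>
    <td>The Harsh Face of Mother Nature</td>
</tr></table></body></html>";

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact comparison: the cell's string value still contains the padding,
// so no <td> matches.
$exact = $xpath->query("//td[. = 'Event Title']/following-sibling::td[1]");

// normalize-space() trims and collapses whitespace before the comparison.
$normalized = $xpath->query(
    "//td[normalize-space(text()) = 'Event Title']/following-sibling::td[1]"
);

echo $exact->length . PHP_EOL;                    // 0
echo $normalized->length . PHP_EOL;               // 1
echo $normalized->item(0)->textContent . PHP_EOL; // The Harsh Face of Mother Nature
```

The [1] after following-sibling::td keeps only the cell immediately after the label, so the duplicated title in the third cell is not returned.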