php web-crawler nodes goutte

PHP - Regex/function for traversing DOM nodes to get a specific tag


I'm using Goutte to crawl a URL with PHP.

I want to save the list <ul>...</ul> that comes right after this tag: <p><strong>Maladies fréquentes :</strong></p>

The DOM has this structure:

<p>....</p>
<p>....</p>
<p>....</p>
<p>....</p>
...
<h2>...</h2>
...
<ul>...</ul>
...
<p><strong>Maladies fréquentes :</strong></p>
<ul>
<li>Text I need</li>
<li>Text I need</li>
</ul>
...
<p></p>
<p></p>
...

Currently, I save to my DB using :first-of-type

$crawler->filter('.desc ul:first-of-type li')->each(function ($node) use (&$out) {

   $li = array();

   if ($node->count() > 0) {
        $li[] = str_replace('"', "'", trim($node->filter('li')->text()));
   }

   // Insert into DB

});

When the content contains 2 or 3 <ul>...</ul> elements, it always saves the wrong <li> items because all the <ul> elements are selected.

How can I select only the <ul> that follows <p><strong>Maladies fréquentes :</strong></p>?

Thanks!


Solution

  • I don't know much about Goutte, but I believe you can load the crawler's HTML into a DOMDocument and then query it with XPath. Something like:

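    $doc = new DOMDocument();
    // Assumption: $crawler is the Symfony DomCrawler instance returned by Goutte;
    // its html() method returns the markup inside the first matched node.
    $doc->loadHTML($crawler->html());
    $xpath = new DOMXPath($doc);
    // First <ul> sibling after any <p> that contains a <strong>
    $targets = $xpath->query('//p[strong]/following-sibling::ul[1]//li');
    // or, to match only that heading: $targets = $xpath->query('//p[contains(strong,"Maladies")]/following-sibling::ul[1]//li');
    foreach ($targets as $source) {
        echo $source->nodeValue . "\r\n";
    }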
    

    The output should be

    Text I need
    Text I need
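
For anyone who wants to try the XPath idea above without Goutte, here is a minimal, self-contained sketch that uses only PHP's built-in DOM extension. The helper name listItemsAfterLabel and the sample markup are made up for illustration; the markup only imitates the structure from the question, with an unrelated <ul> placed before the one that is wanted.

```php
<?php
// Hypothetical helper (not part of Goutte or of the answer above): returns the
// text of each <li> in the first <ul> that follows a <p><strong>...</strong></p>
// whose <strong> text contains $label.
function listItemsAfterLabel(string $html, string $label): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumes $label contains no double quote, because it is embedded in the expression.
    $query = sprintf(
        '//p[strong[contains(normalize-space(.), "%s")]]/following-sibling::ul[1]/li',
        $label
    );

    $items = [];
    foreach ($xpath->query($query) as $li) {
        $items[] = trim($li->textContent);
    }
    return $items;
}

// Sample markup imitating the question: the first <ul> must be skipped.
$html = <<<'HTML'
<div class="desc">
<p>Intro</p>
<h2>Title</h2>
<ul><li>Unrelated item</li></ul>
<p><strong>Maladies fréquentes :</strong></p>
<ul>
<li>Text I need</li>
<li>Text I need</li>
</ul>
<p></p>
</div>
HTML;

// Matching on the word "Maladies" follows the second query suggested in the answer.
echo implode("\n", listItemsAfterLabel($html, 'Maladies')), "\n";
```

The [1] after following-sibling::ul is what keeps the result to the list directly after the matched paragraph, so any later lists on the page are ignored as well as earlier ones.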