php web-scraping simple-html-dom

simple_html_dom: scrape all lines with a given characteristic and output them below


This is as far as I have gotten scraping with simple_html_dom, working from the examples:

<?php
require 'simple_html_dom.php';
$html = file_get_html('https://nitter.absturztau.be/chillartaholic');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>

However, instead of retrieving a title and an image, I'd like to get all lines in the target page that begin with:

<a class="tweet-link"

and display the lines scraped - in their entirety - top to bottom below.

The first scraped line would then be:

> <a class="tweet-link"
> href="/ChillArtaholic/status/1413973360841744390#m"></a>

Is this possible with simple_html_dom, or are there limits on the number of lines that can be scraped at all?


Solution

  • Strangely enough, the answer from yesterday is gone.

    This is the approach from it that works (although that answer also covered several other approaches):

    <?php
    // Fetch the raw HTML of the target page.
    $url = 'https://nitter.absturztau.be/chillartaholic';
    $html = file_get_contents($url);

    // Parse it; @ suppresses the warnings libxml raises on malformed HTML.
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // Select every <a> whose class attribute is exactly "tweet-link".
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//a[@class="tweet-link"]');

    foreach ($nodes as $node) {
        // Link text (empty for the anchors shown in the question).
        echo $node->nodeValue;
        // The href attribute of each matching link.
        echo $node->getAttribute('href'), '<br>';
    }
    ?>
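
The loop above prints only each link's href, while the question asks for the matching lines in their entirety. A minimal sketch of one way to get the full tag with the same DOM classes follows. It assumes `DOMDocument::saveHTML()` called with a node argument, which serializes just that element. The helper name `tweetLinkMarkup` and the `$sample` string are made up for illustration and stand in for the HTML returned by `file_get_contents($url)`. The XPath is also loosened to match `tweet-link` as one class among several, which is an assumption about the page markup rather than something the original answer did.

```php
<?php
// Hypothetical helper (not from the original answer): returns the full
// markup of every <a> element whose class list contains "tweet-link".
function tweetLinkMarkup(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Token match, so class="tweet-link other" is selected as well.
    $query = '//a[contains(concat(" ", normalize-space(@class), " "), " tweet-link ")]';

    $lines = [];
    foreach ($xpath->query($query) as $node) {
        // saveHTML($node) serializes only this element, tags included.
        $lines[] = $dom->saveHTML($node);
    }
    return $lines;
}

// Stand-in markup for the fetched page (sample data, not real output).
$sample = '<div><a class="tweet-link" href="/ChillArtaholic/status/1413973360841744390#m"></a>'
        . '<a class="other" href="/x"></a></div>';

foreach (tweetLinkMarkup($sample) as $line) {
    // Escape so a browser shows the tag as text instead of rendering it.
    echo htmlspecialchars($line), "<br>\n";
}
```

To run it against the live page, pass the result of `file_get_contents($url)` instead of `$sample`. Neither DOMDocument nor XPath imposes a cap on the number of matches, so every matching link in the fetched HTML is returned.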