php web-scraping

How to 'scrape' content from a page's source?


I have this code which gets the HTML source of a page:

$page = file_get_contents('http://example.com/page.html');
$page = htmlentities($page);

I want to scrape some content from it. For example, say the page's source contains this:

<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br /><br />Pinging <strong>newsgator.com</strong><br />
Done<br /><br />Pinging <strong>blo.gs</strong><br />
Done<br /><br />Pinging <strong>feedburner.com</strong><br />
Done<br /><br />Pinging <strong>blogstreet.com</strong><br />
Done<br /><br />Pinging <strong>my.yahoo.com</strong><br />
Connection failed<br /><br />Pinging <strong>moreover.com</strong><br />
Connection failed<br /><br />Pinging <strong>newsisfree.com</strong><br />
Done<br />

Is there a way to scrape this from the source and store it in a variable, so that it looks like this:

technorati.com Connection failed
icerocket.com Connection failed
weblogs.com Done
Etc.

Of course, the page is dynamic, which is why I'm having a problem. Could I search for each site in the source? If so, how would I get the result that follows it (Connection failed / Done)?
Thanks a lot for the help!


Solution

  • I have scraped several sites using the Simple HTML DOM PHP library, which can be obtained here: http://simplehtmldom.sourceforge.net/

    Then use code like this:

    <?php
    include_once 'simple_html_dom.php';

    $url = "http://slashdot.org/";
    $html = file_get_html($url);
    if (!$html) {
        die("Could not load $url\n");
    }

    // Patterns that trim leading/trailing whitespace and collapse inner runs
    $pat = array('/^\s+/', '/\s{2,}/', '/\s+$/');
    $rep = array('', ' ', '');

    foreach ($html->find('h2') as $heading) { // for each heading
        // Take the first link inside a span; skip headings that have none
        $link = $heading->find('span a', 0);
        if ($link) {
            echo preg_replace($pat, $rep, $link->plaintext) . "\n";
        }
    }
    ?>
    

    This results in something like:

    5.8 Earthquake Hits East Coast of the US
    Origins of Lager Found In Argentina
    Inside Oregon State University's Open Source Lab
    WebAPI: Mozilla Proposes Open App Interface For Smartphones
    Using Tablets Becoming Popular Bathroom Activity
    The Syrian Government's Internet Strategy
    Deus Ex: Human Revolution Released
    Taken Over By Aliens? Google Has It Covered
    The GIMP Now Has a Working Single-Window Mode
    Zombie Cookies Just Won't Die
    Motorola's Most Important 18 Patents
    MK-1 Robotic Arm Capable of Near-Human Dexterity, Dancing
    Evangelical Scientists Debate Creation Story
    Android On HP TouchPad
    Google Street View Gets Israeli Government's Nod
    Internet Restored In Tripoli As Rebels Take Control
    GA Tech: Internet's Mid-Layers Vulnerable To Attack
    Serious Crypto Bug Found In PHP 5.3.7
    Twitter To Meet With UK Government About Riots
    EU Central Court Could Validate Software Patents
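
For the ping output in the question specifically, a library may not be needed. The following is a minimal sketch, assuming each result has the shape `<strong>host</strong><br />status<br />` as in the sample, and that it runs on the raw HTML (before any call to htmlentities(), which would turn the tags into entities). The function name parse_ping_results and the inline $page sample are made up for illustration; in practice $page would come from file_get_contents().

```php
<?php
// Sample copied from the question; stands in for the fetched page source.
$page = '<strong>technorati.com</strong><br />
Connection failed<br /><br />Pinging <strong>icerocket.com</strong><br />
Connection failed<br /><br />Pinging <strong>weblogs.com</strong><br />
Done<br />';

// Hypothetical helper: returns an array of host => status pairs.
function parse_ping_results(string $html): array
{
    $results = [];
    // Group 1: host inside <strong>...</strong>.
    // Group 2: the text between the following <br /> and the next <br />.
    $pattern = '~<strong>([^<]+)</strong>\s*<br\s*/?>\s*([^<]+?)\s*<br\s*/?>~i';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $results[trim($m[1])] = trim($m[2]);
        }
    }
    return $results;
}

$results = parse_ping_results($page);
foreach ($results as $site => $status) {
    echo $site . ' ' . $status . "\n";
}
```

Because the pattern keys on the markup around each host rather than on a fixed list of site names, it should keep working when the set of pinged sites changes, as long as the page keeps the same structure. If the markup is less regular than the sample, a DOM-based parser such as the one above is likely the safer choice.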