php web-scraping html-parsing

PHP HTML scraping


It's my first post on the site, so bear with me.

OK, so I'm a complete beginner with PHP and I have a specific need for it in my project. I'm hoping some of you could help!

Basically, I want to scrape a webpage, pull out one particular HTML table, and reformat its contents into a specific output.

So, where to begin... Here's the PHP I have written so far:

<?php

$url = "http://www.goldenplec.com/festivals/oxegen-2/oxegen-2011";
$raw = file_get_contents($url);

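// Strip tabs, line breaks, double spaces and other control characters
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Find the table by its opening tag, then the first </table> after it
$start = strpos($content,'<table style="background: #FFF; font-size: 13px;"');
$end = strpos($content,'</table>',$start) + 8; // 8 = strlen('</table>')

$table = substr($content,$start,$end-$start);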

echo $table;


/* Regex here to echo the desired result */


?>

That URL contains the table I need. My code will simply echo that exact table.

However, and here's my problem: I'm by no means a regex expert, and I need to display the data from the table in a certain format. I want to echo an XML file containing a number of SQL INSERT statements, as follows:

$xml_output .= "<statement>INSERT INTO timetable VALUES(1,'Black Eyed Peas','Main Stage','Friday', '23:15')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(2,'Swedish House Mafia','Vodafone Stage','Friday', '23:30')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(3,'Foo Fighters','Main Stage','Saturday', '23:25')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(4,'Deadmau5','Vodafone Stage','Saturday', '23:05')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(5,'Coldplay','Main Stage','Sunday', '22:25')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(6,'Pendalum','Vodafone Stage','Sunday', '22:15')</statement>";

I hope I have provided enough info and I would greatly appreciate any help from you kind folk.

Thanks in advance.


Solution

  • You're much better off using something like XPath for scraping. I select all <td> elements and rely on the fact that the venue name is always in UPPERCASE. Each venue is followed by a list of days and some blank cells, so I skip over those. A cell containing ":" is a time, which marks the start of the acts for that venue. Since the event lasts three days and the table interleaves the acts for each day, I increment the day for each act and wrap it around once it passes the last day of the event.

    There may be some character encoding issues here, but I didn't dig into them. There are probably more elegant solutions out there.
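
    If the encoding does turn out to be a problem, one common workaround is to give loadHTML() a charset hint. This is only a sketch: it assumes the page is UTF-8, and the sample markup below is made up to stand in for the downloaded page.

    ```php
    <?php

    libxml_use_internal_errors(true);

    // Made-up sample markup standing in for the downloaded page (assumed UTF-8).
    $rawHTML = "<table><tr><td>Caf\xC3\xA9</td></tr></table>";

    $dom = new DOMDocument();
    // Without a charset hint, loadHTML() may treat the input as ISO-8859-1 and
    // garble multi-byte characters. Prefixing an XML encoding declaration is a
    // widely used workaround.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $rawHTML);

    $xpath = new DOMXPath($dom);
    $firstCell = $xpath->query('//table//td')->item(0)->nodeValue;
    echo $firstCell, "\n";
    ```

    If the page declares a different charset, converting it to UTF-8 first (for example with mb_convert_encoding()) would be the step to add before loading.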

    Edit: I just noticed that not all acts are interleaved evenly across the three days, which makes it harder to work out the day of each act. The first version of the code did not give accurate days for every act, mainly "Little Green Cars" and "Touchwood".

    Edit 2: The code is now updated and should parse all acts with the correct day. Days that have nothing scheduled show up as two consecutive empty strings (""). We can detect these and increment our $day counter.

    <?php
    
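    // Collect libxml warnings internally instead of printing them for malformed HTML
    libxml_use_internal_errors(true);
    
    // Path to the page; here a locally saved copy
    $url = "lineup2011.html";
    $rawHTML = file_get_contents($url);
    
    // Parse the markup into a DOM tree
    $dom = new DOMDocument();
    $dom->loadHTML($rawHTML);
    
    // Select every table cell in document order
    $xpath = new DOMXPath($dom);
    
    $nodeList = $xpath->query("//table//td");
    
    $nodeCount = 0;   // index of the cell currently being examined
    $venue = "";      // venue the following acts belong to
    $day = 0;         // running day counter within the current venue
    $acts = array();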
    
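    while ($nodeCount < $nodeList->length) {
        $value = $nodeList->item($nodeCount)->nodeValue;
    
        // An uppercase, non-empty cell without ":" is a venue heading.
        // Skip it together with the day headings and blank cells that follow.
        if (isUpper($value) && strpos($value, ":") === false && $value != "") {
            $venue = $value;
            $nodeCount += 7;
            $day = 0;
            continue;
        }
    
        // Two empty cells in a row mean nothing is scheduled in this slot:
        // move on to the next day without recording an act.
        if ($value == "" && $nodeList->item($nodeCount + 1)->nodeValue == "") {
            $day++;
            $nodeCount += 2;
            continue;
        }
    
        // Otherwise the cell is a time and the next cell is the act name.
        $act = array();
        $act['time'] = $value;
        $act['name'] = $nodeList->item($nodeCount + 1)->nodeValue;
        $act['venue'] = $venue;
    
        // 0-based day index within the three-day event
        $act['day'] = $day % 3;
    
        $day++;
    
        $acts[] = $act;
        $nodeCount += 2;
    }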
    
    print_r($acts);
    
    
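    // True when the string contains no lowercase letters
    // (digits and punctuation pass as well, hence the ":" check above).
    function isUpper($str) {
        return (strtoupper($str) === $str);
    }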
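
    To get from $acts to the XML the question asks for, something like the following sketch could work. The buildStatements() helper, the <statements> wrapper element and the day-name list are my own assumptions, not part of the original page or code. The day names simply follow the Friday to Sunday order used in the question's example.

    ```php
    <?php

    /**
     * Hypothetical helper: turn parsed acts into <statement> elements.
     * Assumes each act has 'time', 'name', 'venue' and a 0-based 'day' index.
     */
    function buildStatements(array $acts, array $dayNames = array('Friday', 'Saturday', 'Sunday'))
    {
        $xml = "<statements>\n";
        $id = 1;
        foreach ($acts as $act) {
            $fields = array($act['name'], $act['venue'], $dayNames[$act['day']], $act['time']);
            $values = array();
            foreach ($fields as $field) {
                // Double any single quote so it cannot end the SQL string literal early.
                $values[] = "'" . str_replace("'", "''", trim($field)) . "'";
            }
            $sql = "INSERT INTO timetable VALUES(" . $id . "," . implode(",", $values) . ")";
            // Escape &, < and > so the statement is valid XML text.
            $xml .= "<statement>" . htmlspecialchars($sql, ENT_NOQUOTES, 'UTF-8') . "</statement>\n";
            $id++;
        }
        return $xml . "</statements>\n";
    }

    // Sample input in the same shape as the $acts array built above.
    $sample = array(
        array('time' => '23:15', 'name' => 'Black Eyed Peas', 'venue' => 'MAIN STAGE', 'day' => 0),
    );
    echo buildStatements($sample);
    ```

    Venue names come out in uppercase as scraped, so wrap them in ucwords(strtolower(...)) if you want "Main Stage" as in the question. If the statements are later run from PHP rather than shipped as text, prepared statements would be a safer choice than hand-built SQL strings.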