Tags: php, html, web-scraping, simpledom

SimpleXML doesn't load <a> tag classes?


I have a bit of PHP that grabs the HTML from a page and loads it into a SimpleXML object. However, it's not getting the class of the <a> element inside each <li>.

The PHP:

    // load the HTML page with cURL
    $html = curl_exec($ch);
    curl_close($ch);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $sxml = simplexml_import_dom($doc);

The page HTML. A var_dump of $html shows that it has been scraped and is present in $html:

    <li class="large">
        <a style="" id="ref_3" class="off" href="#" onmouseover="highlightme('07');return false;" onclick="req('379');return false;" title="">07</a>
    </li>

The var_dump of $doc and $sxml (below) shows that the 'off' class on the <a> is now missing. Unfortunately, I need to process the page based on this class.

            [8]=>
             object(SimpleXMLElement)#50 (2) {
              ["@attributes"]=>
              array(1) {
                ["class"]=>
                string(16) "large"
              }
              ["a"]=>
              string(2) "08"
            }

Solution

  • Use simplexml_load_file and XPath; see the inline comments.

    What you are really after, once you have found the element you need, is this:

    $row->a->attributes()->class=="off"
    

    The full code is below:

    // $xml is the SimpleXMLElement for the page (e.g. from simplexml_load_file)
    // let's take every element (the divs) whose class is exactly "stff_grid"
    $divs = $xml->xpath("//*[@class='stff_grid']");
    
    // for each of these elements, let's print out the value inside the first p tag
    foreach($divs as $div){
        print $div->p->a . PHP_EOL;
    
        // now for each li tag let's print out the contents inside the a tag
        foreach ($div->ul->li as $row){
    
            // same as before
            print "  - " . $row->a;
            if ($row->a->attributes()->class=="off") print " *off*";
            print PHP_EOL;
    
            // or shorter
            // print "  - " . $row->a . (($row->a->attributes()->class=="off")?" *off*":"") . PHP_EOL;
    
        }
    }
    /* this outputs the following
    Person 1
      - 1 hr *off*
      - 2 hr
      - 3 hr *off*
      - 4 hr
      - 5 hr
      - 6 hr *off*
      - 7 hr *off*
      - 8 hr
    Person 2
      - 1 hr
      - 2 hr
      - 3 hr
      - 4 hr
      - 5 hr
      - 6 hr
      - 7 hr *off*
      - 8 hr *off*
    Person 3
      - 1 hr
      - 2 hr
      - 3 hr
      - 4 hr *off*
      - 5 hr
      - 6 hr
      - 7 hr *off*
      - 8 hr
    */
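
The class was never actually lost. When var_dump() prints a SimpleXMLElement, a child that contains only text is shown as a plain string, so that child's own attributes do not appear in the dump even though they are still on the element. The sketch below illustrates this with the DOMDocument/simplexml_import_dom approach from the question. It is a minimal example under stated assumptions, not the page's real markup: the first <li> is trimmed from the question's snippet, while the second <li> (id "ref_4", class "on") is hypothetical and added only for contrast.

```php
<?php
// Markup trimmed from the question; the second <li> is a hypothetical
// addition so there is one link with class "off" and one without.
$html = <<<HTML
<ul>
    <li class="large"><a id="ref_3" class="off" href="#">07</a></li>
    <li class="large"><a id="ref_4" class="on" href="#">08</a></li>
</ul>
HTML;

libxml_use_internal_errors(true);   // collect HTML parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);

// loadHTML wraps the fragment in <html><body>, so search with // from the root.
$classes = [];
foreach ($sxml->xpath('//li') as $li) {
    // Array-style access reads the attribute that var_dump() of the <li> hides.
    $classes[] = (string) $li->a['class'];
}

// XPath can also filter on the class directly and return only matching links.
$off = [];
foreach ($sxml->xpath("//li/a[@class='off']") as $a) {
    $off[] = (string) $a;
}

print implode(',', $classes) . PHP_EOL;   // off,on
print implode(',', $off) . PHP_EOL;       // 07
```

Note that `@class='off'` matches only when the attribute is exactly "off". If the real page puts several class names on one element, a `contains(concat(' ', normalize-space(@class), ' '), ' off ')` predicate is the usual XPath 1.0 workaround.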