php  web-scraping  domxpath

How to scrape content from a SimpleXMLElement object?


I am trying to scrape the contents of the infobox that appears on the right-hand side of a Wikipedia page.

I am using DOMXPath to scrape the contents.

In the infobox on this page (on the right-hand side), I am trying to scrape the "Traded as" section. However, in the page source that section is made up of several links (<a href> elements).

Traded as:  NASDAQ: GOOG
            NASDAQ-100 Component
            S&P 500 Component

My SimpleXMLElement object looks like this:

SimpleXMLElement object {
  @attributes => array(1) (
    [class] => (string)
  )
  th => SimpleXMLElement object {
    @attributes => array(2) (
      [scope] => (string) row
      [style] => (string) text-align:left;
    )
    a => (string) Traded as
  }
  td => SimpleXMLElement object {
    @attributes => array(2) (
      [class] => (string)
      [style] => (string)
    )
    a => array(4) (
      [0] => (string) NASDAQ
      [1] => (string) GOOG
      [2] => (string) NASDAQ-100 Component
      [3] => (string) S&P 500 Component
    )
  }
}

This is what I have tried so far to scrape the contents:

foreach ($xmlElements->xpath("//div[@id='mw-content-text']/table[@class='infobox vcard']/tr") as $node)
{
    $name = (string) $node->th;
    if (empty($name))
        $name = (string) $node->th->a;
    if (is_array($node->td->a))
        $value = implode('~', (string) $node->td->a);
    else
        $value = (string) $node->td->a;
}

However, I am not able to get the value as "NASDAQ: GOOG ~ NASDAQ-100 Component ~ S&P 500 Component". Instead I get only "NASDAQ", which is not what I need.

How do I get the value from the node when it is an array?

I hope my question is clear. Any help would be appreciated.


Solution

  • Please see http://www.laprbass.com/RAY_temp_user1518659.php

    Outputs: string(64) "NASDAQ: GOOG ~ NASDAQ-100 Component ~ S&P 500 Component"

    This is really much easier to get right if you just use native PHP functions!

    <?php // RAY_temp_user1518659.php
    error_reporting(E_ALL);
    echo '<pre>';
    
    // ACQUIRE THE DOCUMENT
    $url = 'http://en.wikipedia.org/wiki/Google';
    $htm = file_get_contents($url);
    if ($htm === false) die("UNABLE TO READ $url");
    
    // ACTIVATE THIS TO SEE THE ENTIRE DOCUMENT
    // echo htmlentities($htm);
    
    // ISOLATE THE "TRADED AS" PART: FROM THE LABEL TO THE END OF ITS TABLE ROW
    $sig = 'Traded as';
    $arr = explode($sig, $htm);
    if (!isset($arr[1])) die("SIGNATURE NOT FOUND: $sig");
    $htm = $arr[1];
    $sig = '</tr>';
    $arr = explode($sig, $htm);
    $htm = $arr[0];
    
    // REFORMAT THE DATA INTO A TILDE-SEPARATED STRING
    // SPLIT ON ANY LINE BREAK (PHP_EOL IS "\r\n" ON WINDOWS, BUT THE PAGE MAY USE "\n")
    $new = trim(strip_tags($htm));
    $new = preg_split('/\R/', $new);
    $new = array_map('trim', $new);
    $new = implode(' ~ ', $new);
    
    // SHOW THE WORK PRODUCT
    var_dump($new);
    

    Best regards, ~Ray
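
For anyone who would rather stay with SimpleXML, here is a minimal sketch of why the original attempt returns only "NASDAQ" and how the full value might be assembled. In SimpleXML, $node->td->a is not a PHP array (even though the dump prints it as one); it is an iterable SimpleXMLElement, so is_array() returns false and the (string) cast yields only the first <a>. The inline XML fragment below is a hypothetical stand-in for the infobox row, and the assumption that the entries are separated by <br/> tags is mine, not taken from the real page source. The helper name cellLines() is also made up for this example.

```php
<?php
// Hypothetical stand-in for the "Traded as" row of the infobox (assumed markup).
$xml = <<<'XML'
<tr class="">
  <th scope="row" style="text-align:left;"><a href="#">Traded as</a></th>
  <td class="" style=""><a href="#">NASDAQ</a>: <a href="#">GOOG</a><br/>
<a href="#">NASDAQ-100 Component</a><br/>
<a href="#">S&amp;P 500 Component</a></td>
</tr>
XML;

$row = simplexml_load_string($xml);

// $row->td->a is a SimpleXMLElement, not an array, so this is false.
var_dump(is_array($row->td->a));

// Option 1: iterate the <a> children and collect their text.
$parts = [];
foreach ($row->td->a as $a) {
    $parts[] = trim((string) $a);
}
$links = implode(' ~ ', $parts);
echo $links, PHP_EOL; // NASDAQ ~ GOOG ~ NASDAQ-100 Component ~ S&P 500 Component

// Option 2: keep the text between the links (the ": ") by walking the
// cell's child nodes through DOM and starting a new entry at each <br>.
function cellLines(SimpleXMLElement $cell)
{
    $lines = [''];
    foreach (dom_import_simplexml($cell)->childNodes as $child) {
        if ($child->nodeName === 'br') {
            $lines[] = '';          // a <br> starts the next entry
            continue;
        }
        $lines[count($lines) - 1] .= $child->textContent;
    }
    $lines = array_map('trim', $lines);
    // Drop entries that are empty after trimming.
    return array_values(array_filter($lines, function ($s) {
        return $s !== '';
    }));
}

$name  = trim((string) $row->th->a);
$value = implode(' ~ ', cellLines($row->td));
echo $name, ': ', $value, PHP_EOL;
// Traded as: NASDAQ: GOOG ~ NASDAQ-100 Component ~ S&P 500 Component
```

Option 1 is enough when only the link texts matter. Option 2 is closer to the requested output, but it relies on the <br/> assumption above, so it would need adjusting if the real markup separates the entries differently.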