php xml domdocument domxpath

Extract img src from a text element in an XML feed


I have an XML feed that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<smf:xml-feed xmlns:smf="http://www.simplemachines.org/" xmlns="http://www.simplemachines.org/xml/recent" xml:lang="en-US">
  <recent-post>
    <time>April 04, 2021, 04:20:47 pm</time>
    <id>1909114</id>
    <subject>Title</subject>
    <body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>]]></body>
  </recent-post>
</smf:xml-feed>

I want to extract the image src from the body and then save the feed to a new XML file that includes an element for the image.

So far, I have:

$xml = 'https://example.com/feed.xml';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->recover = true;
libxml_use_internal_errors(true);
$dom->loadXML($xml);

$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( 'smf:xml-feed/recent-post/body' );

foreach( $nodes as $node )
{
    $html = new DOMDocument();
    $html->loadHTML( $node->nodeValue );
    $src = $html->getElementsByTagName( 'img' )->item(0)->getAttribute('src');
    echo $src;
}

But when I try to print out $nodes, I get nothing. What am I missing?


Solution

  • This looks like a Simple Machines feed. However, the namespaces are missing, and the "body" element should contain a CDATA section with an HTML fragment as text. I would expect it to look like this:

    <smf:xml-feed 
      xmlns:smf="http://www.simplemachines.org/" 
      xmlns="http://www.simplemachines.org/xml/recent" 
      xml:lang="en-US">
        <recent-post>
        <time>April 04, 2021, 04:20:47 pm</time>
        <id>1909114</id>
        <subject>Title</subject>
        <body><![CDATA[
        <a href="#"><img src="image.png">Lorem ipsum dolor sit amet, consectetur adipisicing elit. Iure rerum in tempore sit ducimus doloribus quod commodi eligendi ipsam porro non fugiat nisi eaque delectus harum aspernatur recusandae incidunt quasi.</a>
        ]]>
        </body>
      </recent-post>
    </smf:xml-feed>
    

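    The XML defines two namespaces: one bound to the smf prefix and a default namespace that applies to the unprefixed elements. To address namespaced elements in XPath expressions, you have to register a prefix for each namespace. An unprefixed name in XPath 1.0 only matches elements in no namespace, which is why recent-post and body are not found. I suggest iterating the recent-post elements and fetching the text content of specific child nodes with expressions that cast to string, such as string(recent:subject).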
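
    To see the effect of the prefix, here is a minimal sketch that runs the same path with and without it. The shortened feed string and the variable names are assumptions made for illustration; only the namespace URIs are taken from the feed.

    ```php
    <?php
    // Shortened stand-in for the feed (an assumption for this demo).
    $xmlString = '<smf:xml-feed xmlns:smf="http://www.simplemachines.org/"'
        . ' xmlns="http://www.simplemachines.org/xml/recent">'
        . '<recent-post><subject>Title</subject></recent-post>'
        . '</smf:xml-feed>';

    $document = new DOMDocument();
    $document->loadXML($xmlString);

    $xpath = new DOMXPath($document);
    $xpath->registerNamespace('smf', 'http://www.simplemachines.org/');
    $xpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');

    // Unprefixed step: asks for a recent-post element in no namespace.
    $unprefixed = $xpath->query('/smf:xml-feed/recent-post');
    // Prefixed step: asks for recent-post in the feed's default namespace.
    $prefixed = $xpath->query('/smf:xml-feed/recent:recent-post');

    var_dump($unprefixed->length); // int(0)
    var_dump($prefixed->length);   // int(1)
    ```

    The prefix used in the expression does not have to match anything in the document; only the namespace URI it is registered for matters.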

    The body element contains the HTML fragment as text, so you need to load it into a separate document. You can then use XPath on that document to fetch the src of the img:

    $feedDocument = new DOMDocument();
    $feedDocument->preserveWhiteSpace = false;
    // $xmlString holds the feed XML as a string; to read from a URL or file path, use load() instead
    $feedDocument->loadXML($xmlString);
    $feedXpath = new DOMXPath($feedDocument);
    
    // register namespaces
    $feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
    $feedXpath->registerNamespace('recent', 'http://www.simplemachines.org/xml/recent');
    
    // iterate the posts
    foreach($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
        // demo: fetch post subject as string
        var_dump($feedXpath->evaluate('string(recent:subject)', $post));
        
        // create a document for the HTML fragment
        $html = new DOMDocument();
        $html->loadHTML(
            // load the text content of the body element
            $feedXpath->evaluate('string(recent:body)', $post),
            // just a fragment, no need for html document elements or DTD
            LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
        );
        // XPath instance for the HTML document
        $htmlXpath = new DOMXPath($html);
        // fetch the first src attribute found on an img element
        $src = $htmlXpath->evaluate('string(//img/@src)');
        var_dump($src);
    }
    

    Output:

    string(5) "Title"
    string(9) "image.png"
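
    The question also asks for a new XML file with an element for the image. One possible way to do that, as a rough sketch rather than a definitive solution, is to append a new element to each post and serialize the document. The element name image, the output file name and the shortened feed string are assumptions made for illustration.

    ```php
    <?php
    // Shortened stand-in for the feed (an assumption for this demo).
    $xmlString = '<smf:xml-feed xmlns:smf="http://www.simplemachines.org/"'
        . ' xmlns="http://www.simplemachines.org/xml/recent">'
        . '<recent-post><subject>Title</subject>'
        . '<body><![CDATA[<a href="#"><img src="image.png">Lorem ipsum</a>]]></body>'
        . '</recent-post></smf:xml-feed>';

    $recentNamespace = 'http://www.simplemachines.org/xml/recent';

    $feedDocument = new DOMDocument();
    $feedDocument->preserveWhiteSpace = false;
    $feedDocument->formatOutput = true;
    $feedDocument->loadXML($xmlString);

    $feedXpath = new DOMXPath($feedDocument);
    $feedXpath->registerNamespace('smf', 'http://www.simplemachines.org/');
    $feedXpath->registerNamespace('recent', $recentNamespace);

    foreach ($feedXpath->evaluate('/smf:xml-feed/recent:recent-post') as $post) {
        $body = $feedXpath->evaluate('string(recent:body)', $post);
        if (trim($body) === '') {
            continue; // loadHTML() does not accept an empty string
        }

        $html = new DOMDocument();
        $html->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
        $src = (new DOMXPath($html))->evaluate('string(//img/@src)');

        if ($src !== '') {
            // hypothetical "image" element, created in the feed's default namespace
            $image = $feedDocument->createElementNS($recentNamespace, 'image');
            $image->appendChild($feedDocument->createTextNode($src));
            $post->appendChild($image);
        }
    }

    // serialize the modified feed; save() would write it to a file instead
    echo $feedDocument->saveXML();
    // $feedDocument->save('feed-with-images.xml'); // hypothetical file name
    ```

    Creating the element in the same namespace as its siblings keeps it addressable as recent:image in later XPath expressions.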