Tags: html, simplexml, domdocument, w3c-validation, phpquery

DOMDocument saveHTML is not returning correct HTML Standards for "IMG", "INPUT"


I'm a big fan of the PHP library phpQuery content parser (because it's quite like jQuery, while using the PHP DOMDocument to extract markup), but I've noticed a bug with elements written in the self-closing form, such as <img />, rather than with a separate closing tag, such as <div></div>.

The same thing happens with plain DOMDocument, not only with phpQuery.

I've written a simple class, PhpContentDocument, to dump a simple HTML document.

require_once "../phpquery_lib/phpQuery.php";
require_once "PhpContentDocument.class.php";

$sample_document = new PhpContentDocument('Sample Document');
$sample_document->addElement('text element', "<span class='text_element'>This is some Sample Text</span>");
$sample_document->addElement('image element', "<img src='png_file.png' alt='png_file' id='png_file' />");

$sample_document_string = $sample_document->get_string();

The results are what you would expect ...

<!DOCTYPE HTML>
<html>
<head>
<title>Sample Document</title>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<body>
<span class='text_element'>This is some Sample Text</span>
<img src='png_file.png' alt='png_file' id='png_file' />
</body>
</html>

But when I load that string into a DOMDocument and write it back out using saveHTML()

$php_query_document = new DOMDocument('1.0', 'UTF-8'); // constructor order is (version, encoding)
$php_query_document->formatOutput = true;
$php_query_document->preserveWhiteSpace = true;
$php_query_document->loadHTML($sample_document_string);

$php_query_document_string = $php_query_document->saveHTML();

echo $php_query_document_string;

it returns ...

<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<title>Sample Document</title>
</head>
<body>
<span class="text_element">This is some Sample Text</span>
<img src="png_file.png" alt="png_file" id="png_file">
</body>
</html>

The main problem is that this breaks when I use SimpleXMLElement on a single element, for example img#png_file.

The content parser passes <img src="png_file.png" alt="png_file" id="png_file"> as the constructor argument:

$simple_doc = new SimpleXMLElement((string) $php_query_document->find('img#png_file'));

I get the following warnings and exception, even though my original markup would work with SimpleXMLElement.

Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : Premature end of data in tag img line 1 in F:\xampp\htdocs\Test_Code\phpquery_test_items\index.php on line 17

Warning: SimpleXMLElement::__construct(): <img src="png_file.png" alt="png_file" id="png_file"> in F:\xampp\htdocs\Test_Code\phpquery_test_items\index.php on line 17

Warning: SimpleXMLElement::__construct(): ^ in F:\xampp\htdocs\Test_Code\phpquery_test_items\index.php on line 17

Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in F:\xampp\htdocs\Test_Code\phpquery_test_items\index.php:17 Stack trace: #0 F:\xampp\htdocs\Test_Code\phpquery_test_items\index.php(17): SimpleXMLElement->__construct('<img src="png_f...') #1 {main} thrown in F:\xampp\htdocs\Test_Code\phpquery_test_items\index.php on line 17

This happens because the serialised img element has no closing tag, so it is not well-formed XML.

TL;DR: Warning: SimpleXMLElement::__construct(): Entity: line 1: parser error : Premature end of data in tag img line 1

How can I fix this? I do have some ideas but preferably


Solution

  • If you use DOMDocument::saveXML() instead of DOMDocument::saveHTML() you'll get well-formed XML, in which empty elements such as img are self-closed.

    If necessary you could then strip the XML declaration line <?xml version="1.0" encoding="UTF-8" standalone="yes"?>.
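To make the difference between the two serialisers concrete, here is a minimal sketch. The markup and the variable names ($html, $as_html, $as_xml, $without_declaration) are made up for illustration; only loadHTML(), saveHTML() and saveXML() come from the discussion above. It also shows one way to avoid the declaration rather than stripping it afterwards: passing a node to saveXML() serialises only that subtree.

```php
<?php
// Hypothetical minimal input, loosely modelled on the sample document.
$html = '<!DOCTYPE html><html><head><title>Sample Document</title></head>'
      . '<body><img src="png_file.png" alt="png_file" id="png_file"></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// HTML serialisation: void elements such as <img> are written without a closing slash.
$as_html = $doc->saveHTML();

// XML serialisation of the whole document: begins with an XML declaration.
$as_xml = $doc->saveXML();

// Serialising from the root element down: no declaration is emitted.
$without_declaration = $doc->saveXML($doc->documentElement);

echo $without_declaration, "\n";
```

Note that serialising from the root element also leaves out the DOCTYPE, which may or may not matter for your use.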


    I just realized you want the find() method to return the proper XML. Therefore I'm not sure my above-mentioned suggestion is all that helpful, if it means you have to alter the class that implements that method.

    Perhaps you could do something a little convoluted like:

    $node = $php_query_document->find('img#png_file');
    $simple_doc = new SimpleXMLElement( $node->ownerDocument->saveXML( $node ) );
    

    This presupposes $node is some implementation of DOMNode, which I suspect it is. What this does is ask the $node->ownerDocument (the DOMDocument that contains the node) to save only that specific node as XML.
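Here is a self-contained sketch of the same idea using only DOMDocument, so it can be tried without phpQuery. The DOMXPath query is a stand-in for find('img#png_file'), and the markup and variable names are illustrative. If find() hands back a phpQuery wrapper object rather than a DOMNode, the underlying node would have to be pulled out first (phpQuery appears to offer get(0) for that, but treat that as an assumption). The sketch also shows simplexml_import_dom(), which avoids the string round trip altogether.

```php
<?php
// Illustrative markup; in the question this would come from phpQuery instead.
$html = '<html><body><span class="text_element">This is some Sample Text</span>'
      . '<img src="png_file.png" alt="png_file" id="png_file"></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Plain-DOM stand-in for find('img#png_file').
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//img[@id="png_file"]')->item(0);

// Serialise just this node as XML, then hand the string to SimpleXML.
$node_xml   = $node->ownerDocument->saveXML($node);
$simple_doc = new SimpleXMLElement($node_xml);

// Alternative that skips the string round trip: wrap the DOM node directly.
$imported = simplexml_import_dom($node);

echo $node_xml, "\n";
echo $simple_doc['src'], "\n";
echo $imported['alt'], "\n";
```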


    Another possibility (which I would not necessarily recommend) is to make SimpleXML more lenient when parsing by passing the following libxml constants to the constructor:

    $simple_doc = new SimpleXMLElement(
        (string) $php_query_document->find('img#png_file'), 
        LIBXML_NOERROR | LIBXML_ERR_NONE | LIBXML_ERR_FATAL
    );
    

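    This suppresses libxml error reports while parsing the content. libxml is the underlying XML parser used by SimpleXML and DOMDocument (amongst others). Note that only LIBXML_NOERROR is documented as a parser option; LIBXML_ERR_NONE and LIBXML_ERR_FATAL are documented as error-level constants (the values found in LibXMLError::$level), so whatever they do when passed here is a side effect of their numeric values rather than documented behaviour.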
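If the goal is only to keep the warnings out of the output while still detecting the failure, a different, documented route is libxml_use_internal_errors(). This is a swapped-in technique, not the option-flag approach above. A minimal sketch, with an illustrative fragment and variable names: the parse errors are collected instead of printed, and the exception from SimpleXMLElement is caught.

```php
<?php
// The fragment as saveHTML() would emit it: no closing slash, so not well-formed XML.
$fragment = '<img src="png_file.png" alt="png_file" id="png_file">';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$messages = [];
try {
    $simple_doc = new SimpleXMLElement($fragment);
    $parsed = true;
} catch (Exception $e) {
    // The constructor throws when the string cannot be parsed as XML.
    $parsed = false;
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
}

// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);

echo $parsed ? "parsed\n" : "not parsed: " . implode('; ', $messages) . "\n";
```

This does not make the fragment parse; it only makes the failure quiet and inspectable, so it works best as a guard around one of the saveXML() approaches.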