php domdocument

Why is DOMDocument converting both HTML quote entities to actual quotes?


I've been at this for half a day, so now it's time to ask for help.

What I'd like is for DOMDocument to leave existing entities and UTF-8 characters alone. I'm starting to think this isn't possible with DOMDocument alone.

$html =
'<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>&#39; &quot; & &lt; © 庭</p>
    </body>
</html>';

Then I run:

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR);

echo $dom->saveHTML();

And get this output:

input: &#39; &quot; & &lt; © 庭
output: ' " &amp; &lt; &copy; &#24237;

Why is DOMDocument converting &#39; and &quot; to actual quote marks? The only thing it didn't touch was &lt;.

I'm pretty sure the copyright symbol is being converted because DOMDocument doesn't think the input HTML is UTF-8, but I'm utterly confused about why it's converting the quote entities back to literal quote characters.
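A small sketch of what seems to be going on, using an ASCII-only sample so that encoding detection plays no part (the variable names are illustrative). The parser decodes every entity into a plain character when it builds the tree, so by the time the node is serialized there is no record of which characters were originally written as entities. The serializer then escapes only what HTML text content requires:

```php
<?php
// ASCII-only sample, so character encoding cannot affect the result.
$sample = '<!doctype html><html><body><p>&#39; &quot; &amp; &lt;</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($sample, LIBXML_NOERROR);

$p = $doc->getElementsByTagName('p')->item(0);

// The text node holds decoded characters, not the original entity spellings.
$text = $p->textContent;
echo $text, PHP_EOL;

// Serializing the node re-escapes & and <, but quotes are legal in text
// content, so they are written out literally.
$serialized = $doc->saveHTML($p);
echo $serialized, PHP_EOL;
```

If this reading is right, the quotes are not being "converted back" on output. They stopped being entities at parse time, and nothing in the tree distinguishes `&#39;` from a typed `'`.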

I thought the mb_convert_encoding trick would fix the UTF-8 issue, but it hasn't.

Neither has the $dom->loadHTML('<?xml encoding="utf-8" ?>'.$html); trick.
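For reference, here is a sketch of what the mb_convert_encoding trick does to the input before DOMDocument ever sees it. The exact call isn't shown above, so the commonly used form with the 'HTML-ENTITIES' target is an assumption, and `$fragment` is an illustrative name:

```php
<?php
// Assumed form of the "trick"; the post does not show the exact call.
// Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2 and
// triggers a deprecation notice there.

// Escape sequences keep this file independent of its own encoding.
$fragment = "<p>\u{00A9} \u{5EAD}</p>";

// Every non-ASCII character is replaced by an entity reference.
$converted = mb_convert_encoding($fragment, 'HTML-ENTITIES', 'UTF-8');

echo $converted, PHP_EOL;
```

Since the non-ASCII characters are already entities before parsing, this approach is aimed at avoiding mis-decoded bytes. It is unlikely to help when the goal is to keep the raw UTF-8 characters in the output.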


Solution

  • I tested about a dozen HTML parsers written in PHP, and the only one that worked as expected was HTML5DOMDocument, recommended in this Stack Overflow answer.

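    // Assumes the library was installed with Composer:
    // composer require ivopetkov/html5-dom-document-php
    require 'vendor/autoload.php';

    // $html holds the HTML document string to parse.
    $dom = new IvoPetkov\HTML5DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR);

    echo $dom->saveHTML();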
    

    Result:

    input: &#39; &quot; &lt; © 庭 &nbsp; &
    output: &#39; &quot; &lt; © 庭 &nbsp; &amp;