I am trying to get the inner HTML of a DOMElement in PHP. Example markup:
<div>...</div>
<div id="target"><p>Here's some &nbsp; <em>funny</em> &nbsp; text</p></div>
<div>...</div>
<div>...</div>
Feeding the above string into the variable $html, I am doing:
$doc = new DOMDocument();
@$doc->loadHTML("<html><body>$html</body></html>");
$node = $doc->getElementById('target');
$markup = '';
foreach ($node->childNodes as $child) {
    $markup .= $child->ownerDocument->saveXML($child);
}
The resulting $markup string looks like this (converted to JSON to reveal the invisible characters):
"<p>Here's some \u00a0 <em>funny<\/em> \u00a0 text<\/p>"
All &nbsp; entities have been converted to Unicode non-breaking space characters, which breaks my application.
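For readers who want to confirm what that invisible character is, here is a minimal sketch. The sample string is made up for illustration and is not the real markup. It shows that U+00A0 is stored in UTF-8 as the two bytes 0xC2 0xA0, and that json_encode() escapes it as \u00a0:

```php
<?php
// Illustrative sample string (an assumption, not taken from the real markup).
$sample = "some\u{00a0}text";

// U+00A0 (no-break space) is encoded in UTF-8 as the two bytes 0xC2 0xA0.
var_dump("\u{00a0}" === "\xc2\xa0"); // bool(true)

// json_encode() escapes non-ASCII characters as \uXXXX by default,
// which makes the otherwise invisible character show up.
$json = json_encode($sample);
echo $json, PHP_EOL; // "some\u00a0text"
```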
In my ideal world, there would be a way to retrieve the original string of HTML inside the target div as-is, without DOMDocument doing anything to it at all. That doesn't seem to be possible, so the next best thing would be to somehow turn off this character conversion. So far I've tried:
- Setting $doc->substituteEntities = false; with no result. Changing it to true doesn't help either.
- Toggling $doc->preserveWhiteSpace, with no change either way.
- Changing saveXML to saveHTML. Doesn't make a difference.
Finally I resorted to tacking on this hack, which works but doesn't feel like the right solution:
$markup = str_replace("\xc2\xa0", '&nbsp;', $markup);
Surely there is a better way?
You can use the rather cryptic function mb_encode_numericentity() to convert only the characters outside the printable ASCII range into numeric entities, so it won't touch your markup:
<?php
$html = <<< HTML
<div>...</div>
<div id="target"><p>Here's some &nbsp; <em>funny 😂</em> &nbsp; text</p></div>
<div>...</div>
<div>...</div>
HTML;
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$doc->loadHTML("<html><head><meta charset=UTF-8></head><body>$html</body></html>");
$node = $doc->getElementById('target');
$markup = '';
foreach ($node->childNodes as $child) {
    $markup .= $child->ownerDocument->saveHTML($child);
}
$convmap = [
    0x00, 0x1f, 0, 0xff,
    0x7f, 0x10ffff, 0, 0xffffff,
];
$markup = mb_encode_numericentity($markup, $convmap, "UTF-8");
echo $markup;
Output:
<p>Here's some &#160; <em>funny &#128514;</em> &#160; text</p>
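Since the conversion map is the cryptic part, here is a minimal sketch of how it is read. Each group of four integers is a start code point, an end code point, an offset, and a mask. The map and sample string below are illustrative assumptions, not the ones used above. This map starts at U+0080, so control characters such as tabs and newlines are left alone, whereas the 0x00 to 0x1f range in the map above would encode them too:

```php
<?php
// Illustrative map (an assumption, narrower than the one in the answer):
// [start code point, end code point, offset, mask]
$convmap = [0x80, 0x10ffff, 0, 0x1fffff];

// Sample input containing e-acute (U+00E9) and a no-break space (U+00A0).
$input = "caf\u{00e9}\u{00a0}ok";

// Every character inside the range is replaced by &#<decimal code point>;
// ASCII characters, including markup such as < and >, are left as they are.
$encoded = mb_encode_numericentity($input, $convmap, 'UTF-8');

echo $encoded, PHP_EOL; // caf&#233;&#160;ok
```

Passing the same map to mb_decode_numericentity() reverses the conversion, which is handy if the entities are only needed for transport.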
This is outside the scope of the original question, but I've added an emoji to the string as well. To handle multibyte characters correctly, <meta charset="UTF-8"> makes DOMDocument's HTML parser treat the input as UTF-8 instead of its default ISO-8859-1.
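The charset declaration can be tucked into a small helper. The following is only a sketch: loadUtf8Fragment() is a hypothetical name, not a PHP or DOM function, and it assumes the underlying libxml2 build recognizes the <meta charset> form. Printing the text content as hex shows whether the emoji survived parsing:

```php
<?php
// Hypothetical helper (not part of PHP): wraps a UTF-8 fragment in a
// document that declares its charset before handing it to loadHTML().
function loadUtf8Fragment(string $fragment): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(
        '<html><head><meta charset="UTF-8"></head><body>' . $fragment . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc = loadUtf8Fragment('<div id="target"><em>funny 😂</em></div>');
$text = $doc->getElementById('target')->textContent;

// bin2hex() makes the bytes visible; a correctly decoded U+1F602 is the
// four bytes f0 9f 98 82 at the end of the string.
echo bin2hex($text), PHP_EOL;
```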