Tags: php, dom, domdocument, php-5.2

Preventing DOMDocument::loadHTML() from converting entities


I have an HTML string from which I'm trying to extract list items. I'd like to extract each item's text and any child nodes; however, DOMDocument converts the entities to their characters instead of leaving them in their original form.

I've tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. Note that I'm running PHP 5.2.17 on Windows 7.

Example code is:

$example = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li></ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));
    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;
}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }

    return $innerHTML;
}

&frac12; ends up converted to ½ (the single UTF-8 character rather than the entity), which is not the desired format.
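As a rough illustration of the conversion described above, the same round trip can be reproduced without DOMDocument. This is only a sketch: it assumes that html_entity_decode() approximates what the HTML parser does to the entity, and that htmlentities() with an explicit UTF-8 charset is an acceptable way to turn the character back into its named form.

```php
<?php
// Decode the named entity into its UTF-8 character, roughly what the
// HTML parser does while building the DOM tree.
$decoded = html_entity_decode('&frac12; of this', ENT_QUOTES, 'UTF-8');

// Re-encode every character that has a named HTML entity.
$encoded = htmlentities($decoded, ENT_QUOTES, 'UTF-8');

echo bin2hex(substr($decoded, 0, 2)), PHP_EOL; // c2bd (the UTF-8 bytes of ½)
echo $encoded, PHP_EOL;                        // &frac12; of this
```

Note that re-encoding cannot tell whether the original markup used the entity or the literal character, so both forms come back as &frac12;.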


Solution

  • Based on the answer provided by ajreal, I've expanded the example variable to cover more cases, and changed _get_inner_html() to call itself recursively and to handle the entity conversion for text nodes.

    It's probably not the best answer, since it makes some assumptions about the elements (such as no attributes): each tag is rebuilt from its node name alone, so attributes like attr="3" in the example are dropped. But since my particular needs don't require attributes to be carried across (yet; I'm sure my sample data will throw that one at me later on), this solution works for me.

    $example = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li>'.
    '<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
    '<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
    '</ul>';
    
    echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;
    
    $doc = new DOMDocument();
    $doc->resolveExternals = true;
    $doc->substituteEntities = false;
    
    $doc->loadHTML($example);
    
    $domNodeList = $doc->getElementsByTagName('li');
    $count = $domNodeList->length;
    
    for ($idx = 0; $idx < $count; $idx++) {
        $value = trim(_get_inner_html($domNodeList->item($idx)));
    
        /* remainder of processing and storing in database */
        echo 'Saved '.$value.PHP_EOL;
    
    }
    
    function _get_inner_html( $node ) {
        $innerHTML= '';
        $children = $node->childNodes;
        foreach ($children as $child) {
            echo 'Node type is '.$child->nodeType.PHP_EOL;
            switch ($child->nodeType) {
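            case XML_TEXT_NODE:
                // Text nodes hold decoded UTF-8; convert characters back to named entities.
                // Passing the charset avoids the lossy iconv() step, which fails outside ISO-8859-1.
                $innerHTML .= htmlentities($child->nodeValue, ENT_COMPAT, 'UTF-8');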
                break;
            default:
                echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
                echo 'Node name '.$child->nodeName.PHP_EOL;
                $innerHTML .= '<'.$child->nodeName.'>';
                $innerHTML .= _get_inner_html( $child );
                $innerHTML .= '</'.$child->nodeName.'>';
                break;
            }
        }
    
        return $innerHTML;
    }
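
If attributes do need to be carried across later, the recursion above could be extended along these lines. This is a minimal sketch, not a tested replacement: the function name inner_html_with_entities() is made up for illustration, it assumes the DOM extension is available, and it silently skips comments and other non-element nodes.

```php
<?php
// Hypothetical variant of _get_inner_html() that also copies attributes.
function inner_html_with_entities($node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // The parser stores text as decoded UTF-8; turn it back into named entities.
            $html .= htmlentities($child->nodeValue, ENT_QUOTES, 'UTF-8');
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            $html .= '<' . $child->nodeName;
            foreach ($child->attributes as $attr) {
                // Attribute values are escaped the same way as text.
                $html .= ' ' . $attr->nodeName . '="'
                    . htmlentities($attr->nodeValue, ENT_QUOTES, 'UTF-8') . '"';
            }
            // Void elements such as <br> would still get a closing tag here.
            $html .= '>' . inner_html_with_entities($child) . '</' . $child->nodeName . '>';
        }
        // Comments and other node types are ignored in this sketch.
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Nested <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li></ul>');
echo inner_html_with_entities($doc->getElementsByTagName('li')->item(0)), PHP_EOL;
```

Because ENT_QUOTES is used for text nodes as well, literal quotes in the text are escaped too; pass ENT_NOQUOTES there if they should stay untouched.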