php domdocument

How can I remove tag names but leave the inner HTML contents using DOMDocument?


I have some terribly formed HTML, thanks to MS Word 10's "save as htm, html" option. Here's a sample of what I'm trying to sanitize.

<html xmlns:v="urn:schemas-microsoft-com:vml"... other xmlns>
    <head>
        <meta tags, title, styles, a couple comments too (they are irrelevant to the question)>
    </head>
    <body lang=EN-US link=blue vlink=purple style='tab-interval:36.0pt'>
        <div class=WordSection1>
            <h1>Pros and Cons of a Website</h1>
            <p class=MsoBodyText align=left style='a long irrelevant list'><span style='long list'><o:p>&nbsp;</o:p></span></p>(this is a sample of what it uses as line breaks. Take note of the <o:p> tag).
            <p class=MsoBodyText style='margin-right:5.75pt;line-height:115%'>
                A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT
            </p>
        </div>
        <div class=WordSection2>...same pattern in div 1</div>
        <div class=WordSection3>...same...</div>
   </body>
</html>

What I need from all of this is:

<div>...A SAMPLE TEXT</div>
<div>...same pattern in div 1</div>
<div>...same...</div>

What I have so far:

$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$body = $xpath->query('//html/body');
$nodes = $body->item(0)->getElementsByTagName('*');
foreach ($nodes as $node) {
    if($node->tagName=='script') $node->parentNode->removeChild($node);
    if($node->tagName=='a') continue;
    $attrs = $xpath->query('@*', $node);
    foreach($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($body->item(0)));

It gives me:

<body lang="EN-US" link="blue" vlink="purple" style="tab-interval:36.0pt">
    <div>
        <h1>Pros and Cons of a Website</h1>
        <p><p> </p></p>
        <p>A SAMPLE TEXT</p>
    </div>
    <div>...same pattern in div 1</div>
    <div>...same...</div>
</body>

which I'm good with, but I want the body tag out. I also want the h1 and its content out too, but when I say:

if($node->tagName=='script' || $node->tagName=='h1') $node->parentNode->removeChild($node);

something weird happens:

<p><p> </p></p> becomes <p class="MsoBodyText" ...all that very long stuff I was trying to remove in the first place><p> </p></p>

I've come across some very good answers like:

  1. How to get innerHTML of DOMNode? I don't know how to properly implement Haim Evgi's answer or Keyacom's answer. Marco Marsala's answer is the closest I got, but the divs all kept their classes.

Solution

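  • getElementsByTagName() returns a live DOMNodeList, so removing the h1 shifts every later entry in $nodes down by one position. The foreach then advances past the element that moved into the vacated slot, which is why <p class="MsoBodyText"> is skipped and keeps its attributes. To avoid this, replace foreach with a for loop and decrement the current index whenever a node is removed.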

    $dom = new DOMDocument;
    @$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    
    $bodyNode = $xpath->query('//html/body')->item(0);
    $nodes = $bodyNode->getElementsByTagName('*');
    
    for ($i = 0; $i < $nodes->count(); $i++) {
        $node = $nodes->item($i);
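        if ($node->tagName == 'script' || $node->tagName == 'h1') {
            $node->parentNode->removeChild($node);
            // The live list has shrunk, so revisit the same index on the next pass.
            $i--;
            // Nothing left to strip on a node that was just removed.
            continue;
        }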
        if ($node->tagName == 'a') {
            continue;
        }
        $attrs = $xpath->query('@*', $node);
        foreach ($attrs as $attr) {
            $attr->parentNode->removeAttribute($attr->nodeName);
        }
    }
    echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($bodyNode)) . PHP_EOL;
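A possible variation on the loop above is to avoid the live list altogether. The node list returned by DOMXPath::query() is a snapshot, so nodes can be removed inside a plain foreach without any index bookkeeping. The following is only a minimal sketch under some assumptions: the $html string and the example.com link are hypothetical stand-ins for $filecontent, and libxml_use_internal_errors() is swapped in for the @ operator to silence parser warnings.

```php
<?php
// Hypothetical sample input standing in for $filecontent.
$html = '<html><body><div class="WordSection1"><h1>Pros and Cons</h1>'
      . '<p class="MsoBodyText" style="line-height:115%">A '
      . '<a href="https://example.com">SAMPLE</a> TEXT</p></div></body></html>';

// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// XPath results are not live, so removing nodes here does not shift the list.
foreach ($xpath->query('//body//script | //body//h1') as $node) {
    $node->parentNode->removeChild($node);
}

// Strip every attribute below body, except those on <a> elements.
foreach ($xpath->query('//body//*[not(self::a)]/@*') as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}

$result = $dom->saveHTML($xpath->query('//body')->item(0));
echo $result, PHP_EOL;
```

The trade-off is that the selection logic moves into the XPath expressions, which may be harder to read than the tagName checks if the list of tags grows.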
    

    Then, the saveHTML() function can be invoked for each child node, resulting in a combined output that omits the parent body tag.

    $inner = [];
    foreach ($bodyNode->childNodes as $node) {
        $inner []= trim($bodyNode->ownerDocument->saveHTML($node));
    }
    echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;
    

    As an alternative, extract the text alone and recreate the wrapping tag. Note that textContent discards all nested markup, such as the <p> tags inside each div.

    $inner = [];
    foreach ($bodyNode->childNodes as $node) {
        $text = trim($node->textContent);
        if ($node->nodeType != XML_ELEMENT_NODE) {
            $inner []= $text;
            continue;
        }
        $inner []= sprintf('<%s>%s</%s>',
            $node->tagName, $text, $node->tagName);
    }
    echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;
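
On the original question of removing a tag but keeping its contents: the str_ireplace() call only matches <span> once every attribute has been stripped from it. A DOM-based unwrap might be more robust. The sketch below is an illustration, not a drop-in fix: unwrapElement() is a hypothetical helper name, and the input string is a shortened sample modelled on the question.

```php
<?php
// Hypothetical helper: replace an element with its own child nodes.
function unwrapElement(DOMElement $el): void
{
    $parent = $el->parentNode;
    if ($parent === null) {
        return; // already detached, nothing to unwrap into
    }
    // insertBefore() moves the child, so firstChild advances on each pass.
    while ($el->firstChild !== null) {
        $parent->insertBefore($el->firstChild, $el);
    }
    $parent->removeChild($el);
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<div><p>A<span style="letter-spacing:.05pt"> </span>SAMPLE<span> </span>TEXT</p></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// The XPath result is a snapshot, so unwrapping inside foreach is safe.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span') as $span) {
    unwrapElement($span);
}

$unwrapped = $dom->saveHTML($dom->documentElement);
echo $unwrapped, PHP_EOL;
```

The same helper could be applied to other wrappers, for example a query for //p, if only the text inside each div is wanted while keeping any <a> elements intact.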