I have badly formed HTML, thanks to MS Word 10's "Save as htm, html". Here's a sample of what I'm trying to sanitize:
<html xmlns:v="urn:schemas-microsoft-com:vml"... other xmlns>
<head>
<meta tags, title, styles, a couple comments too (they are irrelevant to the question)>
</head>
<body lang=EN-US link=blue vlink=purple style='tab-interval:36.0pt'>
<div class=WordSection1>
<h1>Pros and Cons of a Website</h1>
<p class=MsoBodyText align=left style='a long irrelevant list'><span style='long list'><o:p> </o:p></span></p>(this is a sample of what it uses as line breaks. Take note of the <o:p> tag).
<p class=MsoBodyText style='margin-right:5.75pt;line-height:115%'>
A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT
</p>
</div>
<div class=WordSection2>...same pattern in div 1</div>
<div class=WordSection3>...same...</div>
</body>
</html>
What I need from all of this is:
<div>...A SAMPLE TEXT</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
What I have so far:
$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$body = $xpath->query('//html/body');
$nodes = $body->item(0)->getElementsByTagName('*');
foreach ($nodes as $node) {
    if ($node->tagName == 'script') $node->parentNode->removeChild($node);
    if ($node->tagName == 'a') continue;
    $attrs = $xpath->query('@*', $node);
    foreach ($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($body->item(0)));
It gives me:
<body lang="EN-US" link="blue" vlink="purple" style="tab-interval:36.0pt">
<div>
<h1>Pros and Cons of a Website</h1>
<p><p> </p></p>
<p>A SAMPLE TEXT</p>
</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
</body>
That is fine, except that I want the body tag gone. I also want the h1 and its content removed, but when I write:
if($node->tagName=='script' || $node->tagName=='h1') $node->parentNode->removeChild($node);
something weird happens:
<p><p> </p></p> becomes <p class="MsoBodyText" ...all the very long stuff I was trying to remove in the first place><p> </p></p>
I've come across some very good answers like:
The removal of h1 shifts the list of $nodes, causing <p class="MsoBodyText"> to be skipped in the next iteration. To avoid this, replace foreach with a for loop and decrement the current index whenever a node is removed.
$dom = new DOMDocument;
@$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$bodyNode = $xpath->query('//html/body')->item(0);
$nodes = $bodyNode->getElementsByTagName('*');
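for ($i = 0; $i < $nodes->count(); $i++) {
    $node = $nodes->item($i);
    if ($node->tagName == 'script' || $node->tagName == 'h1') {
        // The list is live: removing a node moves the later items up by one.
        $node->parentNode->removeChild($node);
        $i--;
        continue; // nothing left to clean on a removed node
    }
    if ($node->tagName == 'a') {
        continue;
    }
    // Strip every attribute from the remaining elements.
    $attrs = $xpath->query('@*', $node);
    foreach ($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}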
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($bodyNode)) . PHP_EOL;
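As an alternative to decrementing the index, the live list can be copied into a plain array before looping, so that removals no longer shift the items being iterated. The following is only a minimal sketch of that idea: the function name stripWithSnapshot and the miniature input are made up for illustration, and it only handles h1 removal and attribute stripping, not the span cleanup from the code above.

```php
<?php
// Hypothetical helper, not part of the original answer.
function stripWithSnapshot(string $html): string
{
    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // getElementsByTagName() returns a live DOMNodeList;
    // iterator_to_array() takes a static snapshot of it.
    $nodes = iterator_to_array($dom->getElementsByTagName('*'));

    foreach ($nodes as $node) {
        if ($node->tagName === 'h1') {
            $node->parentNode->removeChild($node);
            continue;
        }
        // The attribute map is live as well, so keep removing the first entry.
        while ($node->attributes->length > 0) {
            $node->removeAttributeNode($node->attributes->item(0));
        }
    }

    // Serialize from the root element of this small fragment.
    return $dom->saveHTML($dom->documentElement);
}

// Made-up sample input, much smaller than the Word export.
echo stripWithSnapshot('<div><h1>Title</h1><p class="a">one</p><p class="b">two</p></div>') . PHP_EOL;
```

The same snapshot can be taken from $bodyNode->getElementsByTagName('*') in the code above. Descendants of a removed node still appear in the snapshot, but touching them is harmless because they are already detached from the document.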
Then, the saveHTML() function can be invoked for each child node, resulting in a combined output that omits the parent body tag.
$inner = [];
foreach ($bodyNode->childNodes as $node) {
    $inner[] = trim($bodyNode->ownerDocument->saveHTML($node));
}
echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;
As an alternative, extract the text alone and recreate the wrapping tag.
$inner = [];
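foreach ($bodyNode->childNodes as $node) {
    // textContent is already entity-decoded, so escape it before re-emitting it as HTML.
    $text = htmlspecialchars(trim($node->textContent));
    if ($node->nodeType != XML_ELEMENT_NODE) {
        $inner[] = $text;
        continue;
    }
    // Rebuild the wrapper tag without any of its attributes.
    $inner[] = sprintf('<%s>%s</%s>', $node->tagName, $text, $node->tagName);
}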
echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;