My intent was to advise, on the question "Delete everything between two strings (inclusive)", using the HTMLDocument parser instead of a text-based replace command.
But somehow the OuterHTML property of the <aside> element doesn't include the whole element up to and including the </aside> end tag:
$Html = @'
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Some header elements</h1>
<aside>
<p>huge text in between aside</p>
</aside>
<div>
<p>huge text in between div</p>
</div>
<p>Some other elements</p>
</body>
</html>
'@
function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}
$Document = ParseHtml $Html
The <aside> element:
$Document.getElementsByTagName('aside') | ForEach-Object { $_.OuterHTML }
<ASIDE>
The <div> element:
$Document.getElementsByTagName('div') | ForEach-Object { $_.OuterHTML }
<DIV><P>huge text in between div</P></DIV>
What is it about the <aside> element that explains the difference from other elements such as a <div>?
How do I get the OuterHTML of the <aside> element up to and including the </aside> end tag?

I believe the answer has already been given in comments by C3roe and Mathias: the parser isn't able to correctly interpret elements introduced in HTML5. As a workaround, you can use a more modern parser, for example the one used in ConvertFrom-Html (default engine is AgilityPack).
$parsed = $html | ConvertFrom-Html
$parsed.SelectSingleNode('//aside').Remove()
$parsed.OuterHtml
# <html>
# <head>
# <title>Title</title>
# </head>
# <body>
# <h1>Some header elements</h1>
#
# <div>
# <p>huge text in between div</p>
# </div>
# <p>Some other elements</p>
# </body>
# </html>
For simple HTML like the one in question, which also happens to be well-formed XML, you could get away with using XmlDocument to parse it: select the node, then call RemoveChild() on its parent node.
$xml = [xml]::new()
$xml.PreserveWhitespace = $true
$xml.LoadXml($html)
$node = $xml.SelectSingleNode('//aside')
$null = $node.ParentNode.RemoveChild($node)
$xml.OuterXml
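
The XmlDocument approach only works when the markup is well-formed XML, which real-world HTML often is not (unclosed <br> tags, undefined entities such as &nbsp;). As a rough sketch of how that limitation could be guarded against, the same steps can be wrapped in a function that reports the parse failure instead of throwing. Remove-XmlNode and its parameter names are hypothetical, made up for this illustration; they are not part of the original answer or of any module.

```powershell
# Hypothetical helper (name and parameters are assumptions for illustration).
# Removes the first node matching an XPath, assuming the input is well-formed XML.
function Remove-XmlNode {
    param(
        [Parameter(Mandatory)] [string] $Content,
        [Parameter(Mandatory)] [string] $XPath
    )
    $xml = [xml]::new()
    $xml.PreserveWhitespace = $true
    try {
        $xml.LoadXml($Content)
    }
    catch [System.Xml.XmlException] {
        # Typical causes: unclosed void elements or undefined entities.
        Write-Warning "Input is not well-formed XML: $($_.Exception.Message)"
        return
    }
    $node = $xml.SelectSingleNode($XPath)
    if ($null -ne $node) {
        # Same technique as above: remove the node via its parent.
        $null = $node.ParentNode.RemoveChild($node)
    }
    $xml.OuterXml
}
```

With the sample from the question, Remove-XmlNode -Content $Html -XPath '//aside' should behave like the snippet above; for input that fails to parse, it emits a warning and returns nothing, at which point falling back to ConvertFrom-Html is probably the better option.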