Tags: html, powershell, html-parsing, end-tag

Why isn't the end tag included in ASIDE.OuterHTML?


My intent was to answer the question "Delete everything between two strings (inclusive)" by advising the use of the HTMLDocument parser instead of a text-based replace command.
But somehow the OuterHTML property of the <aside> element doesn't include the whole element up to and including the </aside> end tag:

html

$Html = @'
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <h1>Some header elements</h1>
        <aside>
            <p>huge text in between aside</p>
        </aside>
        <div>
            <p>huge text in between div</p>
        </div>
        <p>Some other elements</p>
    </body>
</html>
'@

Parsing

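function ParseHtml($String) {
    # Encode the markup as UTF-16 bytes before writing it to the document
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    # Use IHTMLDocument2_Write when the COM wrapper exposes it, otherwise fall back to write()
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    # Return the parsed document
    $Html
}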
$Document = ParseHtml $Html

<aside>

$Document.getElementsByTagName('aside') | ForEach-Object { $_.OuterHTML }

<ASIDE>

<div>

$Document.getElementsByTagName('div') | ForEach-Object { $_.OuterHTML }

<DIV><P>huge text in between div</P></DIV>

Solution

  • I believe the answer has already been given in the comments by C3roe and Mathias: the parser isn't able to correctly interpret elements introduced in HTML5. As a workaround, you can use a more modern parser, for example the one used by ConvertFrom-Html (its default engine is AgilityPack).

    $parsed = $html | ConvertFrom-Html
    $parsed.SelectSingleNode('//aside').Remove()
    $parsed.OuterHtml
    
    # <html>
    #     <head>
    #         <title>Title</title>
    #     </head>
    #     <body>
    #         <h1>Some header elements</h1>
    # 
    #         <div>
    #             <p>huge text in between div</p>
    #         </div>
    #         <p>Some other elements</p>
    #     </body>
    # </html>
    

    For simple HTML like the one in the question, you could get away with parsing it with XmlDocument: select the node, then call RemoveChild() on its parent node. This only works as long as the markup is also well-formed XML.

    $xml = [xml]::new()
    $xml.PreserveWhitespace = $true
    $xml.LoadXml($html)
    $node = $xml.SelectSingleNode('//aside')
    $null = $node.ParentNode.RemoveChild($node)
    $xml.OuterXml
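
If you need this more than once, the XmlDocument approach above could be wrapped in a small function. This is only a minimal sketch: the name Remove-MarkupNode and its parameters are made up for illustration and are not part of any module, and it assumes the input is well-formed XML (an unclosed <br>, for example, makes LoadXml() throw).

```powershell
# Hypothetical helper (not part of any module): removes every node matching
# an XPath expression from markup that is well-formed XML.
function Remove-MarkupNode {
    param(
        [Parameter(Mandatory)] [string] $Markup,
        [Parameter(Mandatory)] [string] $XPath
    )

    $xml = [xml]::new()
    $xml.PreserveWhitespace = $true   # keep the original indentation and line breaks

    try {
        $xml.LoadXml($Markup)
    }
    catch {
        # LoadXml() fails on anything that is not well-formed XML
        throw "Input is not well-formed XML, use an HTML parser instead: $($_.Exception.Message)"
    }

    # Copy the matches into an array first so removing nodes
    # does not disturb the enumeration of the node list
    foreach ($node in @($xml.SelectNodes($XPath))) {
        $null = $node.ParentNode.RemoveChild($node)
    }

    $xml.OuterXml
}
```

With the $Html string from the question, `Remove-MarkupNode -Markup $Html -XPath '//aside'` should return the document without the aside element. Unlike the SelectSingleNode() version, this removes all matches rather than only the first one.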