html powershell comobject

FIGCAPTION ComObject with empty innerHTML and outerHTML when parsed with PowerShell 5?


I'm parsing this HTML file

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
</head>

<body>
    <figure>
        <img src="content/test.svg" alt="">
        <figcaption>Test caption.</figcaption>
    </figure>
</body>
</html>

with PowerShell 5. The approach below works well for all other relevant tags (div, p, table, td, tr, and so on), but I can't figure out where the "Test caption." text is located in the object.

$html = New-Object -Com "HTMLFile";
$html.IHTMLDocument2_write($htmlContent);
$allTags = $html.all;
$allTags[8].tagName # is FIGCAPTION
$allTags[9].tagName # is /FIGCAPTION

But $allTags[8].outerHTML contains only <FIGCAPTION>. $allTags[9].outerHTML contains only </FIGCAPTION>. innerHTML is empty.

How can $html.documentElement.outerHTML still contain that figcaption text?

Also this w3schools example indicates that it should work like that. What am I missing? Thanks.


Solution

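  • It's a compatibility issue. <figcaption> requires IE9+. Even if you have the latest IE version installed, the IE COM object may still parse the HTML in compatibility mode, which is what happens here. In that mode the parser does not recognize <figcaption>, so the opening and closing tags become separate, empty elements and the caption text ends up as a plain text node between them; that is why it still shows up in $html.documentElement.outerHTML.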

    Insert the X-UA-Compatible meta tag to force the IE COM object to use the latest IE version:

    $htmlContent = @'
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    </head>
    <body>
        <figure>
            <img src="content/test.svg" alt="">
            <figcaption>Test caption.</figcaption>
        </figure>
    </body>
    </html>
    '@
    
    $html = New-Object -Com HTMLFile
    $html.IHTMLDocument2_write($htmlContent)
    
    $allTags = $html.all
    $allTags[8].OuterHtml   # <figcaption>Test caption.</figcaption>
    $allTags[8].InnerHtml   # Test caption.
    

    More info: Towards Internet Explorer 11 Compatibility
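
If the HTML comes from a file you would rather not edit by hand, the meta tag can also be injected just before the content is handed to the COM object. The following is only a minimal sketch: the function name Add-XUaCompatibleMeta is hypothetical (it is not part of PowerShell or of the answer above), and it assumes the markup contains an opening <head> tag.

```powershell
# Hypothetical helper (not from the original answer): inserts the
# X-UA-Compatible meta tag right after the opening <head> tag.
function Add-XUaCompatibleMeta {
    param([Parameter(Mandatory)][string]$Html)

    # Leave the markup untouched if the tag is already there.
    if ($Html -match 'http-equiv\s*=\s*["'']?X-UA-Compatible') {
        return $Html
    }

    $meta  = '<meta http-equiv="X-UA-Compatible" content="IE=Edge">'
    $regex = [regex]'(?i)<head(\s[^>]*)?>'

    # '$0' is the matched <head> tag; only the first match is replaced.
    # If there is no <head> tag, the input is returned unchanged.
    return $regex.Replace($Html, ('$0' + $meta), 1)
}

# Usage sketch (Windows only, needs the HTMLFile COM object):
# $html = New-Object -Com HTMLFile
# $html.IHTMLDocument2_write((Add-XUaCompatibleMeta -Html $htmlContent))
# $html.getElementsByTagName('figcaption') | ForEach-Object { $_.innerHTML }
```

The commented usage looks the element up by tag name instead of relying on a fixed index such as 8, which shifts whenever the markup changes. This assumes the method is exposed as getElementsByTagName on your system; on setups where write is only available as IHTMLDocument2_write, it may be exposed as IHTMLDocument3_getElementsByTagName instead.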