I'm parsing this HTML file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<figure>
<img src="content/test.svg" alt="">
<figcaption>Test caption.</figcaption>
</figure>
</body>
</html>
with PowerShell 5. While the approach below works well for all relevant tags (including, but not limited to, div, p, table, td, tr, ...), I cannot figure out where the "Test caption." text is located in the object.
$html = New-Object -Com "HTMLFile";
$html.IHTMLDocument2_write($htmlContent);
$allTags = $html.all;
$allTags[8].tagName # is FIGURE
$allTags[9].tagName # is /FIGURE
But $allTags[8].outerHTML contains only <FIGCAPTION>, and $allTags[9].outerHTML contains only </FIGCAPTION>. innerHTML is empty.
How can $html.documentElement.outerHTML still contain that figcaption text?
Also this w3schools example indicates that it should work like that. What am I missing? Thanks.
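It's a compatibility issue. <figcaption> requires IE9+. Even if you have the latest IE version installed, the IE COM object might still choose to parse the HTML in compatibility mode, which is what happens here. In that mode the parser does not treat unknown elements such as <figcaption> as containers, so the opening and closing tags show up as two separate, empty entries in the collection.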
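One way to see which mode the parser picked is to inspect the document mode after writing the HTML. The following is only a minimal sketch: it assumes the HTMLFile COM object exposes the documentMode property (available in MSHTML since IE8) on your machine, and the shortened $htmlContent is a stand-in for the file from the question.

```powershell
# Stand-in for the question's HTML (shortened for illustration).
$htmlContent = '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body><figure><figcaption>Test caption.</figcaption></figure></body></html>'

$html = New-Object -ComObject HTMLFile
$html.IHTMLDocument2_write($htmlContent)

# Assumption: documentMode is exposed on this COM object.
# A value below 9 suggests the document was parsed in a compatibility mode.
$mode = $html.documentMode
"Document mode: $mode"
```

If the reported value is below 9, HTML5 elements such as <figure> and <figcaption> are unlikely to be parsed as proper elements.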
Insert the X-UA-Compatible meta tag to force the IE COM object to use the latest IE version:
$htmlContent = @'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
</head>
<body>
<figure>
<img src="content/test.svg" alt="">
<figcaption>Test caption.</figcaption>
</figure>
</body>
</html>
'@
$html = New-Object -Com HTMLFile
$html.IHTMLDocument2_write($htmlContent)
$allTags = $html.all
$allTags[8].OuterHtml # <figcaption>Test caption.</figcaption>
$allTags[8].InnerHtml # Test caption.
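Because the position of an element in $html.all shifts whenever the markup changes (adding the meta tag above already moves things around), it may be more robust to look the element up by tag name instead of by a fixed index. A minimal sketch, reusing only the all collection and tagName property from the question; the variable name $captions is made up for this example:

```powershell
$htmlContent = @'
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
</head>
<body>
<figure>
<img src="content/test.svg" alt="">
<figcaption>Test caption.</figcaption>
</figure>
</body>
</html>
'@

$html = New-Object -ComObject HTMLFile
$html.IHTMLDocument2_write($htmlContent)

# Filter the whole collection by tag name instead of relying on index 8.
# -eq is case-insensitive in PowerShell, so the casing of tagName does not matter.
$captions = @($html.all | Where-Object { $_.tagName -eq 'FIGCAPTION' })

# Print the text of every caption that was found.
$captions | ForEach-Object { $_.innerText }
```

With the X-UA-Compatible tag in place, this should list the caption text; without it, the filter would match the empty opening-tag entry described in the question.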
More info: Towards Internet Explorer 11 Compatibility