I have the following code:
$html = New-Object -ComObject "HTMLFile"
$source = Get-Content -Path $FilePath -Raw
try
{
    $html.IHTMLDocument2_write($source) 2> $null
}
catch
{
    $encoded = [Text.Encoding]::Unicode.GetBytes($source)
    $html.write($encoded)
}
$t = $html.getElementsByTagName("table") | Where-Object {
    $cells = $_.tBodies[0].rows[0].cells
    $cells[0].innerText -eq "Name" -and
    $cells[1].innerText -eq "Description" -and
    $cells[2].innerText -eq "Default Value" -and
    $cells[3].innerText -eq "Release"
}
The code works fine in Windows PowerShell 5.1, but in PowerShell (Core) 7, $_.tBodies[0].rows returns null.
So, how does one access the rows of an HTML table in PowerShell 7?
PowerShell (Core) 7, as of v7.4, does not come with a built-in HTML parser - and this may never change.
You must rely on a third-party solution, such as the PSParseHTML module, which wraps both the HTML Agility Pack[1] and the AngleSharp library. The former is used by default; the latter requires opt-in via -Engine AngleSharp. As for their respective DOMs (object models):
The HTML Agility Pack, which is used by default, works differently from the Internet Explorer-based one available in Windows PowerShell; it is similar to the XML DOM provided by the standard System.Xml.XmlDocument type ([xml])[2]; see the documentation and the sample code below.
AngleSharp, which requires opt-in via -Engine AngleSharp, is built upon the official W3C specification and therefore provides an HTML DOM as available in web browsers. Notably, this means that its .QuerySelector() and .QuerySelectorAll() methods can be used with the usual CSS selectors. See this answer for an example of its use.
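For comparison, here is a minimal sketch of the AngleSharp route. It assumes that ConvertFrom-Html accepts pipeline input with -Engine AngleSharp in the same way it does with the default engine, and that the returned document exposes AngleSharp's .TextContent property on elements; the variable names and the two-column sample table are illustrative only.

```powershell
# Install the module on demand.
if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
    Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
}

# Illustrative input: an HTML document with a 2-column table, as a single string.
$html = (Get-Item $HOME | Select-Object Name, Mode | ConvertTo-Html) -join "`n"

# Parse with the AngleSharp engine (assumption: pipeline input works as with the default engine).
$dom = $html | ConvertFrom-Html -Engine AngleSharp

# Find the table by its column names, using CSS selectors instead of XPath.
$table = $dom.QuerySelectorAll('table') | Where-Object {
    $headers = @($_.QuerySelectorAll('th')).TextContent
    $headers[0] -eq 'Name' -and $headers[1] -eq 'Mode'
}

# Extract the first data row's first column value.
$firstCell = $table.QuerySelector('td').TextContent
$firstCell
```

Because AngleSharp follows the browser parsing rules, selectors such as 'th' and 'td' behave as they would in a browser's developer console, which may make porting code written against the Internet Explorer-based DOM easier.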
Self-contained sample code that uses the HTML Agility Pack engine:
# Install the module on demand
if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
    Write-Verbose "Installing PSParseHTML module for the current user..."
    Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
}

# Create a sample HTML file with a table with 2 columns.
Get-Item $HOME | Select-Object Name, Mode | ConvertTo-Html > sample.html

# Parse the HTML file into an HTML DOM.
$htmlDom = Get-Content -Raw sample.html | ConvertFrom-Html

# Find a specific table by its column names, using an XPath
# query to iterate over all tables.
$table = $htmlDom.SelectNodes('//table') | Where-Object {
    $headerRow = $_.Element('tr') # or @($_.Elements('tr'))[0]
    # Filter by column names
    $headerRow.ChildNodes[0].InnerText -eq 'Name' -and
    $headerRow.ChildNodes[1].InnerText -eq 'Mode'
}

# Print the table's HTML text.
$table.InnerHtml

# Extract the first data row's first column value.
# Note: @(...) is required around .Elements() for indexing to work.
@($table.Elements('tr'))[1].ChildNodes[0].InnerText
A Windows-only alternative is to use the HTMLFile COM object, as shown in this answer, and as used in your own attempt - I'm unclear on why it didn't work in your specific case.
[1] Note that this answer originally built on a different PowerShell wrapper module for the HTML Agility Pack, PowerHTML; however, the PSParseHTML module is more actively maintained.
[2] Notably with respect to supporting XPath queries via the .SelectSingleNode() and .SelectNodes() methods, exposing child nodes via a .ChildNodes collection, and providing .InnerHtml / .OuterHtml / .InnerText properties. Instead of an indexer that supports child element names, methods .Element(<name>) and .Elements(<name>) are provided.
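The parallel with [xml] can be seen using the built-in type alone. The following is a small sketch against a hand-written, well-formed table (the markup and variable names are made up for illustration); it only demonstrates the shared member names and is not a substitute for an HTML parser, because [xml] rejects HTML that is not well-formed XML.

```powershell
# Parse a well-formed, XHTML-like table with the built-in [xml] type (System.Xml.XmlDocument).
[xml] $doc = @'
<html><body>
  <table>
    <tr><th>Name</th><th>Mode</th></tr>
    <tr><td>jdoe</td><td>d----</td></tr>
  </table>
</body></html>
'@

# XPath query for all rows, as with the HTML Agility Pack's .SelectNodes().
$rows = $doc.SelectNodes('//table/tr')

# Child nodes and text content are exposed via .ChildNodes and .InnerText.
$headerName = $rows[0].ChildNodes[0].InnerText

# .SelectSingleNode() returns only the first match of an XPath query.
$firstDataCell = $doc.SelectSingleNode('//table/tr[2]/td[1]').InnerText

$headerName      # Name
$firstDataCell   # jdoe
```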