
How to parse an HTML table with PowerShell Core 7?


I have the following code:

    $html = New-Object -ComObject "HTMLFile"
    $source = Get-Content -Path $FilePath -Raw
    try
    {
        $html.IHTMLDocument2_write($source) 2> $null
    }
    catch
    {
        $encoded = [Text.Encoding]::Unicode.GetBytes($source)
        $html.write($encoded)
    }
    $t = $html.getElementsByTagName("table") | Where-Object {
        $cells = $_.tBodies[0].rows[0].cells
        $cells[0].innerText -eq "Name" -and
        $cells[1].innerText -eq "Description" -and
        $cells[2].innerText -eq "Default Value" -and
        $cells[3].innerText -eq "Release"
    }

The code works fine in Windows PowerShell 5.1, but in PowerShell Core 7, $_.tBodies[0].rows returns null.

So, how does one access the rows of an HTML table in PS 7?


Solution

  • PowerShell (Core) 7, as of v7.4, does not come with a built-in HTML parser - and this may never change.

    You must rely on a third-party solution, such as the PSParseHTML module, which wraps both the HTML Agility Pack[1] and the AngleSharp library. The former is used by default; the latter requires opt-in via -Engine AngleSharp. The two engines expose different DOMs (object models); the sample below relies on that of the HTML Agility Pack[2].


    Self-contained sample code that uses the HTML Agility Pack engine:

    # Install the module on demand
    if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
      Write-Verbose "Installing PSParseHTML module for the current user..."
      Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
    }
    
    # Create a sample HTML file with a table with 2 columns.
    Get-Item $HOME | Select-Object Name, Mode | ConvertTo-Html > sample.html
    
    # Parse the HTML file into an HTML DOM.
    $htmlDom = Get-Content -Raw sample.html | ConvertFrom-Html
    
    # Find a specific table by its column names, using an XPath
    # query to iterate over all tables.
    $table = $htmlDom.SelectNodes('//table') | Where-Object {
      $headerRow = $_.Element('tr') # or: @($_.Elements('tr'))[0]
      # Filter by column names
      $headerRow.ChildNodes[0].InnerText -eq 'Name' -and
        $headerRow.ChildNodes[1].InnerText -eq 'Mode'
    }
    
    # Print the table's HTML text.
    $table.InnerHtml
    
    # Extract the first data row's first column value.
    # Note: @(...) is required around .Elements() for indexing to work.
    @($table.Elements('tr'))[1].ChildNodes[0].InnerText
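
    Building on the sample above, the following is a minimal sketch of how a table found this way could be turned into PowerShell objects, one per data row. ConvertTo-TableObject is a hypothetical helper, not part of PSParseHTML; it assumes that the first row holds unique, non-empty column names and that the table contains no nested tables. It relies only on HTML Agility Pack members (.SelectNodes(), .ChildNodes, .Name, .InnerText) and on the ConvertFrom-Html pipeline usage shown above.

    ```powershell
    # Install the module on demand (same guard as in the sample above).
    if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
      Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
    }

    # Hypothetical helper (NOT part of PSParseHTML): converts an HTML Agility Pack
    # <table> node into one [pscustomobject] per data row.
    function ConvertTo-TableObject {
      param(
        [Parameter(Mandatory)] $TableNode
      )
      # './/tr' also finds rows wrapped in <thead> / <tbody>.
      # Assumption: no nested tables, whose rows would be matched too.
      $rowNodes = $TableNode.SelectNodes('.//tr')
      if ($null -eq $rowNodes) { return } # .SelectNodes() returns $null if nothing matches
      $rows = @($rowNodes)

      # Assumption: the first row holds the (unique, non-empty) column names.
      $names = @(
        $rows[0].ChildNodes | Where-Object Name -in 'th', 'td' |
          ForEach-Object { $_.InnerText.Trim() }
      )

      foreach ($row in $rows | Select-Object -Skip 1) {
        # Keep only cell elements; this skips whitespace-only text nodes.
        $cells = @($row.ChildNodes | Where-Object Name -in 'th', 'td')
        $props = [ordered] @{}
        for ($i = 0; $i -lt $names.Count; $i++) {
          # Decode entities such as &amp; in the cell text; missing cells become $null.
          $props[$names[$i]] = if ($i -lt $cells.Count) {
            [System.Net.WebUtility]::HtmlDecode($cells[$i].InnerText.Trim())
          }
        }
        [pscustomobject] $props
      }
    }

    # Usage with a small inline table (made-up sample data).
    $html = '<table><tr><th>Name</th><th>Mode</th></tr>' +
            '<tr><td>alice</td><td>d----</td></tr>' +
            '<tr><td>bob &amp; co</td><td>-a---</td></tr></table>'
    $dom = $html | ConvertFrom-Html
    $objects = @(ConvertTo-TableObject $dom.SelectSingleNode('//table'))
    $objects | Format-Table
    ```

    Filtering .ChildNodes by element name rather than indexing into it directly makes the helper tolerant of whitespace between cells, which the HTML Agility Pack represents as separate text nodes.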
    

    A Windows-only alternative is to use the HTMLFile COM object, as shown in this answer and as used in your own attempt; I'm unclear on why it didn't work in your specific case.


    [1] Note that this answer originally built on a different PowerShell wrapper module for the HTML Agility Pack, PowerHTML; however, PSParseHTML is more actively maintained.

    [2] The HTML Agility Pack's DOM notably supports XPath queries via the .SelectSingleNode() and .SelectNodes() methods, exposes child nodes via a .ChildNodes collection, and provides .InnerHtml / .OuterHtml / .InnerText properties. Instead of an indexer that supports child element names, it provides the methods .Element(<name>) and .Elements(<name>).