
How to parse the HTML of a website with PowerShell


I am trying to retrieve some information from a website: I want to find elements with a specific tag or class and return the text they contain (innerHTML). This is what I have so far:

$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)


foreach ($obj in $HTML.all) { 
    $obj.getElementsByClassName('some-class-name') 
}

I think the problem lies in converting the HTML into the HTML object, since I see a lot of undefined properties and empty results when I pipe the elements to Select-Object.

I have spent two days on this. Since parsing HTML with regex is such a big no-no, how am I supposed to parse HTML with PowerShell? Nothing seems to work.


Solution

  • If installing a third-party module is an option:


    The following self-contained example uses the PSParseHTML module with its AngleSharp engine to parse the home page of the English Wikipedia and extract all HTML elements whose class attribute contains vector-menu-content-list:

    # Install the PSParseHTML module on demand
    if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
      Write-Verbose "Installing PSParseHTML module for the current user..."
      Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
    }
    
    # Using the AngleSharp engine, parse the home page of the English Wikipedia
    # into an HTML DOM.
    $htmlDom = ConvertFrom-Html -Engine AngleSharp -Url https://en.wikipedia.org
    
    # Extract all HTML elements with a 'class' attribute value of 'vector-menu-content-list'
    # and output their text content (.TextContent)
    $htmlDom.QuerySelectorAll('.vector-menu-content-list').TextContent
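Since the question asks for the inner HTML rather than only the text, the same approach can be extended to process the matches one at a time. The following is a minimal sketch, not a definitive implementation: it assumes the PSParseHTML module is already installed, that the machine has network access, and that the objects returned by QuerySelectorAll are AngleSharp elements exposing the TagName, InnerHtml and TextContent properties. The $url and $className variables are placeholders to replace with your own values.

```powershell
# Assumption: the PSParseHTML module is already installed
# (see the Install-Module call in the example above).
# $url and $className are illustrative placeholders.
$url       = 'https://en.wikipedia.org'
$className = 'vector-menu-content-list'

# Parse the page into an HTML DOM using the AngleSharp engine.
$htmlDom = ConvertFrom-Html -Engine AngleSharp -Url $url

# The CSS selector '.<name>' matches every element whose class list contains <name>.
foreach ($element in $htmlDom.QuerySelectorAll(".$className")) {
  # Emit one object per match so the results can be piped to
  # Select-Object, Where-Object, Format-List, and so on.
  [pscustomobject]@{
    Tag       = $element.TagName              # element name as reported by the DOM
    InnerHtml = $element.InnerHtml            # markup nested inside the element
    Text      = $element.TextContent.Trim()   # text only, without markup
  }
}
```

Emitting custom objects instead of raw DOM nodes keeps the output limited to the properties of interest, which avoids the wall of undefined or empty properties described in the question.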