powershellihtmldocument2

HTMLDocumentClass and getElementsByClassName not working


Last year I had powershell (v3) script that parsed HTML of one festival page (and generate XML for my Windows Phone app).

I also was asking a question about it here and it worked like a charm.

But when I run the script this year, it is not working. To be specific - the method getElemntsByClassName is not returning anything. I tried that method also on other web pages with no luck.

Here is my code from last year, that is not working now:

$tmpFile_bandInfo = "C:\band.txt"
Write-Host "Stahuji kapelu $($kap.Nazev) ..." -NoNewline    
Invoke-WebRequest http://www.colours.cz/ucinkujici/the-asteroids-galaxy-tour/ -OutFile $tmpFile_bandInfo
$content = gc $tmpFile_bandInfo -Encoding utf8 -raw
$ParsedHtml = New-Object -com "HTMLFILE"
$ParsedHtml.IHTMLDocument2_write($content)
$ParsedHtml.Close()
$bodyK = $ParsedHtml.body
$bodyK.getElementsByClassName("body four column page") # this returns NULL
$page = $page.item(0)
$aside = $page.getElementsByTagName("aside").item(0)
$img = $aside.getElementsByTagName("img").item(0)
$imgPath = $img.src

this is code I used to workaround this:

$sec = $bodyK.getElementsByTagName("section") | ? ClassName -eq "body four column page"
# but now I have no innerHTML, only the lonely tag SECTION
# so I am walking through siblings
$img = $sec.nextSibling.nextSibling.nextSibling.getElementsByTagName("img").item(0)
$imgPath = $img.src

This works, but this seems silly solution to me.
Anyone knows what I am doing wrong?


Solution

  • I actually solved this problem by abandoning Invoke-WebRequest cmdlet and by adopting HtmlAgilityPack.

    I transformed my former sequential HTML parsing into few XPath queries (everything stayed in powershell script). This solution is much more elegant and HtmlAgilityPack is real badass ;) It is really honour to work with project like this!