powershell web-scraping

Loading A Public Webpage in PowerShell That Requires JS and Blocks Developer Tools


Can someone help me find a way to load a public web page that requires JavaScript and blocks access from developer tools? I had an automated process that worked as follows.

$TdyDate = $(get-date -f yyyyMMdd)
$wsjurl = "https://www.wsj.com/print-edition/$TdyDate/frontpage"
$wsjweb = Invoke-WebRequest -Uri $wsjurl -UseBasicParsing

This recently started generating "Please enable JS and disable any ad blocker" errors.

Based on this Stack Overflow post, I tried the following, which gets me past these errors but only pulls down an "Access Blocked" landing page instead of the full web page that renders in my browser.

Set-Alias msedge 'C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe'
msedge --headless --dump-dom --disable-gpu $wsjurl

If anyone could help me figure out a way around this, it would be greatly appreciated. The web page I'm targeting is publicly accessible.


Solution

  • The following code snippet could help:

    # The URL is keyed by date; on a Sunday, use Saturday's date instead.
    $wsjDate = Get-Date
    if ( $wsjDate.DayOfWeek -eq [DayOfWeek]::Sunday ) {
        $TdyDate = "{0:yyyyMMdd}" -f $wsjDate.AddDays(-1)  # Sunday -> Saturday
    } else {
        $TdyDate = "{0:yyyyMMdd}" -f $wsjDate
    }

    $wsjurl = "https://www.wsj.com/print-edition/$TdyDate/frontpage"
    # -Method Options sends an HTTP OPTIONS request instead of the default GET.
    $wsjweb = Invoke-WebRequest -Uri $wsjurl -Method Options -UseBasicParsing
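
    The Sunday fallback above can be wrapped in a small function so it can be checked against fixed dates without touching the network. This is only a sketch: the function name Get-PrintEditionUrl and its -Date parameter are hypothetical and not part of the original snippet.

    ```powershell
    # Hypothetical helper (not from the original answer): builds the
    # print-edition URL for a given date, using Saturday's date on Sundays.
    function Get-PrintEditionUrl {
        param(
            [datetime]$Date = (Get-Date)
        )

        if ($Date.DayOfWeek -eq [DayOfWeek]::Sunday) {
            $Date = $Date.AddDays(-1)   # Sunday -> Saturday
        }

        # Format as yyyyMMdd, matching the URL pattern used in the question.
        $stamp = $Date.ToString('yyyyMMdd', [cultureinfo]::InvariantCulture)
        "https://www.wsj.com/print-edition/$stamp/frontpage"
    }

    # 2024-06-09 was a Sunday, so the URL is built from 20240608.
    Get-PrintEditionUrl -Date ([datetime]'2024-06-09')
    ```

    Passing the date as a parameter, rather than calling Get-Date inside the logic, makes the weekend case easy to verify on any day of the week.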
    

    Explanation:

    The response headers in $wsjweb.Headers may shed light on the problem (see the X-XSS-Protection and X-Content-Type-Options entries):

    $wsjweb.Headers # truncated

    Key                       Value
    ---                       -----
    …
    X-XSS-Protection          1; mode=block
    X-Content-Type-Options    nosniff
    …
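
    To look at only the headers named above rather than the full dump, the header collection can be filtered by name. A minimal sketch, assuming the headers are exposed as a dictionary (as $wsjweb.Headers is); the function name Select-Header is hypothetical, and the sample table stands in for a real response, reusing the two values shown above plus one made-up placeholder entry.

    ```powershell
    # Hypothetical helper: returns the requested entries from a header
    # dictionary (for example $wsjweb.Headers) as Key/Value objects.
    function Select-Header {
        param(
            [Parameter(Mandatory)] [System.Collections.IDictionary]$Headers,
            [Parameter(Mandatory)] [string[]]$Name
        )

        foreach ($key in $Headers.Keys) {
            # -contains compares strings case-insensitively in PowerShell.
            if ($Name -contains $key) {
                [pscustomobject]@{ Key = $key; Value = $Headers[$key] }
            }
        }
    }

    # Stand-in for a real response; 'Example-Header' is a made-up placeholder.
    $sample = [ordered]@{
        'X-XSS-Protection'       = '1; mode=block'
        'X-Content-Type-Options' = 'nosniff'
        'Example-Header'         = 'placeholder'
    }

    Select-Header -Headers $sample -Name 'X-XSS-Protection', 'X-Content-Type-Options'
    ```

    With a live response, the same call would be Select-Header -Headers $wsjweb.Headers -Name 'X-XSS-Protection', 'X-Content-Type-Options'.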