javascript python html web-scraping beautifulsoup

How to parse HTML hidden behind JS scripts


The FCC has a database with details about various broadcast licenses. Many of these licenses have pages like this one.

Most of the data on these pages (and related ones) can be scraped very easily with a combination of the standard requests library and BeautifulSoup4. You just scrape the HTML, target the data you want, and you're good to go.

I took the same approach to extracting this Spectrum and Market Area table (pictured below), but have run into a roadblock.

[Image: Table I want to extract from the page]

Though the individual table rows can be inspected with browser dev tools, when I scrape the HTML with something like this:

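import requests

url = "https://wireless2.fcc.gov/UlsApp/UlsSearch/leasesList.jsp?licKey=2591153"
output_file = "license_page_leases.html"

# Fetch the raw HTML exactly as the server sends it (no JavaScript is executed).
response = requests.get(url, timeout=30)
response.raise_for_status()

# Save the response so it can be inspected offline.
with open(output_file, "w", encoding="utf-8") as file:
    file.write(response.text)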

... no part of the table itself is downloaded - all I get is the JavaScript that appears to generate the table.
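
A quick way to confirm this is to count the tags in the saved HTML. This is only a rough sketch using the standard library: `count_tags` is a hypothetical helper name, and the sample string is a made-up stand-in for the downloaded page, not the real FCC markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


# Made-up stand-in for the downloaded page: a script is present, table rows are not.
sample = "<html><body><div id='grid'></div><script>buildTable();</script></body></html>"
counts = count_tags(sample)
print("table rows:", counts.get("tr", 0), "scripts:", counts.get("script", 0))
```

If the row count is zero while the browser's inspector shows rows, the table is being built client-side after the page loads.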

So my question: how do I scrape the data in this kind of table?

I've tried various similar ways of scraping this table.

I've also tried to see if there's some underlying query structure available that would let me make a request more directly to their DB, without success.

If there's an obvious solution I'm missing, I'd love to know, but I don't necessarily need anyone to solve this problem for me - I'm here because I need advice on how to research this structure. I don't know what to call it when the HTML is generated on the fly like this, so it's hard to research methods to grab it.

Thanks!


Solution

  • I have no doubt that a Selenium-based approach has merit, but @AKX's suggestion in the comments to pull the .jsonb file from the Network inspector was the simplest solution.

    I hadn't thought to do this, as this is non-standard behaviour for the webtool, from what I have seen so far.

    Doing this gave me access to a bunch of JSON data wrapped in eval(), i.e. eval({...}). Trivial from there.
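
    For anyone landing here later, the unwrapping step might look roughly like the sketch below. It assumes the text inside eval(...) is strict JSON; if it is a looser JavaScript object literal, json.loads will raise and a more tolerant parser would be needed. `unwrap_eval` and the sample payload are hypothetical, since the real field names are not shown above. In practice the input would be the response body fetched from whatever URL the Network inspector shows.

```python
import json


def unwrap_eval(text):
    """Strip an eval(...) wrapper and parse what is inside as JSON."""
    body = text.strip()
    # Tolerate a trailing semicolon after the closing parenthesis.
    if body.endswith(";"):
        body = body[:-1].rstrip()
    # Remove the wrapper only when both ends are present.
    if body.startswith("eval(") and body.endswith(")"):
        body = body[len("eval("):-1]
    return json.loads(body)


# Hypothetical payload standing in for the real response body.
sample = 'eval({"rows": [{"id": 1, "name": "example"}]})'
data = unwrap_eval(sample)
print(data["rows"][0]["name"])
```

    From there the parsed dict can be walked like any other JSON data, with no need for BeautifulSoup.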