python web-scraping beautifulsoup blob bloburls

Is it possible to web scrape a blob URL from a website in Python?


I am trying to extract a CSV file that is served through a blob URL on this page, using Beautiful Soup: https://worldpopulationreview.com/country-rankings/exports-by-country

Here's my code:

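import io
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://worldpopulationreview.com/country-rankings/exports-by-country'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
links = soup.find_all(download="csvData.csv")
# Fails here: links is a list of tags, not a URL
exports = pd.read_csv(io.StringIO(requests.get(links)))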

What I got was an exception, and the href contained no blob link. The blob URL does exist when I inspect the HTML in my browser.

Since the scraped href does not contain the blob URL, I decided to send a GET request to the blob URL directly instead of scraping it, but this exception appears:

requests.exceptions.InvalidSchema: No connection adapters were found for 'blob:https://worldpopulationreview.com/850ac28e-9cd9-46b6-9423-e96a0bd7e938'

Is there a way to web scrape blob URLs?


Solution

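  • Blob URLs are created only in the browser, usually by JavaScript (for example via URL.createObjectURL), and refer to data held in the browser's memory. They don't exist on the server at all, so you cannot download them with requests, which only handles http:// and https:// URLs (hence the InvalidSchema error).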

    You could run a script in the browser console to get the content. Here is an example of how to fetch a blob URL in JavaScript: https://stackoverflow.com/a/52410044/

    If you need to do this automatically, you could write a userscript, or use an automation tool such as AutoHotkey to click the download link automatically.
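
If you would rather stay in Python, one possible approach is to drive a real browser and run the JavaScript fetch inside the page, since that is the only place the blob URL resolves. The following is a minimal sketch under several assumptions, not a tested solution for this site: it uses Selenium (swapped in here for the userscript/AutoHotkey idea above), it assumes Chrome and a matching driver are available, and it assumes the download link already carries a blob: href once the page has loaded. The names FETCH_BLOB_JS, decode_data_url and download_blob_csv are hypothetical helpers invented for this illustration.

```python
import base64

# JavaScript run inside the page: fetch the blob URL and hand the content
# back to Python as a data URL ("data:<mime>;base64,<payload>").
FETCH_BLOB_JS = """
const url = arguments[0];
const done = arguments[arguments.length - 1];
fetch(url)
  .then(response => response.blob())
  .then(blob => {
    const reader = new FileReader();
    reader.onloadend = () => done(reader.result);
    reader.readAsDataURL(blob);
  })
  .catch(error => done("ERROR: " + error));
"""


def decode_data_url(data_url: str) -> bytes:
    """Return the raw bytes carried by a base64 data URL."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError(f"not a base64 data URL: {data_url[:40]!r}")
    return base64.b64decode(payload)


def download_blob_csv(page_url: str, link_selector: str) -> bytes:
    """Open page_url in Chrome and return the bytes behind the blob link."""
    # Imported here so decode_data_url works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(page_url)
        link = driver.find_element(By.CSS_SELECTOR, link_selector)
        blob_url = link.get_attribute("href")
        # Assumption: the href is already a blob: URL at this point.
        if not blob_url or not blob_url.startswith("blob:"):
            raise RuntimeError(f"no blob URL on the link: {blob_url!r}")
        # The script's last argument is Selenium's completion callback.
        result = driver.execute_async_script(FETCH_BLOB_JS, blob_url)
    finally:
        driver.quit()
    return decode_data_url(result)
```

With the selector taken from the question, a call might look like download_blob_csv('https://worldpopulationreview.com/country-rankings/exports-by-country', 'a[download="csvData.csv"]'), and the returned bytes can be passed to pd.read_csv(io.BytesIO(...)). If the site only creates the blob after a click or some other interaction, the href check will raise and the sketch would need an extra step to trigger that first.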