python web-scraping pdf-scraping

How to webscrape PDFs that are hidden under the selection option?


I am trying to download more than 100 PDFs from a website using Python. However, those PDFs are hidden under a selection option. For example:

Then, if I choose Option 1, I see something like this:

Once I click on, e.g., "Clickable link to File 1", a picture pops up with a "View PDF" option in the top-right corner of the pop-up. How do I download the PDFs in a loop for each of the files under Option 1? I am new to web scraping, and your help will be greatly appreciated.

Thanks!


Solution

  • It seems that you can construct each PDF URL automatically from the link's identifier (the data-dguid attribute). For example:

    import requests
    from bs4 import BeautifulSoup
    
    # Results page that lists the links for the chosen option
    url = "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0"
    # PDF URL pattern: id1 is a 5-character slice of the identifier, id2 is the full identifier
    map_url = "https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/{id1}/{id2}.pdf"
    
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    # Every link that opens a map carries a data-dguid attribute
    for a in soup.select("a[data-dguid]"):
        id_ = a["data-dguid"]
        # e.g. "2016S05100022"[4:9] == "S0510"
        m = map_url.format(id1=id_[4:9], id2=id_)
        print("{:<60} {}".format(a["data-geoname"], m))
    

    Prints:

    
    ...
    
    Map: Arthur [Population center], Ontario                     https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100022.pdf
    Map: Atikokan [Population center], Ontario                   https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100028.pdf
    Map: Attawapiskat 91A [Population center], Ontario           https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101497.pdf
    Map: Aylmer [Population center], Ontario                     https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100030.pdf
    Map: Ayr [Population center], Ontario                        https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100031.pdf
    Map: Azilda [Population center], Ontario                     https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101498.pdf
    Map: Ballantrae [Population center], Ontario                 https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101370.pdf
    Map: Barrie [Population center], Ontario                     https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100043.pdf
    Map: Barry's Bay [Population center], Ontario                https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05100044.pdf
    Map: Bath [Population center], Ontario                       https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/S0510/2016S05101403.pdf
    
    ...
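
The snippet above only prints the URLs. To actually save the files in a loop, something along the following lines might work. This is a minimal sketch, not a tested solution for this site: the helper names `pdf_url_from_dguid` and `download_pdfs`, the `maps` output folder, and the one-second delay are all assumptions made for illustration, and the download step uses only the standard library (`requests.get(...).content` would do the same job).

```python
import shutil
import time
import urllib.request
from pathlib import Path

# URL pattern taken from the answer above.
MAP_URL = "https://www12.statcan.gc.ca/census-recensement/geo/maps-cartes/pdf/{id1}/{id2}.pdf"


def pdf_url_from_dguid(dguid):
    """Build the PDF URL for one data-dguid value, e.g. '2016S05100022'."""
    return MAP_URL.format(id1=dguid[4:9], id2=dguid)


def download_pdfs(dguids, out_dir="maps", delay=1.0):
    """Hypothetical helper: save one PDF per identifier into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for dguid in dguids:
        target = out / f"{dguid}.pdf"
        if not target.exists():  # skip files fetched on an earlier run
            tmp = out / f"{dguid}.pdf.part"
            url = pdf_url_from_dguid(dguid)
            # urlopen raises an HTTPError for 4xx/5xx responses
            with urllib.request.urlopen(url, timeout=30) as resp, open(tmp, "wb") as fh:
                shutil.copyfileobj(resp, fh)
            tmp.replace(target)  # rename only after a complete download
            time.sleep(delay)  # assumed pause between requests, to go easy on the server
        saved.append(target)
    return saved


# Identifiers copied from the printed output above; in practice they would
# come from the soup.select("a[data-dguid]") loop.
sample_ids = ["2016S05100022", "2016S05100028"]
for dguid in sample_ids:
    print(pdf_url_from_dguid(dguid))
```

Calling `download_pdfs(sample_ids)` would then write `2016S05100022.pdf` and `2016S05100028.pdf` into the `maps` folder, assuming the server returns the files at those URLs. Passing the full list of `data-dguid` values collected from the page should cover every file under the chosen option.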