python web-scraping data-mining

Scraping slides from SlideShare


I want to scrape slides from https://slideshare.net, but when I run a for loop over all the slides, only the first slide downloads and the others are just blank JPG files. I don't know why. I also tried to scrape each image separately, but again only the first slide downloads and the others are blank.

I expected to get all the slides so that I could then convert them into various formats such as PDF, ZIP, and PPT.


Solution

  • This example shows how to get the URLs of all slides as JPG images (you can then convert the JPGs into a PDF or another format):

    import json
    
    import requests
    from bs4 import BeautifulSoup
    
    # url of slideshow:
    url = "https://www.slideshare.net/slideshow/2024-state-of-marketing-report-by-hubspot/266319371"
    # url = "https://www.slideshare.net/slideshow/image-cryptography-using-rsa-algorithm/249768975"
    
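    # the page embeds its data as JSON inside the #__NEXT_DATA__ element
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    data = json.loads(soup.select_one("#__NEXT_DATA__").text)
    
    # metadata needed to build the image URLs
    slides = data["props"]["pageProps"]["slideshow"]["slides"]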
    
    # URL template with "{}" as the slide number; uses the last entry in imageSizes
    img_url = (
        slides["host"]
        + "/"
        + slides["imageLocation"]
        + "/"
        + str(slides["imageSizes"][-1]["quality"])
        + "/"
        + slides["title"]
        + "-{}-"
        + str(slides["imageSizes"][-1]["width"])
        + ".jpg"
    )
    
    for i in range(1, data["props"]["pageProps"]["slideshow"]["totalSlides"] + 1):
        print(img_url.format(i))
    

    Prints:

    https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-1-2048.jpg
    https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-2-2048.jpg
    
    ...
    
    https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-42-2048.jpg
    https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-43-2048.jpg
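
Once you have the URL template, downloading the slides and packing them into a ZIP could look roughly like the sketch below. It is only a sketch under some assumptions: the helper names (build_slide_urls, download_slides, write_zip) and the slide-001.jpg file naming are made up for illustration, and the download uses the standard library's urllib, which the image host may or may not accept without extra headers. Passing your own fetch function (for example one built on requests.get(...).content) is a drop-in alternative. Empty responses raise an error so that blank files are not saved silently.

```python
import urllib.request
import zipfile


def build_slide_urls(url_template, total_slides):
    """Return one image URL per slide, numbered from 1 (hypothetical helper)."""
    return [url_template.format(i) for i in range(1, total_slides + 1)]


def fetch_with_urllib(url):
    """Download a URL and return its body as bytes (HTTP errors raise)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_slides(urls, fetch=fetch_with_urllib):
    """Fetch every URL and return a list of (filename, bytes) pairs."""
    files = []
    for number, slide_url in enumerate(urls, start=1):
        payload = fetch(slide_url)
        if not payload:
            # fail loudly instead of writing a blank image
            raise ValueError(f"empty response for slide {number}: {slide_url}")
        files.append((f"slide-{number:03d}.jpg", payload))
    return files


def write_zip(files, zip_path):
    """Write the (filename, bytes) pairs into a single ZIP archive."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for name, payload in files:
            archive.writestr(name, payload)
```

With img_url and data from the example above, usage would be something like write_zip(download_slides(build_slide_urls(img_url, data["props"]["pageProps"]["slideshow"]["totalSlides"])), "slides.zip"). For a PDF, one option is Pillow, whose Image.save accepts save_all=True and append_images to write a multi-page PDF.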