I want to scrape slides from https://slideshare.net, but when I run a for loop over all the slides, only the first slide downloads and the others are just blank JPG files. I don't know why. I also tried scraping each image separately, but again only the first slide downloads and the others are blank.
I expected to get all the slides so that I could then convert them into various formats such as PDF, ZIP, and PPT.
Here is an example of how to get the URLs of all slides as JPGs (you can then convert these JPGs into a PDF or another format):
import json
import requests
from bs4 import BeautifulSoup
# url of slideshow:
url = "https://www.slideshare.net/slideshow/2024-state-of-marketing-report-by-hubspot/266319371"
# url = "https://www.slideshare.net/slideshow/image-cryptography-using-rsa-algorithm/249768975"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)
slides = data["props"]["pageProps"]["slideshow"]["slides"]
img_url = (
    slides["host"]
    + "/"
    + slides["imageLocation"]
    + "/"
    + str(slides["imageSizes"][-1]["quality"])
    + "/"
    + slides["title"]
    + "-{}-"
    + str(slides["imageSizes"][-1]["width"])
    + ".jpg"
)
for i in range(1, data["props"]["pageProps"]["slideshow"]["totalSlides"] + 1):
    print(img_url.format(i))
Prints:
https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-1-2048.jpg
https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-2-2048.jpg
...
https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-42-2048.jpg
https://image.slidesharecdn.com/1707826910254-240215090210-009c7a2b/75/2024-State-of-Marketing-Report-by-Hubspot-43-2048.jpg
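Once you have the URLs, saving the images is a separate step. The following is only a minimal sketch of one way to do it, not part of the code above: the names `fetch_bytes` and `download_slides`, the `slide-001.jpg` naming scheme, the `slides` output folder, and the browser-like `User-Agent` header are all my own assumptions. It uses the standard library's `urllib` so it has no extra dependencies; `requests.get(url).content` should work just as well.

```python
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_bytes(url, timeout=30):
    """Return the raw bytes at `url`.

    The User-Agent header is an assumption; it may not be needed.
    """
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()


def download_slides(urls, out_dir="slides", fetch=fetch_bytes):
    """Save each URL as slide-001.jpg, slide-002.jpg, ... and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, url in enumerate(urls, start=1):
        data = fetch(url)
        if not data:
            # An empty body would be written as a blank file, so skip it.
            print(f"slide {number}: empty response, skipped")
            continue
        path = out / f"slide-{number:03d}.jpg"
        path.write_bytes(data)
        paths.append(path)
    return paths
```

With the variables from the example above, you could call it as `download_slides(img_url.format(i) for i in range(1, total + 1))`, where `total` is the `totalSlides` value. The `fetch` parameter is there only so the download step can be swapped out, for example for a `requests`-based function.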