python, web-scraping, beautifulsoup, image, download

How to web-scrape images that do not have a source?


Link: https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0

This website shows its questions as images, which I need to scrape. However, I cannot get a link to the image files: my script only outputs links to some loading GIFs. When I looked at the page source, the question images did not even have a "src" attribute. You can see how the website works at the link above. How can I download all these images?

from bs4 import BeautifulSoup
import requests
import os

url = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

images = soup.find_all('img')

for image in images:
    # .get() returns None instead of raising KeyError when an <img> has no src
    link = image.get('src')

    print(link)

Solution

  • The question IDs are embedded in the page itself, inside the onclick attribute of each question link. Try extracting the image path, which contains the ID, with the re (regex) module.

    import re
    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
    }
    
    URL = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
    BASE_URL = "https://www.exam-mate.com"
    
    # Send the browser-like User-Agent defined above with the request
    soup = BeautifulSoup(requests.get(URL, headers=headers).content, "html.parser")
    
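    # The first cell of each table row holds the link to a question
    for tag in soup.select("td:nth-of-type(1) a"):
        # Pull the image path (which contains the question id) out of onclick
        match = re.search(r"/questions.*\.png", tag.get("onclick", ""))
        if match:  # skip links without an image path
            print(BASE_URL + match.group())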
    

    Output:

    https://www.exam-mate.com/questions/1240/1362/1240_q_1362_1_1.png
    https://www.exam-mate.com/questions/1240/1363/1240_q_1363_2_1.png
    https://www.exam-mate.com/questions/1240/1364/1240_q_1364_3_1.png
    https://www.exam-mate.com/questions/1240/1365/1240_q_1365_4_1.png
    https://www.exam-mate.com/questions/1240/1366/1240_q_1366_5_1.png
    ... and so on
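
The code above only prints the image URLs, while the question asks how to download them. A minimal sketch of that last step follows. It uses only the standard library (urllib.request) so it runs without extra installs; requests.get(url, headers=headers).content would do the same job. The helper names extract_image_paths, local_name and download_images, the shortened User-Agent string and the "images" output folder are all assumptions made for illustration, not part of the original answer.

```python
import os
import re
import urllib.request
from urllib.parse import urlsplit

BASE_URL = "https://www.exam-mate.com"
# Assumption: any browser-like User-Agent, as in the headers dict above
HEADERS = {"User-Agent": "Mozilla/5.0"}


def extract_image_paths(onclick_values):
    """Return the /questions/...png path found in each onclick string."""
    paths = []
    for value in onclick_values:
        match = re.search(r"/questions.*\.png", value)
        if match:  # skip links that carry no image path
            paths.append(match.group())
    return paths


def local_name(url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlsplit(url).path)


def download_images(urls, out_dir="images"):
    """Fetch each URL and write the raw bytes into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        request = urllib.request.Request(url, headers=HEADERS)
        with urllib.request.urlopen(request, timeout=30) as response:
            data = response.read()
        with open(os.path.join(out_dir, local_name(url)), "wb") as fh:
            fh.write(data)
```

To use it with the soup object from the answer, collect the onclick values with [a.get("onclick", "") for a in soup.select("td:nth-of-type(1) a")], pass them to extract_image_paths, prefix each result with BASE_URL, and hand the list to download_images. Each file keeps the name it has on the server, for example 1240_q_1362_1_1.png.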