pythonweb-scrapingbeautifulsoupwikimedia-commons

How can I create a Python Script with BeautifulSoup on Windows to download the highest resolution of each picture in a WIkimedia Commons folder?


So, I'm a big fan of Gustave Doré, and I would like to download all his engravings from the Wikimedia Commons folders that are neatly organized.

So, given a Wikimedia Commons folder I need to download all the pictures in it in the highest resolution.

I started writing something, but I'm not that good, so it's just a template:

import os, requests, bs4

url = 'URL OF THE WIKIMEDIA COMMONS FOLDER'

os.makedirs('NAME OF THE FOLDER', exist_ok=True)
for n in range(NUMBER OF PICTURES IN THE PAGE - 1):
    print('I am downloading page number %s...' %(n+1))
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    #STUFF I STILL NEED TO ADD
    
print('Done')

For example, I would feed this as the URL of the folder:

Then I would like to click every link and go to the picture page, like this one:

And then download the 'original file' by clicking the link below the picture that says 'original file'. Except sometimes the pic has no higher resolution available, like in this case:

And it would just need to click the link below the picture to download it.

I am completely stuck, thanks in advance for your help!

Bonus points if the pic has the name stated in its page when saved

(e.g. in the second link the picture should be saved as 'Astonishment of the Crusaders at the Wealth of the East.jpg')


Solution

  • Hey big fan of Gustave Doré, here is a way you can do it

    r = requests.get('https://commons.wikimedia.org/wiki/Category:Crusades_by_Gustave_Dor%C3%A9')
    soup = BeautifulSoup(r.text, 'html.parser')
    links = [i.find('img').get('src') for i in soup.find_all('a', class_='image')]
    links = ['/'.join(i.split('/')[:-1]).replace('/thumb', '') for i in links]
    for l in links:
        im = requests.get(l)
        with open(l.split('/')[-1], 'wb') as f:
            f.write(im.content)