python python-3.x web-scraping urllib urlretrieve

urllib urlretrieve only saving final image in list of urls

I'm fairly new to Python. I have been trying to set up a very basic web scraper to help speed up my workday. It is supposed to download images from a section of a website and save them.

I have a list of urls and I am trying to use urllib.request.urlretrieve to download all the images.

The output location (savepath) updates so it adds 1 to the current highest number in the folder.

I've tried a bunch of different ways but urlretrieve only saves the image from the last url in the list. Is there a way to download all the images in the url list?

to_download=['url1','url2','url3','url4']

for t in to_download:
    urllib.request.urlretrieve(t, savepath)

This is the code I was trying to use to update the savepath every time:

def getNextFilePath(photos):
    highest_num = 0
    for f in os.listdir(photos):
        if os.path.isfile(os.path.join(photos, f)):
            file_name = os.path.splitext(f)[0]
            try:
                file_num = int(file_name)
                if file_num > highest_num:
                    highest_num = file_num
            except ValueError:
                'The file name "%s" is not an integer. Skipping' % file_name

    output_file = os.path.join(output_folder, str(highest_num + 1))
    return output_file

Solution

As suggested by @vks, you need to update savepath for each URL (otherwise every download is written to the same file, and only the last one survives). One way to do so is to use enumerate:

from urllib import request

to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']

for i, url in enumerate(to_download):
    save_path = f'website_{i}.txt'
    print(save_path)
    request.urlretrieve(url, save_path)

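which you could contract into a single line (although a list comprehension used only for its side effects is usually considered less readable than the plain loop):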

from urllib import request

to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']

[request.urlretrieve(url, f'website_{i}.txt') for i, url in enumerate(to_download)]

see:

Python3 doc: Python enumerate doc
Example of enumerate: enumerate example
Example of f' using a string with a {variable}': f string example
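
As a quick illustration of both, here is a minimal sketch. The URLs and the image_N.png naming are made up for the example. enumerate yields a counter alongside each item, and the f-string substitutes that counter into the file name:

```python
# Hypothetical URLs, only used to show how the file names are generated.
to_download = ['url1', 'url2', 'url3']

# enumerate pairs each url with a counter; start=1 makes the counter begin at 1.
save_paths = [f'image_{i}.png' for i, url in enumerate(to_download, start=1)]
print(save_paths)
```

Each URL gets its own distinct name, which is what prevents one download from overwriting another.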

FOR SECOND PART OF THE QUESTION:

Not sure what you are trying to achieve but:

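import os

def getNextFilePath(photos):
    # Keep only names made up entirely of digits (e.g. '12'); names with an extension are skipped
    file_list = os.listdir(photos)
    file_list = [int(s) for s in file_list if s.isdigit()]
    print(file_list)
    # default=0 avoids a ValueError when no numbered file exists yet
    max_id_file = max(file_list, default=0)
    print(f'max id:{max_id_file}')
    # Build the path inside the folder that was scanned
    output_file = os.path.join(photos, str(max_id_file + 1))
    print(f'output file path:{output_file}')
    return output_file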

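This should find all files whose names consist only of digits (IDs), take the highest ID, and return a new file path numbered max_id + 1. Note that str.isdigit() is false for a name with an extension such as '3.png', so such files are ignored here.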

I guess that this will replace the save_path in your example.

After quickly modifying the function above so that it returns the max ID rather than the path, the code below should be a working example using enumerate:

import os
from urllib import request

photo_folder = os.path.curdir


def getNextFilePath(photos):
    """Return the highest numeric file name in `photos` (0 if there is none)."""
    file_list = os.listdir(photos)
    print(file_list)
    # Strip the extension before testing, so '3.png' counts as ID 3
    file_list = [int(os.path.splitext(s)[0]) for s in file_list if os.path.splitext(s)[0].isdigit()]
    if not file_list:
        return 0
    print(file_list)
    max_id_file = max(file_list)
    return max_id_file


def download_pic(to_download):
    start_id = getNextFilePath(photo_folder)

    # Start numbering one past the highest existing ID so no file is overwritten
    for i, url in enumerate(to_download, start=start_id + 1):
        save_path = f'{i}.png'
        output_file = os.path.join(photo_folder, save_path)
        print(output_file)
        request.urlretrieve(url, output_file)

You should add exception handling, etc., but this seems to work, if I understood the question correctly.
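
For the exception handling, a minimal sketch could look like the following. The function name download_all and its start_id and retrieve parameters are hypothetical names introduced for this illustration (retrieve exists only so that the downloader can be swapped out when testing). It assumes you want to skip a failing URL and carry on with the rest, which may or may not suit your case:

```python
import os
from urllib import request


def download_all(to_download, folder, start_id=1, retrieve=request.urlretrieve):
    """Try every url in turn; return (saved_paths, failed_urls)."""
    saved, failed = [], []
    next_id = start_id
    for url in to_download:
        output_file = os.path.join(folder, f'{next_id}.png')
        try:
            retrieve(url, output_file)
        except (OSError, ValueError) as err:
            # urllib's URLError and HTTPError are subclasses of OSError;
            # ValueError is raised for a malformed url (e.g. no scheme).
            print(f'Skipping {url}: {err}')
            failed.append(url)
            continue
        saved.append(output_file)
        # Only advance the counter after a successful download,
        # so a failed url does not leave a gap in the numbering.
        next_id += 1
    return saved, failed
```

You would pass one more than the highest existing number as start_id, and can inspect the returned list of failed URLs afterwards to retry or report them.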