python, python-3.x, web-scraping, urllib, urlretrieve

urllib urlretrieve only saving final image in list of urls


I'm fairly new to using Python. I have been trying to set up a very basic web scraper to help speed up my workday; it is supposed to download images from a section of a website and save them.

I have a list of URLs, and I am trying to use urllib.request.urlretrieve to download all the images.

The output location (savepath) updates so that the new file name is 1 plus the current highest number in the folder.

I've tried a bunch of different ways, but urlretrieve only saves the image from the last URL in the list. Is there a way to download all the images in the URL list?

import urllib.request

to_download = ['url1', 'url2', 'url3', 'url4']

for t in to_download:
    # savepath never changes inside the loop, so every download
    # overwrites the same file and only the last image survives
    urllib.request.urlretrieve(t, savepath)

This is the code I was trying to use to update savepath each time:

import os

def getNextFilePath(photos):
    highest_num = 0
    for f in os.listdir(photos):
        if os.path.isfile(os.path.join(photos, f)):
            file_name = os.path.splitext(f)[0]
            try:
                file_num = int(file_name)
                if file_num > highest_num:
                    highest_num = file_num
            except ValueError:
                print('The file name "%s" is not an integer. Skipping' % file_name)

    output_file = os.path.join(photos, str(highest_num + 1))
    return output_file

Solution

  • As suggested by @vks, you need to update savepath (otherwise you save each URL to the same file). One way to do so is to use enumerate:

    from urllib import request
    
    to_download = ['https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/']
    
    for i, url in enumerate(to_download):
        save_path = f'website_{i}.txt'
        print(save_path)
        request.urlretrieve(url, save_path)
    
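    Running this saves the four pages as website_0.txt, website_1.txt, website_2.txt and website_3.txt, so nothing gets overwritten.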
    

    You may want to contract the loop into a one-liner:

    from urllib import request
    
    to_download = ['https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/', 'https://edition.cnn.com/']
    
    [request.urlretrieve(url, f'website_{i}.txt') for i, url in enumerate(to_download)]
    
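    (A list comprehension used purely for its side effects builds and discards a throwaway list, so the explicit for loop above is generally considered the more idiomatic choice.)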


    For the second part of the question:

    Not sure what you are trying to achieve, but:

    def getNextFilePath(photos):
        file_list = os.listdir(photos)
        # keep only file names that are pure digits (no extension)
        file_list = [int(s) for s in file_list if s.isdigit()]
        print(file_list)
        max_id_file = max(file_list)
        print(f'max id:{max_id_file}')
        output_file = os.path.join(photos, str(max_id_file + 1))
        print(f'output file path:{output_file}')
        return output_file
    

    This will hopefully find all files whose names are digits (IDs), pick the highest ID, and return a new file name equal to max_id + 1.

    I guess that this will replace the save_path in your example.
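    For example, if the photos folder happened to contain files named 1, 2 and 7 (bare digits, no extensions; a purely hypothetical layout), the call would behave like this:

    path = getNextFilePath('photos')
    # first prints the IDs found, e.g. [1, 2, 7] (listdir order is arbitrary)
    # then prints: max id:7
    # then prints the joined path, e.g. photos/8
    # path is now 'photos/8'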

    Quickly modifying the above function so that it returns the max ID rather than the path, the code below is a working example using the iterator:

    import os
    from urllib import request

    photo_folder = os.path.curdir


    def getNextFilePath(photos):
        file_list = os.listdir(photos)
        print(file_list)
        # keep the numeric part of names such as "7.png"
        file_list = [int(os.path.splitext(s)[0]) for s in file_list
                     if os.path.splitext(s)[0].isdigit()]
        if not file_list:
            return 0
        print(file_list)
        return max(file_list)


    def download_pic(to_download):
        start_id = getNextFilePath(photo_folder)

        for i, url in enumerate(to_download):
            # number new files from max_id + 1 so the current highest
            # file is never overwritten
            save_path = f'{start_id + 1 + i}.png'
            output_file = os.path.join(photo_folder, save_path)
            print(output_file)
            request.urlretrieve(url, output_file)

    You should add exception handling etc., but this seems to work, if I understood correctly.
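    As a minimal sketch of that exception handling (reusing photo_folder and getNextFilePath from above; download_pic_safe is just an illustrative name), each download can be wrapped in a try/except so one bad URL does not abort the whole batch:

    import os
    from urllib import error, request

    def download_pic_safe(to_download):
        start_id = getNextFilePath(photo_folder)
        saved = 0
        for url in to_download:
            output_file = os.path.join(photo_folder, f'{start_id + 1 + saved}.png')
            try:
                request.urlretrieve(url, output_file)
                saved += 1  # only advance the counter on success
            except error.URLError as e:  # also covers HTTPError and ContentTooShortError
                print(f'Skipping {url}: {e}')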