I'm fairly new to using Python. I have been trying to set up a very basic web scraper to help speed up my workday, it is supposed to download images from a section of a website and save them.
I have a list of urls and I am trying to use urllib.request.urlretrieve
to download all the images.
The output location (savepath
) updates so it adds 1 to the current highest number in the folder.
I've tried a bunch of different ways but urlretrieve
only saves the image from the last url in the list. Is there a way to download all the images in the url list?
to_download=['url1','url2','url3','url4']
for t in to_download:
urllib.request.urlretrieve(t, savepath)
This is the code I was trying to use to update the savepath
every time
def getNextFilePath(photos):
highest_num = 0
for f in os.listdir(photos):
if os.path.isfile(os.path.join(photos, f)):
file_name = os.path.splitext(f)[0]
try:
file_num = int(file_name)
if file_num > highest_num:
highest_num = file_num
except ValueError:
'The file name "%s" is not an integer. Skipping' % file_name
output_file = os.path.join(output_folder, str(highest_num + 1))
return output_file
as suggested by @vks, you need to update savepath (otherwise you save each url onto the same file). One way to do so, is to use enumerate:
from urllib import request
to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']
for i, url in enumerate(to_download):
save_path = f'website_{i}.txt'
print(save_path)
request.urlretrieve(url, save_path)
which you may want to contract into:
from urllib import request
to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']
[request.urlretrieve(url, f'website_{i}.txt') for i, url in enumerate(to_download)]
see:
FOR SECOND PART OF THE QUESTION:
Not sure what you are trying to achieve but:
def getNextFilePath(photos):
file_list = os.listdir(photos)
file_list = [int(s) for s in file_list if s.isdigit()]
print(file_list)
max_id_file = max(file_list)
print(f'max id:{max_id_file}')
output_file = os.path.join(output_folder, str(max_id_file + 1))
print(f'output file path:{output_file}')
return output_file
this will hopefully find all files that are named with digits (IDs), and find the highest ID, and return a new file name as a max_id+1
I guess that this will replace the save_path in your example.
Which quickly coding, AND MODIFYING above function, so that it returns the max_id and not the path. The bellow code be a working example using the iterrator:
import os
from urllib import request
photo_folder = os.path.curdir
def getNextFilePath(photos):
file_list = os.listdir(photos)
print(file_list)
file_list = [int(os.path.splitext(s)[0]) for s in file_list if os.path.splitext(s)[0].isdigit()]
if not file_list:
return 0
print(file_list)
max_id_file = max(file_list)
#print(f'max id:{max_id_file}')
#output_file = os.path.join(photo_folder, str(max_id_file + 1))
#print(f'output file path:{output_file}')
return max_id_file
def download_pic(to_download):
start_id = getNextFilePath(photo_folder)
for i, url in enumerate(to_download):
save_path = f'{i+start_id}.png'
output_file = os.path.join(photo_folder, save_path)
print(output_file)
request.urlretrieve(url, output_file)
You should add handling exception etc, but this seems to be working, if I understood correctly.