I'm using a Python multiprocessing pool to download thousands of images and process them with PIL.
Everything works as it should, except when an image downloads corrupt; then PIL throws an error.
I'm looking for advice on how to re-run the pool, or maybe just re-download the one bad image rather than the whole pool (the total data per pool is around 15 MB).
I check that the returned pool data array is the expected length, but the next step still throws the error because the image is corrupt.
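Something like this check before pasting is roughly what I'm after (just a sketch, assuming a recent Pillow; tile_is_valid is a name I made up):

from io import BytesIO
from PIL import Image, UnidentifiedImageError

def tile_is_valid(image_bytes):
    # Sketch only. verify() walks the file structure without fully
    # decoding it and raises if the data is truncated or unrecognised.
    try:
        Image.open(BytesIO(image_bytes)).verify()
        return True
    except (UnidentifiedImageError, OSError, SyntaxError):
        return False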
Pool code
pool = multiprocessing.Pool(multiprocessing.cpu_count())
func = partial(url_downloader, map_id)
data = pool.map(func, url_list)
pool.close()
pool.join()
if len(data) == len(url_list):
    for d in data:
        image = Image.open(BytesIO(d[0]))
        dst.paste(image, (d[1], d[2]))
else:
    helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, 'data size mismatch, skipping'))
    return
exif_data = dst.getexif()
# https://www.awaresystems.be/imaging/tiff/tifftags/extension.html
# 270 ImageDescription - A string that describes the subject of the image.
# 269 DocumentName - The name of the document from which this image was scanned.
# 285 PageName - The name of the page from which this image was scanned.
# Note: these TIFF tag numbers are decimal, so the key is 269, not 0x269.
exif_data[269] = str(helpers.normalizefilename(page_meta[0]))
dst.save(os.path.join(image_folder, master_image_name), exif=exif_data)
helpers.write_to_file(os.path.join(os.getcwd(), 'index.txt'), 'a+', index_text)
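To check the tag actually gets written, I can read it back the same way (sketch; the path is a placeholder):

from PIL import Image

img = Image.open('master.tif')  # placeholder for the saved master image
print(img.getexif().get(269))   # 269 = DocumentName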
Download function
def url_downloader(map_id, url):
    header = {"User-Agent": "Mozilla/5.0 (X11; CrOS "
                            "x86_64 12871.102.0) "
                            "AppleWebKit/537.36 (KHTML, "
                            "like Gecko) "
                            "Chrome/81.0.4044.141 "
                            "Safari/537.36"}
    try:
        response = requests.get(url[0], headers=header)
        if response.status_code == 200:
            image_data = response.content
            return [image_data, url[1], url[2]]
    except requests.exceptions.RequestException as e:
        helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
        return
Error as requested
Traceback (most recent call last):
  File "/home/james/mapgrabber/./map-grabber.py", line 291, in <module>
    main()
  File "/home/james/mapgrabber/./map-grabber.py", line 69, in main
    auto_map_grabber(save_path, conn)
  File "/home/james/mapgrabber/./map-grabber.py", line 166, in auto_map_grabber
    map_builder(m[1], save_path, conn)
  File "/home/james/mapgrabber/./map-grabber.py", line 247, in map_builder
    image = Image.open(BytesIO(d[0]))
TypeError: 'NoneType' object is not subscriptable
Edit:
For now I have added a simple try/except; maybe I should add a limit on the retries? I'm guessing it's usually just a single bad download, so this should suffice.
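Something like this per-tile retry wrapper is what I have in mind (just a sketch; download_with_retries and max_retries are my own names, not existing code):

def download_with_retries(map_id, url, max_retries=3):
    # Re-attempt a single tile a limited number of times instead of
    # re-running the whole pool; url_downloader returns None on failure.
    for attempt in range(max_retries):
        result = url_downloader(map_id, url)
        if result is not None:
            return result
    return None

# used in place of url_downloader when building the pool function:
# func = partial(download_with_retries, map_id)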
Edit 2:
I have tested this further by saving the tiles into a directory for troubleshooting. I first went down the route of checking the tile sizes, as I thought the downloads were failing, but on checking the directory of tiles I can see that all the images download correctly; they just sometimes fail to be pasted correctly onto the larger image, about 1 in 20 or so. I wonder if I'm causing memory issues and a glitch somewhere. Checking the image size or validity can't help, as there seem to be no issues there, and if there were, I'd catch them with my requests response check.
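The troubleshooting dump was along these lines (sketch; the directory name and file extension are arbitrary):

debug_dir = os.path.join(os.getcwd(), 'tiles')
os.makedirs(debug_dir, exist_ok=True)
for d in data:
    # d is [image_bytes, x_offset, y_offset] from url_downloader
    with open(os.path.join(debug_dir, 'tile_{}_{}.png'.format(d[1], d[2])), 'wb') as f:
        f.write(d[0])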
Current code
pool = multiprocessing.Pool(multiprocessing.cpu_count())
func = partial(url_downloader, map_id)
data = pool.map(func, url_list)
pool.close()
pool.join()
for d in data:
    try:
        image = Image.open(BytesIO(d[0]))
        dst.paste(image, (d[1], d[2]))
        image.close()
    except Exception as e:
        helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
        map_builder(map_id, save_path, conn)
dst is the main image, created earlier in the script using the total dimensions of the final image; each piece is then pasted in based on its coordinates.
It works perfectly most of the time; I just can't seem to find the reason for the missing tiles.
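Roughly like this, with placeholder dimensions (the real values come from the total image dimensions computed earlier):

total_width, total_height = 4096, 4096  # placeholders
dst = Image.new('RGB', (total_width, total_height))
# each downloaded tile is then pasted at its (x, y) offset:
# dst.paste(tile_image, (x_offset, y_offset))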
Edit 3, in response to Booboo
map_id is used when writing the log file on an exception, which never occurs; if the download fails, it is generally caught much further up in the function, when it realises there is an issue with the URL.
try:
    response = requests.get(url[0], headers=header)
    if response.status_code == 200:
        image_data = response.content
        return [image_data, url[1], url[2]]
    else:
        return [None, url[1], url[2]]
except requests.exceptions.RequestException as e:
    helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
    return [None, url[1], url[2]]
On point 2 of the answer:
"On a failed paste operation, you execute map_builder(map_id, save_path, conn). This is meaningless to us readers since neither the function nor its arguments are defined anywhere. Also, this is executed after an exception has already occurred and therefore cannot be the cause of your problems. So this is something that you would not need to include in a minimal, reproducible example."
Sorry, I have not added the whole function, but the pool code is part of map_builder, so on a failed image build it currently runs the whole function again, over and over. I need to add a maximum number of retries to this, as sketched below. As I said, though, I've been capturing the individual tiles into folders and they are all present and accounted for, yet I was getting random black squares where the paste operation had failed.
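Something along these lines is what I mean by a maximum retry count (sketch only; the retries parameter and the limit are made up):

MAX_BUILD_RETRIES = 3

def map_builder(map_id, save_path, conn, retries=0):
    try:
        pass  # pool download + paste logic from above goes here
    except Exception as e:
        helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
        if retries < MAX_BUILD_RETRIES:
            # Re-run the whole build, but stop after a bounded number
            # of attempts instead of recursing forever.
            map_builder(map_id, save_path, conn, retries + 1)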
Edit 4
While doing other server work and file copies, I was getting crashes on the Samba server where my image files get saved. The Samba server runs in a VM; I have increased the resources available to it and have had no crashes or bad images since. I am monitoring it and will see how it goes.
Update
If I understand your post and comments correctly, you claim that you have no problem downloading all the requested images and you say it's only when you attempt to paste the downloaded image into an existing image that you get one or more exceptions. But I question this. The stack trace that you posted shows in part:
File "/home/james/mapgrabber/./map-grabber.py", line 247, in map_builder
image = Image.open(BytesIO(d[0]))
TypeError: 'NoneType' object is not subscriptable
This strongly suggests that d, the return value from url_downloader, is None. But that function only returns None when the status code from attempting to download an image is not 200, that is, when the download itself has failed.
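The exception is easy to reproduce in isolation:

>>> d = None
>>> d[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not subscriptable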
Regardless of whether your problem is a failed download or a download that succeeded but could not be pasted, I strongly doubt that retrying the download and paste operations with the URL arguments that failed will succeed on a subsequent attempt. You need to determine why the download did not succeed (erroneous URL?) or why the paste operation failed (corrupted image? erroneous coordinates being used for pasting? something else?). And if you need help in determining this, you need to share the exceptions you are getting and a minimal, reproducible example. Your code as posted does not make sense and raises some questions, such as:

1. Why is map_id being passed to url_downloader when it does not seem to be used?
2. On a failed paste operation, you execute map_builder(map_id, save_path, conn). This is meaningless to us readers since neither the function nor its arguments are defined anywhere. Also, this is executed after an exception has already occurred and therefore cannot be the cause of your problems. So this is something that you would not need to include in a minimal, reproducible example.

If you did want to retry failed operations some number of times by collecting the images that could not be downloaded or pasted, then the following code is how I would do it. But again, I don't expect it to have any positive effect other than indicating to you more definitively what is failing. For that purpose you should set N_ATTEMPTS to 1:
import multiprocessing
import os
import traceback
from functools import partial
from io import BytesIO

import requests
from PIL import Image

# helpers, map_builder, dst, map_id, url_list, save_path and conn are
# assumed to come from the rest of your code.


def url_downloader(map_id, url):
    header = {"User-Agent": "Mozilla/5.0 (X11; CrOS "
                            "x86_64 12871.102.0) "
                            "AppleWebKit/537.36 (KHTML, "
                            "like Gecko) "
                            "Chrome/81.0.4044.141 "
                            "Safari/537.36"}
    try:
        response = requests.get(url[0], headers=header)
        response.raise_for_status()  # Check for an error
    except Exception as e:
        print('Download failed:', e, flush=True)
        helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
        return None
    else:
        image_data = response.content
        return [image_data, url[1], url[2]]


# The number of times we will attempt to download and process images:
N_ATTEMPTS = 2


def main():
    ...  # Missing code omitted
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        func = partial(url_downloader, map_id)
        for attempt in range(N_ATTEMPTS):
            data = pool.map(func, url_list)
            # URLs of images that did not download:
            failed_download_arguments = []
            # URLs of images that downloaded but could not be pasted:
            failed_processing_arguments = []
            for index, d in enumerate(data):
                if d is None:
                    # Unable to download image:
                    failed_download_arguments.append(url_list[index])
                else:
                    try:
                        image = Image.open(BytesIO(d[0]))
                        dst.paste(image, (d[1], d[2]))
                        image.close()
                    except Exception as e:
                        # Unable to paste downloaded image:
                        print('Image processing failed:', e)
                        traceback.print_exc()
                        failed_processing_arguments.append(url_list[index])
                        helpers.write_log(os.getcwd(), '{} : {}'.format(map_id, e))
                        map_builder(map_id, save_path, conn)  # What's this?
            # If there are no problems downloading images, then we
            # expect failed_download_arguments to be empty:
            if failed_download_arguments:
                print('Failed to download:', failed_download_arguments)
            if failed_processing_arguments:
                print('Failed to process:', failed_processing_arguments)
            # Set a new url_list with the failed arguments:
            url_list = failed_download_arguments + failed_processing_arguments
            if not url_list:
                # Nothing to retry:
                print('All images successfully downloaded and processed.')
                break


if __name__ == '__main__':
    main()
As an aside: looking at your worker function, url_downloader, which is only downloading a URL, it seems that multithreading would be more appropriate.