python · cluster-analysis · hierarchical-clustering · hamming-distance · phash

How to create and save pairwise Hamming distances of image perceptual hashes for input to a clustering algorithm


I'm hoping somebody can provide guidance on how to compute the pairwise Hamming distance of a bunch of hashes and then cluster them. I'm not too concerned with performance; from the look of what I am doing and what I want to do, it's going to be slow no matter what, and it's not something that will be run over and over.

So... in a nutshell, I had mistakenly erased thousands of photos from a drive and had no backups (I know... bad practice). Using various tools I was able to recover a very high percentage of them from the drive, but was left with hundreds of thousands of photos. Due to the techniques used for recovering some of the photos (such as file carving), some of the images are corrupt to varying degrees, others are identical copies, and yet others are essentially identical visually but differ byte for byte.

What I am looking at doing to help the situation is the following:

  1. check each image and identify whether the image file is structurally corrupt (done)
  2. generate perceptual hashes (fingerprint) for each image so that images can be compared for similarity and clustered (fingerprinting part is done)
  3. calculate the pairwise distance of the fingerprints
  4. cluster on the pairwise distances so that similar images can be viewed together to aid manual cleanup

In the script attached you will notice a couple of places where I calculate hashes; I will explain, so as not to cause confusion...

What I need guidance on is how to accomplish the following:

  1. take the three hashes I have for each image and calculate pairwise Hamming distances
  2. for each image comparison, keep only the Hamming distance that is most similar
  3. feed the results into scipy hierarchical clustering so that I can group similar images

I am just learning Python, so that is part of my challenge... From what I have gathered from Google, I think I can do this by first getting the pairwise distances using scipy.spatial.distance.pdist, then processing those to keep only the most similar distance for each image comparison, and then feeding the result to a SciPy clustering function. But I cannot figure out how to organize this and provide things in the proper format, etc. Can anyone provide some guidance on this?
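To make things concrete, here is roughly the shape of what I think I am after. This is only a sketch; hashes, bit_hamming and best_distance are hypothetical names for illustration (assume hashes maps each filename to its list of hex dhash strings, like the three rotations described above):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def bit_hamming(h1, h2):
    # fraction of differing bits between two equal-length hex hash strings
    b1 = bin(int(h1, 16))[2:].zfill(len(h1) * 4)
    b2 = bin(int(h2, 16))[2:].zfill(len(h2) * 4)
    return sum(c1 != c2 for c1, c2 in zip(b1, b2)) / len(b1)

def best_distance(hashes_a, hashes_b):
    # keep only the most similar (smallest) distance over all hash combinations
    return min(bit_hamming(a, b) for a in hashes_a for b in hashes_b)

filenames = sorted(hashes)                    # fixed ordering for the matrix
n = len(filenames)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = best_distance(hashes[filenames[i]], hashes[filenames[j]])
        dist[i, j] = dist[j, i] = d

condensed = squareform(dist)                  # condensed vector expected by linkage
Z = linkage(condensed, method='single')
labels = fcluster(Z, t=0.2, criterion='distance')   # flat clusters at a chosen cutoff
for label, fname in sorted(zip(labels, filenames)):
    print(label, fname)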

Here is my current script for reference, in case anyone else finds it interesting. I will need to alter it to store some sort of dictionary of hashes, or maybe some sort of on-disk storage.

from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')

corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
for root, dirs, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes = [None] * 4
            print(filename)
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                metadata = pyexiv2.ImageMetadata(filepath)
                metadata.read()
                rotate = 0
                try:
                    im = Image.open(filepath)
                except:
                    pass
                else:
                    for x in range(3):
                        hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)),32)

                # use jpeginfo against all jpg images as it's pretty accurate
                if ext in ('.jpg','.jpeg'):
                    rc = 0
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'

                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e =  sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'

            # raw image processing
            else:
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]

                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                    rotate = 0
                    try:
                        im = Image.open(temp_preview.name)
                    except:
                        pass
                    else:
                        for x in range(4):
                            hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)),32)
                    # close temp file
                    temp_preview.close()

                # try to load raw using libraw via rawpy first, 
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    with rawpy.imread(filepath) as im:
                        pass
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'

                else:
                    # as a final last ditch effort compare perceptual hashes of extracted 
                    # raw and embedded preview to detect possible internal corruption 

                    if len(metadata_orig.previews) > 0:
                        # extract and convert raw to jpeg image using ufraw
                        temp_raw = NamedTemporaryFile(suffix='.jpg')

                        try:
                            check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)

                        except:
                            e = sys.exc_info()[0]
                            corrupt_str = 'Ufraw-conv'

                        else:
                            rhash = imagehash.dhash(Image.open(temp_raw.name),32)

                            # compare preview with raw image and compute the most similar hamming distance (best)
                            hamdiff = .0
                            for h in range(4):
                                # calculate hamming distance to compare similarity
                                hamdiff = max((256 - sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(str(hashes[h]), str(rhash))))/256,hamdiff)

                            if hamdiff < .7: # raw file is probably corrupt
                                corrupt_str = 'hash' + str(round(hamdiff*100,2))
                        # close temp files
                        temp_raw.close()
                        print(hamdiff)
                        print(rhash)

            print(hashes[0])
            print(hashes[1])
            print(hashes[2])
            print(hashes[3])

            # suffix the file name if corruption was detected, ensuring that files already suffixed are re-suffixed
            mo = corruptRegex.search(filename)
            if corrupt_str is not None:
                if mo is not None:
                    os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext))
                else:
                    os.rename(filepath,os.path.join(root, os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext))
            else:
                if mo is not None:
                    os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '', filename) + ext))

EDITED: I just want to provide an update with what I came up with in the end. It seems to work quite nicely for my intended purpose, and maybe it will prove useful for other users in a similar situation. The script can still use some polishing, but otherwise all the meat is there. As I am green with respect to using Python, if anyone sees something that could be improved greatly, please let me know.

The script does the following:

  1. attempts to detect image corruption in terms of file structure using various methods. For raw image formats (NEF, DNG, TIF) I found that a corrupt image could sometimes still load fine, so I decided to hash both the embedded preview image and an extracted .jpg of the raw image and compare the hashes; if they were not similar enough, I assume the image is corrupted in some form.
  2. creates perceptual hashes for each image that could be loaded. Three are created for the base file (original, original rotated 90, original rotated 180). In addition, for raw images an additional 3 hashes are created for the extracted preview image; this was done so that, in cases where the raw image data is corrupted, we still have hashes based on the full image (assuming the preview is fine).
  3. images that are identified as corrupt are renamed with a suffix that indicates they are corrupt and what determined it.
  4. pairwise Hamming distances are computed by comparing the hashes of all file pairs and are stored in a numpy array (see the note on the distance just after this list).
  5. the pairwise distance matrix is converted with scipy's squareform to the condensed form and fed to fastcluster for clustering
  6. output from fastcluster is used to generate a dendrogram to visualize clusters of similar images
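A note on the distance itself (item 4 above): the script compares the hex strings of the hashes character by character, which counts differing hex digits rather than differing bits. If the ImageHash objects are kept instead of their string form, subtracting two hashes should give a bit-level Hamming distance directly. A minimal sketch with placeholder file names:

from PIL import Image
import imagehash

h1 = imagehash.dhash(Image.open('a.jpg'), 32)
h2 = imagehash.dhash(Image.open('b.jpg'), 32)

# hex-character distance, normalised by string length (what the script does)
s1, s2 = str(h1), str(h2)
char_dist = sum(c1 != c2 for c1, c2 in zip(s1, s2)) / len(s1)

# bit-level Hamming distance via ImageHash subtraction, normalised by bit count
bit_dist = (h1 - h2) / h1.hash.size

print(char_dist, bit_dist)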

I save the numpy array to disk so that I can later rerun the fastcluster/dendrogram part without recomputing the hashes for each file, which is slow. This is something I still have to alter the script to allow; a sketch of that reload step follows the script below.

from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
import fastcluster
import matplotlib.pyplot as plt

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')

corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')

hashes = []
filelist = []

for root, _, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        relpath = os.path.relpath(root, sys.argv[1])
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes_tmp = []
            rhash = []
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                try:
                    im=Image.open(filepath)
                    for x in range(3):
                        hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x, expand=1),32)))
                except:
                    pass

                # use jpeginfo against all jpg images as it's pretty accurate
                if ext in ('.jpg','.jpeg'):
                    rc = 0
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'

                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e =  sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'

            # raw image processing
            if ext in (rawimageext):
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]

                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                    try:
                        im = Image.open(temp_preview.name)
                        for x in range(3):
                            hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
                    except:
                        pass


                # try to load raw using libraw via rawpy first, 
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    im = rawpy.imread(filepath)
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'

                else:
                    # as a final last ditch effort compare perceptual hashes of extracted 
                    # raw and embedded preview to detect possible internal corruption 

                    # extract and convert raw to jpeg image using ufraw
                    temp_raw = NamedTemporaryFile(suffix='.jpg')

                    try:
                        check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)

                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'Ufraw-conv'

                    else:
                        try:
                            im = Image.open(temp_raw.name)
                            for x in range(3):
                                rhash.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
                        except:
                            pass

                # compare preview with raw image and compute the most similar hamming distance (best)
                if len(hashes_tmp) > 0 and len(rhash) > 0:
                    hamdiff = 1
                    for rh in rhash:
                        # calculate hamming distance to compare similarity
                        hamdiff = min(hamdiff,(sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(hashes_tmp[0], rh))/len(hashes_tmp[0])))

                        if hamdiff > .3: # raw file is probably corrupt
                            corrupt_str = 'hash' + str(round(hamdiff*100,2))

                hashes_tmp = hashes_tmp + rhash

            # suffix the file name if corruption was detected, ensuring that files already suffixed are re-suffixed
            mo = corruptRegex.search(filename)
            newfilename = None
            if corrupt_str is not None:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext
                else:
                    newfilename = os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext
            else:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '', filename) + ext

            if newfilename is not None:
                os.rename(filepath,os.path.join(root, newfilename))

            if len(hashes_tmp) > 0:
                hashes.append(hashes_tmp)
                if newfilename is not None:
                    filelist.append(os.path.join(relpath, newfilename))
                else:
                    filelist.append(os.path.join(relpath, filename))

print(len(filelist))
print(len(hashes))

a = np.empty(shape=(len(filelist),len(filelist)))

for hash_idx1, hash in enumerate(hashes):
    a[hash_idx1,hash_idx1] = 0
    hash_idx2 = hash_idx1 + 1
    while hash_idx2 < len(hashes):
        ham_dist = 1
        for h1 in hash:
            for h2 in hashes[hash_idx2]:
                ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
        a[hash_idx1,hash_idx2] = ham_dist
        a[hash_idx2,hash_idx1] = ham_dist
        hash_idx2 = hash_idx2 + 1

print(a)

X = squareform(a)
print(X)

linkage = fastcluster.single(X)
clustdict = {i:[i] for i in range(len(linkage)+1)}
fig = plt.figure(figsize=(25,25))
plt.title('test title')
plt.xlabel('perceptual hash hamming distance')

plt.axvline(x=.15,c='red',linestyle='--')
dg = dendrogram(linkage, labels=filelist, orientation='right', show_leaf_counts=True)
ax = fig.gca()
ax.set_xlim(-.01,ax.get_xlim()[1])
plt.show()
plt.savefig('foo1.pdf', bbox_inches='tight', dpi=100)

with open('numpyarray.npy','wb') as f:
    np.save(f,a)
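For reference, reloading the saved array and redoing only the clustering/dendrogram step might look roughly like this. It assumes the file list also gets saved alongside the array; filelist.npy is a hypothetical name, and the script above does not write it yet:

import numpy as np
import fastcluster
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# reload the pairwise distance matrix saved by the script above
a = np.load('numpyarray.npy')
# the file list would need to be saved too, e.g. np.save('filelist.npy', filelist)
filelist = list(np.load('filelist.npy'))

linkage = fastcluster.single(squareform(a))
fig = plt.figure(figsize=(25, 25))
dendrogram(linkage, labels=filelist, orientation='right', show_leaf_counts=True)
plt.savefig('foo2.pdf', bbox_inches='tight', dpi=100)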

Solution

  • It took a while... but I figured things out eventually and ended up with a script that does a pretty good job of identifying whether an image is corrupt, and then uses perceptual hashes to try to group similar images together.

    from PIL import Image, ImageFile
    import os, sys, imagehash, pyexiv2, rawpy, re
    from tempfile import NamedTemporaryFile
    from subprocess import Popen, PIPE
    import shlex
    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster
    from scipy.spatial.distance import squareform
    import fastcluster
    #import matplotlib.pyplot as plt
    import math
    import string
    from wand.image import Image as wImage
    import wand.exceptions
    from io import BytesIO
    from datetime import datetime
    #import fd_table_status
    
    def redirect_stdout():
        print("Redirecting stdout and stderr")
        sys.stdout.flush() # <--- important when redirecting to files
        sys.stderr.flush()
        newstdout = os.dup(1)
        newstderr = os.dup(2)
        devnull = os.open(os.devnull, os.O_WRONLY)
        devnull2 = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, 1)
        os.dup2(devnull2,2)
        os.close(devnull)
        os.close(devnull2)
        sys.stdout = os.fdopen(newstdout, 'w')
        sys.stderr = os.fdopen(newstderr, 'w')
    
    redirect_stdout()
    
    def ct(linkage_matrix,flist,score):
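        # For each leaf (file index) in flist, find the linkage row where it is
        # first merged; if that merge distance is <= score, build a group prefix
        # by walking up the parent merges (row index + number of observations is
        # a merge's cluster id) while their distance stays <= score, prepending
        # each zero-padded row index. Leaves whose first merge is above the
        # threshold get None. The prefix is later prepended to the file name as
        # "[prefix]_" so similar images sort next to each other.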
        cluster_id = []
        for fidx, file_ in enumerate(flist):
            link_ = np.where(linkage_matrix[:,:2] == fidx)[0]
            if len(link_) == 1:
                link = link_[0]
                if linkage_matrix[link][2] <= score:
                    fcluster_idx = str(link).zfill(len(str(len(linkage_matrix))))
                    while True:
                        match = np.where(linkage_matrix[:,:2] == link+1+len(linkage_matrix))[0]
                        if len(match) == 1:
                            link = match[0]
                            link_d = linkage_matrix[link]
                            if link_d[2] <= score:
                                fcluster_idx = str(match[0]).zfill(len(str(len(linkage_matrix)))) + fcluster_idx
                            else:
                                break
                        else:
                            break
                else:
                    fcluster_idx = None
    
                cluster_id.append(fcluster_idx)
    
        return cluster_id
    
    def get_exitcode_stdout_stderr(cmd):
        """
        Execute the external command and get its exitcode, stdout and stderr.
        """
        args = shlex.split(cmd)
    
        proc = Popen(args, stdout=PIPE, stderr=PIPE, close_fds=True)
        out, err = proc.communicate()
        exitcode = proc.returncode
    
        del proc
    
        return exitcode, out, err
    
    if os.path.isdir(sys.argv[1]):
        start_time = datetime.now()
        # allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
        ImageFile.LOAD_TRUNCATED_IMAGES = True
    
        # image files this script will handle
        # PIL supported image formats
        stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
        # libraw/ufraw supported formats
        rawimageext = ('.nef', '.dng', '.tif', '.tiff')
    
        corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
        groupRegex = re.compile(r'^\[\d+\]_')
        ufrawRegex = re.compile(r'Corrupt data near|Unexpected end of file|has the wrong dimensions!|Cannot open file|Cannot decode file|requests a nonexistent image!')
    
        for subdirs,dirs,files in os.walk(sys.argv[1]):
            files.clear()
            dirs.clear()
            for root,_,files in os.walk(subdirs):
                print('\n******** Processing files in ' + root)
                hashes = []
                w_hash = []
                w_hash_idx = []
                filelist = []
                files_ = []
                cnt = 0
                for f in files:
                    #cnt = cnt + 1
                    #if cnt < 10:
                    files_.append(f)
                    continue
                cnt = 0
    
                for f_idx, fname in enumerate(files_):
                    e=None
                    ext = os.path.splitext(fname.lower())[1]
                    filepath = os.path.join(root, fname)
    
                    imformat = ''
                    hashes_tmp = []
    
                    # reset corrupt string
                    corrupt_str = None
    
                    if ext in (stdimageext + rawimageext):
                        print(str(int(round(((f_idx+1)/len(files_))*100))) + '%' + ' : ' + fname + '....', end='', flush=True)
                        try:
                            with wImage(filename=filepath) as im:
                                imformat = '.' + im.format.lower()
                                ext = imformat if imformat != '' else ext
                                with im.convert('jpeg') as converted:
                                    jpeg_bin = converted.make_blob()
                                    with Image.open(BytesIO(jpeg_bin)) as im2:
                                        hash_image = []
                                        for x in range(3):
                                            print('.',end='',flush=True)
                                            hash_i = str(imagehash.dhash(im2.rotate(90 * x, expand=1),32))
                                            if ''.join(set(hash_i)) != '0':
                                                hash_image.append(hash_i)
                                        if hash_image:
                                            hash_image.append(1)
                                            hashes_tmp.append(hash_image)
                        except:
                            e = sys.exc_info()[0]
                            errcode = str([k for k, v in wand.exceptions.TYPE_MAP.items() if v == e][0]).zfill(3)
                            if int(errcode[-2:]) in (15,25,30,35,40,50,55):
                                corrupt_str = 'magick'
                        finally:
                            try:
                                im.close()
                            except:
                                pass
                            try:
                                im2.close()
                            except:
                                pass
    
                        if ext in (stdimageext):
                            try:
                                with Image.open(filepath) as im:
                                    hash_image = []
                                    for x in range(3):
                                        print('.',end='',flush=True)
                                        hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                        if ''.join(set(hash_i)) != '0':
                                            hash_image.append(hash_i)
                                    if hash_image:
                                        hash_image.append(2)
                                        hashes_tmp.append(hash_image)
                            except:
                                pass
                            finally:
                                try:
                                    im.close()
                                except:
                                    pass
    
                            # use jpeginfo against all jpg images as it's pretty accurate
                            if ext in ('.jpg','.jpeg'):
                                #rc = 0
                                print('.',end='',flush=True)
                                cmd = 'jpeginfo --check "' + filepath + '"'
                                exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                                #rc = call(["jpeginfo", "--check", filepath], stdout=DEVNULL, stderr=DEVNULL, close_fds=True)
                                if exitcode == 1:
                                    corrupt_str = 'JpegInfo' if corrupt_str == None else corrupt_str
                                #del rc
    
                            if corrupt_str is None:
                                try:
                                    with Image.open(filepath) as im:
                                        print('.',end='',flush=True)
                                        im.verify()
                                except:
                                    e = sys.exc_info()[0]
                                    corrupt_str = 'PIL_Verify' if corrupt_str == None else corrupt_str
                                else:
                                    try:
                                        with Image.open(filepath) as im:
                                            print('.',end='',flush=True)
                                            temp = im.copy()
                                            im.load()
                                    except:
                                        e =  sys.exc_info()[0]
                                        corrupt_str = 'PIL_Load' if corrupt_str == None else corrupt_str
                                    finally:
                                        try:
                                            temp.close()
                                        except:
                                            pass
                                        try:
                                            im.close()
                                        except:
                                            pass
                                finally:
                                    try:
                                        im.close()
                                    except:
                                        pass
                                    try:
                                        temp.close()
                                    except:
                                        pass
    
                        # raw image processing
                        if ext in (rawimageext):
                            print('.',end='',flush=True)
                            # try to load raw using libraw via rawpy first, 
                            # generally if libraw can't load it then ufraw extraction would also fail
                            if corrupt_str == None:
                                try:
                                    with rawpy.imread(filepath) as raw:
                                        rgb = raw.postprocess(use_camera_wb=True)
                                        temp_raw = NamedTemporaryFile(suffix='.jpg')
                                        Image.fromarray(rgb).save(temp_raw.name)
                                        with Image.open(temp_raw.name) as im:
                                            hash_image = []
                                            for x in range(3):
                                                print('.',end='',flush=True)
                                                hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                                if ''.join(set(hash_i)) != '0':
                                                    hash_image.append(hash_i)
                                            if hash_image:
                                                hash_image.append(3)
                                                hashes_tmp.append(hash_image)
    
                                except(rawpy.LibRawFatalError):
                                    e = sys.exc_info()[1]
                                    corrupt_str = 'Libraw_FE'
                                except(rawpy.LibRawNonFatalError):
                                    e = sys.exc_info()[1]
                                    corrupt_str = 'Libraw_NFE'
                                except:
                                    #print(sys.exc_info())
                                    corrupt_str = 'Libraw'
    
                                finally:
                                    try:
                                        im.close()
                                    except:
                                        pass
                                    try:
                                        temp_raw.close()
                                    except:
                                        pass
                                    try:
                                        raw.close()
                                    except:
                                        pass
                                if corrupt_str == None:
                                    # as a final last ditch effort compare perceptual hashes of extracted 
                                    # raw and embedded preview to detect possible internal corruption 
    
                                    # extract and convert raw to jpeg image using ufraw
                                    temp_raw = NamedTemporaryFile(suffix='.jpg')
                                    #rc = 0
                                    cmd = 'ufraw-batch --wb=camera --rotate=camera --out-type=jpg --compression=95 --noexif --lensfun=none --auto-crop --output=' + temp_raw.name + ' --overwrite "' + filepath + '"'
                                    print('.',end='',flush=True)
                                    exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                                    if exitcode == 1 or ufrawRegex.search(str(err)) is not None:
                                        corrupt_str = 'Ufraw' if corrupt_str is None else corrupt_str
    
                                    tmpfilesize = os.stat(temp_raw.name).st_size
                                    if tmpfilesize > 0:
                                        try:
                                            with Image.open(temp_raw.name) as im:
                                                hash_image = []
                                                for x in range(3):
                                                    print('.',end='',flush=True)
                                                    hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                                    if ''.join(set(hash_i)) != '0':
                                                        hash_image.append(hash_i)
                                                if hash_image:
                                                    hash_image.append(4)
                                                    hashes_tmp.append(hash_image)
                                        except:
                                            pass
                                        finally:
                                            try:
                                                im.close()
                                            except:
                                                pass
                                    try:
                                        temp_raw.close()
                                    except:
                                        pass
    
    
                            # attempt to extract preview images
                            imfile = filepath
                            try:
                                with pyexiv2.ImageMetadata(imfile) as metadata_orig:
                                    metadata_orig.read()
                                    #for i,p in enumerate(metadata_orig.previews):
                                    if metadata_orig.previews:
                                        preview = metadata_orig.previews[-1]
                                        # save preview to temp file
                                        temp_preview = NamedTemporaryFile()
                                        preview.write_to_file(temp_preview.name)
                                        os.rename(temp_preview.name + preview.extension, temp_preview.name)
    
                                        try:
                                            with Image.open(temp_preview.name) as im:
                                                hash_image = []
                                                for x in range(3):
                                                    print('.',end='',flush=True)
                                                    hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                                    if ''.join(set(hash_i)) != '0':
                                                        hash_image.append(hash_i)
                                                if hash_image:
                                                    hash_image.append(5)
                                                    hashes_tmp.append(hash_image)
                                        except:
                                            pass
                                        finally:
                                            try:
                                                temp_preview.close()
                                            except:
                                                pass
                                            try:
                                                im.close()
                                            except:
                                                pass
                            except:
                                pass
                            finally:
                                try:
                                    metadata_orig.close()
                                except:
                                    pass
    
                        # compare hashes for all images that were found or extracted and find most dissimilar hamming distance (worst)
                        if len(hashes_tmp) > 1:
                            #print('checking_hashes')
                            print('.',end='',flush=True)
                            scores = []
    
                            for h_idx, hash in enumerate(hashes_tmp):
                                i = h_idx + 1
                                while i < len(hashes_tmp):
                                    ham_dist = 1
                                    for h1 in hash[:-1]:
                                        for h2 in hashes_tmp[i][:-1]:
                                            ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
                                    if (hash[-1] == 5 and hashes_tmp[i][-1] != 5) or (hash[-1] != 5 and hashes_tmp[i][-1] == 5):
                                        scores.append([ham_dist,hash[-1],hashes_tmp[i][-1]])
                                    i = i + 1
                            if scores:
                                worst = sorted(scores, key = lambda x: x[0])[-1]
    
                                if worst[0] > 0.3:
                                    worst1 = str(worst[1])
                                    worst2 = str(worst[2])
                                    corrupt_str = 'hash' + str(round(worst[0]*100,2)) + '_' + worst1 + '-' + worst2 if corrupt_str == None else corrupt_str
    
                        # suffix the file name if corruption was detected, ensuring that files already suffixed are re-suffixed
                        mo = corruptRegex.search(fname)
                        newfilename = None
                        if corrupt_str is not None:
                            print('Corrupt: ' + corrupt_str)
                            if mo is not None:
                                newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', fname) + ext
                            else:
                                newfilename = os.path.splitext(fname)[0] + '_[' + corrupt_str + ']' + ext
                        else:
                            print('OK!')
                            if mo is not None:
                                newfilename = re.sub(corruptRegex, '', fname) + ext
    
                        # remove group index from name if present, this will be assigned in the next step if needed
                        newfilename = newfilename if newfilename is not None else fname
                        mo = groupRegex.search(newfilename)
                        if mo is not None:
                            newfilename = re.sub(groupRegex, '', newfilename)
    
                        if hashes_tmp:
                            # set function unduplicates flattened list
                            hashes.append(set([item for sublist in hashes_tmp for item in sublist[:-1]]))
    
                        filelist.append([root,fname,newfilename, len(hashes_tmp)])
    
    
                print('******** Grouping similar images... ************')
                if len(hashes) > 1:
                    scores = []
                    for h_idx, hash in enumerate(hashes):
                        i = h_idx + 1
                        while i < len(hashes):
                            ham_dist = 1
                            for h1 in hash:
                                for h2 in hashes[i]:
                                    ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
                            scores.append(ham_dist)
                            i = i + 1
                    X = np.array(scores)
    
                    linkage = fastcluster.single(X)
                    w_hash_idx = [el_idx for el_idx, el in enumerate(filelist) if el[3] > 0]
                    w_hash = [filelist[i] for i in w_hash_idx]
    
                    test=ct(linkage,[el[2] for el in w_hash],.2)
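                    # Note: scipy's fcluster could presumably give flat labels at the
                    # same .2 cutoff, e.g. fcluster(linkage, t=.2, criterion='distance'),
                    # but the ct() walk above encodes the chain of sub-threshold merges
                    # in the prefix, which also makes related groups sort near each
                    # other by name.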
                    for i, prfx in enumerate(test):
                        curfilename = w_hash[i][2]
    
                        mo = groupRegex.search(curfilename)
                        newfilename = None
    
                        if prfx is not None:
                            if mo is not None:
                                newfilename = re.sub(groupRegex, '[' + prfx + ']_', curfilename)
                            else:
                                newfilename = '[' + prfx + ']_' + curfilename
                        else:
                            if mo is not None:
                                newfilename = re.sub(groupRegex, '', curfilename)
    
                    #    if newfilename is not None:
                        filelist[w_hash_idx[i]][2] = newfilename if newfilename is not None else curfilename
    
                    #fig = plt.figure(figsize=(25,25))
                    #plt.title(root)
                    #plt.xlabel('perceptual hash hamming distance')
    
                    #plt.axvline(x=.15,c='red',linestyle='--')
                    #dg = dendrogram(linkage, labels=[el[2] for el in w_hash], orientation='right', show_leaf_counts=True)
                    #ax = fig.gca()
                    #ax.set_xlim(-.02,ax.get_xlim()[1])
                    #plt.show
                    #plt.savefig(os.path.join(root,'dendrogram.pdf'), bbox_inches='tight', dpi=100)
                    w_hash.clear()
                    w_hash_idx.clear()
                print('******** Renaming files if applicable... ************')
                for fr in filelist:
                    if fr[1] != fr[2]:
                        #print(fr[1] + ' -- ' + fr[2])
                        path = fr[0]
                        os.rename(os.path.join(path,fr[1]),os.path.join(path,fr[2]))
    
    
                filelist.clear()
    
        duration = datetime.now() - start_time
        days    = divmod(duration.total_seconds(), 86400)        # Get days (without [0]!)
        hours   = divmod(days[1], 3600)               # Use remainder of days to calc hours
        minutes = divmod(hours[1], 60)                # Use remainder of hours to calc minutes
        seconds = divmod(minutes[1], 1)               # Use remainder of minutes to calc seconds
        print("Time to complete: %d days, %d:%d:%d" % (days[0], hours[0], minutes[0], seconds[0]))