python cluster-analysis hierarchical-clustering hamming-distance phash

How to create and save pairwise hamming distances of image perceptual hashes for imput to clustering algorithm

hoping somebody can provide guidance as to how to compute pairwise hamming distance of a bunch of hashes and then cluster them. I don't care so much as to performance as from looking at what I am doing and what I want to do its going to be slow no matter what and its not something that will be repeatedly run over and over.

So...in a nutshell I had mistakenly erased 1000's of photos off a drive and had no backups (I know...bad practise). Using various tools I was able to recover a very high % of them from the drive but was left with hundreds of 1000's of photos. Due to the techniques used for recovering some of the photos (such as file carving) some of the images are corrupt to various degrees, others were identical copies, and yet others were essentially identical visually but byte for byte were different.

What I am looking at doing to help the situation is the following:

check each image and identify if image file is structurally corrupt or not (done)
generate perceptual hashes (fingerprint) for each image so that images can be compared for similarity and clustered (fingerprinting part is done)
calculate the pairwise distance of the fingerprints
cluster the pairwise distance so that similar images can be viewed together to aide manual cleanup

In the script attached you will notice a couple of places I calculate hashes, I will explain as to not cause confusion...

for images that are supported by PIL I generate three hashes, 1st for original image, 2nd is rotated 90 degrees, and 3rd is rotated 180 degrees. This was done so that when the pairwise calculations are done I can account for images that just vary in orientation.
for raw images not supported by PIL I instead favour hashes that are generated from the extracted embedded preview image. I did this instead of using the raw image because in the case of a corrupt raw image file there was a high chance of probability that the preview image was intact due to its smaller size and thus would be better for identifying if the image is similar to others
other place hashes are generated is during a last ditch effort to identify corrupt raw images. I would compare hashes of the extracted/converted raw image to that of the extracted embedded preview image and if the similarity does not meet a defined threshold it is assumed that there is probably corruption of the raw file as a whole.

What I need guidance on is how to accomplish the following:

take the three hashes I have for each image and calculate hamming pairwise distances
for each image comparison keep only the hamming distance that is most similar
feed the results into scipy hierarchical clustering so that I can group similar images

I am just learning Python so that is part of the my challenge... I think from what I have got from google I can do this by first getting the pairwise distances using scipy.spatial.distance.pdist, then process this to keep the most similar distance for each image comparison, then feed this to a scipy clustering function. But I cannot figure how to organize this and provide things in the proper format etc. Can anyone provide some guidance on this?

Here is my current script for reference in case anyone else finds it interesting that I will need to alter for storing some sort of dictionary of hashes or maybe some sort of on disk storage.

from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')

corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
for root, dirs, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes = [None] * 4
            print(filename)
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                metadata = pyexiv2.ImageMetadata(filepath)
                metadata.read()
                rotate = 0
                try:
                    im = Image.open(filepath)
                except:
                    None
                else:
                    for x in range(3):
                        hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)),32)

                # use jpeginfo against all jpg images as its pretty accurate
                if ext in ('.jpg','.jpeg'):
                    rc = 0
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'

                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e =  sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'

            # raw image processing
            else:
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]

                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                    rotate = 0
                    try:
                        im = Image.open(temp_preview.name)
                    except:
                        None
                    else:
                        for x in range(4):
                            hashes[x] = imagehash.dhash(im.rotate(90 * (x + 1)),32)
                    # close temp file
                    temp_preview.close()

                # try to load raw using libraw via rawpy first, 
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    with rawpy.imread(filepath) as im:
                        None
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'

                else:
                    # as a final last ditch effort compare perceptual hashes of extracted 
                    # raw and embedded preview to detect possible internal corruption 

                    if len(metadata_orig.previews) > 0:
                        # extract and convert raw to jpeg image using ufraw
                        temp_raw = NamedTemporaryFile(suffix='.jpg')

                        try:
                            check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)

                        except:
                            e = sys.exc_info()[0]
                            corrupt_str = 'Ufraw-conv'

                        else:
                            rhash = imagehash.dhash(Image.open(temp_raw.name),32)

                            # compare preview with raw image and compute the most similar hamming distance (best)
                            hamdiff = .0
                            for h in range(4):
                                # calculate hamming distance to compare similarity
                                hamdiff = max((256 - sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(str(hashes[h]), str(rhash))))/256,hamdiff)

                            if hamdiff < .7: # raw file is probably corrupt
                                corrupt_str = 'hash' + str(round(hamdiff*100,2))
                        # close temp files
                        temp_raw.close()
                        print(hamdiff)
                        print(rhash)

            print(hashes[0])
            print(hashes[1])
            print(hashes[2])
            print(hashes[3])

            # prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
            mo = corruptRegex.search(filename)
            if corrupt_str is not None:
                if mo is not None:
                    os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext))
                else:
                    os.rename(filepath,os.path.join(root, os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext))
            else:
                if mo is not None:
                    os.rename(filepath,os.path.join(root, re.sub(corruptRegex, '', filename) + ext))

EDITED Just want to provide an update with what I came up with in the end that seems to work quite nicely for my intended purpose and maybe it will prove useful for other users in a similar situation. The script can still use some polishing but otherwise all the meat is there. As I am green with respect using Python if anyone see something that can improved greatly please let me know.

The script does the following:

attempts to detect image corruption in terms of file structure using various methods. For raw image formats (NEF, DNG, TIF) sometimes I found that a corrupt image could still load fine so I decided to hash both the preview image and an extracted .jpg of the raw image and compare the hashes and if they were not similar enough I assume that image is corrupted in some form.
create perceptual hashes for each image that could be loaded. Three are created for the base file (original, original rotated 90, original rotated 180). In addition, for raw images an additional 3 hashes were created for the extracted preview image, this was done so that in cases where the raw image was corrupted we would still have hashes based on the full image (assuming the preview is fine).
for images that are identified as corrupt they are renamed with a suffix that indicates that are corrupt and what determined it.
Pairwise hamming distances were computed by comparing hashes against all file pairs and stored in a numpy array.
square form of pairwise distances are fed to fastcluster for clustering
output from fastcluster is used to generate a dendrogram to visualize clusters of similar images

I save the numpy array to disk so that I can later rerun the fastcluster/dendrogram part without recomputing the hashes for each file which is slow. This is something I have to alter the script to allow yet....

from PIL import Image
from PIL import ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import check_call, call
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial.distance import squareform
import fastcluster
import matplotlib.pyplot as plt

# allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
ImageFile.LOAD_TRUNCATED_IMAGES = True

# image files this script will handle
# PIL supported image formats
stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
# libraw/ufraw supported formats
rawimageext = ('.nef', '.dng', '.tif', '.tiff')

devnull = open(os.devnull, 'w')

corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')

hashes = []
filelist = []

for root, _, files in os.walk(sys.argv[1]):
    for filename in files:
        ext = os.path.splitext(filename.lower())[1]
        relpath = os.path.relpath(root, sys.argv[1])
        filepath = os.path.join(root, filename)
        if ext in (stdimageext + rawimageext):
            hashes_tmp = []
            rhash = []
            # reset corrupt string
            corrupt_str = None
            if ext in (stdimageext):
                try:
                    im=Image.open(filepath)
                    for x in range(3):
                        hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x, expand=1),32)))
                except:
                    None

                # use jpeginfo against all jpg images as its pretty accurate
                if ext in ('.jpg','.jpeg'):
                    rc = 0
                    rc = call(["jpeginfo", "--check", filepath], stdout=devnull, stderr=devnull)
                    if rc == 1:
                        corrupt_str = 'JpegInfo'

                if corrupt_str is None:
                    try:
                        im = Image.open(filepath)
                        im.verify()
                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'PIL_Verify'
                    else:
                        try:
                            im = Image.open(filepath)
                            im.load()
                        except:
                            e =  sys.exc_info()[0]
                            corrupt_str = 'PIL_Load'

            # raw image processing
            if ext in (rawimageext):
                # extract largest embedded preview image first
                metadata_orig = pyexiv2.ImageMetadata(filepath)
                metadata_orig.read()
                if len(metadata_orig.previews) > 0:
                    preview = metadata_orig.previews[-1]

                    # save preview to temp file
                    temp_preview = NamedTemporaryFile()
                    preview.write_to_file(temp_preview.name)
                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                    try:
                        im = Image.open(temp_preview.name)
                        for x in range(3):
                            hashes_tmp.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
                    except:
                        None


                # try to load raw using libraw via rawpy first, 
                # generally if libraw can't load it then ufraw extraction would also fail
                try:
                    im = rawpy.imread(filepath)
                except:
                    e = sys.exc_info()[0]
                    corrupt_str = 'Libraw_Load'

                else:
                    # as a final last ditch effort compare perceptual hashes of extracted 
                    # raw and embedded preview to detect possible internal corruption 

                    # extract and convert raw to jpeg image using ufraw
                    temp_raw = NamedTemporaryFile(suffix='.jpg')

                    try:
                        check_call(['ufraw-batch', '--wb=camera', '--rotate=camera', '--out-type=jpg', '--compression=95', '--noexif', '--lensfun=none', '--output=' + temp_raw.name, '--overwrite', '--silent', filepath],stdout=devnull, stderr=devnull)

                    except:
                        e = sys.exc_info()[0]
                        corrupt_str = 'Ufraw-conv'

                    else:
                        try:
                            im = Image.open(temp_raw.name)
                            for x in range(3):
                                rhash.append(str(imagehash.dhash(im.rotate(90 * x,expand=1),32)))
                        except:
                            None

                # compare preview with raw image and compute the most similar hamming distance (best)
                if len(hashes_tmp) > 0 and len(rhash) > 0:
                    hamdiff = 1
                    for rh in rhash:
                        # calculate hamming distance to compare similarity
                        hamdiff = min(hamdiff,(sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(hashes_tmp[0], rh))/len(hashes_tmp[0])))

                        if hamdiff > .3: # raw file is probably corrupt
                            corrupt_str = 'hash' + str(round(hamdiff*100,2))

                hashes_tmp = hashes_tmp + rhash

            # prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
            mo = corruptRegex.search(filename)
            newfilename = None
            if corrupt_str is not None:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', filename) + ext
                else:
                    newfilename = os.path.splitext(filename)[0] + '_[' + corrupt_str + ']' + ext
            else:
                if mo is not None:
                    newfilename = re.sub(corruptRegex, '', filename) + ext

            if newfilename is not None:
                os.rename(filepath,os.path.join(root, newfilename))

            if len(hashes_tmp) > 0:
                hashes.append(hashes_tmp)
                if newfilename is not None:
                    filelist.append(os.path.join(relpath, newfilename))
                else:
                    filelist.append(os.path.join(relpath, filename))

print(len(filelist))
print(len(hashes))

a = np.empty(shape=(len(filelist),len(filelist)))

for hash_idx1, hash in enumerate(hashes):
    a[hash_idx1,hash_idx1] = 0
    hash_idx2 = hash_idx1 + 1
    while hash_idx2 < len(hashes):
        ham_dist = 1
        for h1 in hash:
            for h2 in hashes[hash_idx2]:
                ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
        a[hash_idx1,hash_idx2] = ham_dist
        a[hash_idx2,hash_idx1] = ham_dist
        hash_idx2 = hash_idx2 + 1

print(a)

X = squareform(a)
print(X)

linkage = fastcluster.single(X)
clustdict = {i:[i] for i in range(len(linkage)+1)}
fig = plt.figure(figsize=(25,25))
plt.title('test title')
plt.xlabel('perpetual hash hamming distance')

plt.axvline(x=.15,c='red',linestyle='--')
dg = dendrogram(linkage, labels=filelist, orientation='right', show_leaf_counts=True)
ax = fig.gca()
ax.set_xlim(-.01,ax.get_xlim()[1])
plt.show
plt.savefig('foo1.pdf', bbox_inches='tight', dpi=100)

with open('numpyarray.npy','wb') as f:
    np.save(f,a)

Solution

It took awhile...but I figured things out eventually and got a script that does a pretty good job of identifying if an image is corrupt and then uses perceptual hashes to try and group similar images together.

from PIL import Image, ImageFile
import os, sys, imagehash, pyexiv2, rawpy, re
from tempfile import NamedTemporaryFile
from subprocess import Popen, PIPE
import shlex
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster
from scipy.spatial.distance import squareform
import fastcluster
#import matplotlib.pyplot as plt
import math
import string
from wand.image import Image as wImage
import wand.exceptions
from io import BytesIO
from datetime import datetime
#import fd_table_status

def redirect_stdout():
    print("Redirecting stdout and stderr")
    sys.stdout.flush() # <--- important when redirecting to files
    sys.stderr.flush()
    newstdout = os.dup(1)
    newstderr = os.dup(2)
    devnull = os.open(os.devnull, os.O_WRONLY)
    devnull2 = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull, 1)
    os.dup2(devnull2,2)
    os.close(devnull)
    os.close(devnull2)
    sys.stdout = os.fdopen(newstdout, 'w')
    sys.stderr = os.fdopen(newstderr, 'w')

redirect_stdout()

def ct(linkage_matrix,flist,score):
    cluster_id = []
    for fidx, file_ in enumerate(flist):
        link_ = np.where(linkage_matrix[:,:2] == fidx)[0]
        if len(link_) == 1:
            link = link_[0]
            if linkage_matrix[link][2] <= score:
                fcluster_idx = str(link).zfill(len(str(len(linkage_matrix))))
                while True:
                    match = np.where(linkage_matrix[:,:2] == link+1+len(linkage_matrix))[0]
                    if len(match) == 1:
                        link = match[0]
                        link_d = linkage_matrix[link]
                        if link_d[2] <= score:
                            fcluster_idx = str(match[0]).zfill(len(str(len(linkage_matrix)))) + fcluster_idx
                        else:
                            break
                    else:
                        break
            else:
                fcluster_idx = None

            cluster_id.append(fcluster_idx)

    return cluster_id

def get_exitcode_stdout_stderr(cmd):
    """
    Execute the external command and get its exitcode, stdout and stderr.
    """
    args = shlex.split(cmd)

    proc = Popen(args, stdout=PIPE, stderr=PIPE, close_fds=True)
    out, err = proc.communicate()
    exitcode = proc.returncode

    del proc

    return exitcode, out, err

if os.path.isdir(sys.argv[1]):
    start_time = datetime.now()
    # allow PIL to load truncated images (so that perceptual hashes can be created for truncated/damaged images still)
    ImageFile.LOAD_TRUNCATED_IMAGES = True

    # image files this script will handle
    # PIL supported image formats
    stdimageext = ('.jpg','.jpeg', '.bmp', '.png', '.gif', '.tif', '.tiff')
    # libraw/ufraw supported formats
    rawimageext = ('.nef', '.dng', '.tif', '.tiff')

    corruptRegex = re.compile(r'_\[.+\]\..{3,4}$')
    groupRegex = re.compile(r'^\[\d+\]_')
    ufrawRegex = re.compile(r'Corrupt data near|Unexpected end of file|has the wrong dimensions!|Cannot open file|Cannot decode file|requests a nonexistent image!')

    for subdirs,dirs,files in os.walk(sys.argv[1]):
        files.clear()
        dirs.clear()
        for root,_,files in os.walk(subdirs):
            print('\n******** Processing files in ' + root)
            hashes = []
            w_hash = []
            w_hash_idx = []
            filelist = []
            files_ = []
            cnt = 0
            for f in files:
                #cnt = cnt + 1
                #if cnt < 10:
                files_.append(f)
                continue
            cnt = 0

            for f_idx, fname in enumerate(files_):
                e=None
                ext = os.path.splitext(fname.lower())[1]
                filepath = os.path.join(root, fname)

                imformat = ''
                hashes_tmp = []

                # reset corrupt string
                corrupt_str = None

                if ext in (stdimageext + rawimageext):
                    print(str(int(round(((f_idx+1)/len(files_))*100))) + '%' + ' : ' + fname + '....', end='', flush=True)
                    try:
                        with wImage(filename=filepath) as im:
                            imformat = '.' + im.format.lower()
                            ext = imformat if imformat is not '' else ext
                            with im.convert('jpeg') as converted:
                                jpeg_bin = converted.make_blob()
                                with Image.open(BytesIO(jpeg_bin)) as im2:
                                    hash_image = []
                                    for x in range(3):
                                        print('.',end='',flush=True)
                                        hash_i = str(imagehash.dhash(im2.rotate(90 * x, expand=1),32))
                                        if ''.join(set(hash_i)) != '0':
                                            hash_image.append(hash_i)
                                    if hash_image:
                                        hash_image.append(1)
                                        hashes_tmp.append(hash_image)
                    except:
                        e = sys.exc_info()[0]
                        errcode = str([k for k, v in wand.exceptions.TYPE_MAP.items() if v == e][0]).zfill(3)
                        if int(errcode[-2:]) in (15,25,30,35,40,50,55):
                            corrupt_str = 'magick'
                    finally:
                        try:
                            im.close()
                        except:
                            pass
                        try:
                            im2.close()
                        except:
                            pass

                    if ext in (stdimageext):
                        try:
                            with Image.open(filepath) as im:
                                hash_image = []
                                for x in range(3):
                                    print('.',end='',flush=True)
                                    hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                    if ''.join(set(hash_i)) != '0':
                                        hash_image.append(hash_i)
                                if hash_image:
                                    hash_image.append(2)
                                    hashes_tmp.append(hash_image)
                        except:
                            pass
                        finally:
                            try:
                                im.close()
                            except:
                                pass

                        # use jpeginfo against all jpg images as its pretty accurate
                        if ext in ('.jpg','.jpeg'):
                            #rc = 0
                            print('.',end='',flush=True)
                            cmd = 'jpeginfo --check "' + filepath + '"'
                            exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                            #rc = call(["jpeginfo", "--check", filepath], stdout=DEVNULL, stderr=DEVNULL, close_fds=True)
                            if exitcode == 1:
                                corrupt_str = 'JpegInfo' if corrupt_str == None else corrupt_str
                            #del rc

                        if corrupt_str is None:
                            try:
                                with Image.open(filepath) as im:
                                    print('.',end='',flush=True)
                                    im.verify()
                            except:
                                e = sys.exc_info()[0]
                                corrupt_str = 'PIL_Verify' if corrupt_str == None else corrupt_str
                            else:
                                try:
                                    with Image.open(filepath) as im:
                                        print('.',end='',flush=True)
                                        temp = im.copy()
                                        im.load()
                                except:
                                    e =  sys.exc_info()[0]
                                    corrupt_str = 'PIL_Load' if corrupt_str == None else corrupt_str
                                finally:
                                    try:
                                        temp.close()
                                    except:
                                        pass
                                    try:
                                        im.close()
                                    except:
                                        pass
                            finally:
                                try:
                                    im.close()
                                except:
                                    pass
                                try:
                                    temp.close()
                                except:
                                    pass

                    # raw image processing
                    if ext in (rawimageext):
                        print('.',end='',flush=True)
                        # try to load raw using libraw via rawpy first, 
                        # generally if libraw can't load it then ufraw extraction would also fail
                        if corrupt_str == None:
                            try:
                                with rawpy.imread(filepath) as raw:
                                    rgb = raw.postprocess(use_camera_wb=True)
                                    temp_raw = NamedTemporaryFile(suffix='.jpg')
                                    Image.fromarray(rgb).save(temp_raw.name)
                                    with Image.open(temp_raw.name) as im:
                                        hash_image = []
                                        for x in range(3):
                                            print('.',end='',flush=True)
                                            hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                            if ''.join(set(hash_i)) != '0':
                                                hash_image.append(hash_i)
                                        if hash_image:
                                            hash_image.append(3)
                                            hashes_tmp.append(hash_image)

                            except(rawpy.LibRawFatalError):
                                e = sys.exc_info()[1]
                                corrupt_str = 'Libraw_FE'
                            except(rawpy.LibRawNonFatalError):
                                e = sys.exc_info()[1]
                                corrupt_str = 'Libraw_NFE'
                            except:
                                #print(sys.exc_info())
                                corrupt_str = 'Libraw'

                            finally:
                                try:
                                    im.close()
                                except:
                                    pass
                                try:
                                    temp_raw.close()
                                except:
                                    pass
                                try:
                                    raw.close()
                                except:
                                    pass
                            if corrupt_str == None:
                                # as a final last ditch effort compare perceptual hashes of extracted 
                                # raw and embedded preview to detect possible internal corruption 

                                # extract and convert raw to jpeg image using ufraw
                                temp_raw = NamedTemporaryFile(suffix='.jpg')
                                #rc = 0
                                cmd = 'ufraw-batch --wb=camera --rotate=camera --out-type=jpg --compression=95 --noexif --lensfun=none --auto-crop --output=' + temp_raw.name + ' --overwrite "' + filepath + '"'
                                print('.',end='',flush=True)
                                exitcode, out, err = get_exitcode_stdout_stderr(cmd)
                                if exitcode == 1 or ufrawRegex.search(str(err)) is not None:
                                    corrupt_str = 'Ufraw' if corrupt_str is None else corrupt_str

                                tmpfilesize = os.stat(temp_raw.name).st_size
                                if tmpfilesize > 0:
                                    try:
                                        with Image.open(temp_raw.name) as im:
                                            hash_image = []
                                            for x in range(3):
                                                print('.',end='',flush=True)
                                                hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                                if ''.join(set(hash_i)) != '0':
                                                    hash_image.append(hash_i)
                                            if hash_image:
                                                hash_image.append(4)
                                                hashes_tmp.append(hash_image)
                                    except:
                                        pass
                                    finally:
                                        try:
                                            im.close()
                                        except:
                                            pass
                                try:
                                    temp_raw.close()
                                except:
                                    pass


                        # attempt to extract preview images
                        imfile = filepath
                        try:
                            with pyexiv2.ImageMetadata(imfile) as metadata_orig:
                                metadata_orig.read()
                                #for i,p in enumerate(metadata_orig.previews):
                                if metadata_orig.previews:
                                    preview = metadata_orig.previews[-1]
                                    # save preview to temp file
                                    temp_preview = NamedTemporaryFile()
                                    preview.write_to_file(temp_preview.name)
                                    os.rename(temp_preview.name + preview.extension, temp_preview.name)

                                    try:
                                        with Image.open(temp_preview.name) as im:
                                            hash_image = []
                                            for x in range(3):
                                                print('.',end='',flush=True)
                                                hash_i = str(imagehash.dhash(im.rotate(90 * x, expand=1),32))
                                                if ''.join(set(hash_i)) != '0':
                                                    hash_image.append(hash_i)
                                            if hash_image:
                                                hash_image.append(5)
                                                hashes_tmp.append(hash_image)
                                    except:
                                        pass
                                    finally:
                                        try:
                                            temp_preview.close()
                                        except:
                                            pass
                                        try:
                                            im.close()
                                        except:
                                            pass
                        except:
                            pass
                        finally:
                            try:
                                metadata_orig.close()
                            except:
                                pass

                    # compare hashes for all images that were found or extracted and find most dissimilar hamming distance (worst)
                    if len(hashes_tmp) > 1:
                        #print('checking_hashes')
                        print('.',end='',flush=True)
                        scores = []

                        for h_idx, hash in enumerate(hashes_tmp):
                            i = h_idx + 1
                            while i < len(hashes_tmp):
                                ham_dist = 1
                                for h1 in hash[:-1]:
                                    for h2 in hashes_tmp[i][:-1]:
                                        ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
                                if (hash[-1] == 5 and hashes_tmp[i][-1] != 5) or (hash[-1] != 5 and hashes_tmp[i][-1] == 5):
                                    scores.append([ham_dist,hash[-1],hashes_tmp[i][-1]])
                                i = i + 1
                        if scores:
                            worst = sorted(scores, key = lambda x: x[0])[-1]

                            if worst[0] > 0.3:
                                worst1 = str(worst[1])
                                worst2 = str(worst[2])
                                corrupt_str = 'hash' + str(round(worst[0]*100,2)) + '_' + worst1 + '-' + worst2 if corrupt_str == None else corrupt_str

                    # prefix file if corruption was detected ensuring that existing files already prefixed are re prefixed
                    mo = corruptRegex.search(fname)
                    newfilename = None
                    if corrupt_str is not None:
                        print('Corrupt: ' + corrupt_str)
                        if mo is not None:
                            newfilename = re.sub(corruptRegex, '_[' + corrupt_str + ']', fname) + ext
                        else:
                            newfilename = os.path.splitext(fname)[0] + '_[' + corrupt_str + ']' + ext
                    else:
                        print('OK!')
                        if mo is not None:
                            newfilename = re.sub(corruptRegex, '', fname) + ext

                    # remove group index from name if present, this will be assigned in the next step if needed
                    newfilename = newfilename if newfilename is not None else fname
                    mo = groupRegex.search(newfilename)
                    if mo is not None:
                        newfilename = re.sub(groupRegex, '', newfilename)

                    if hashes_tmp:
                        # set function unduplicates flattened list
                        hashes.append(set([item for sublist in hashes_tmp for item in sublist[:-1]]))

                    filelist.append([root,fname,newfilename, len(hashes_tmp)])


            print('******** Grouping similar images... ************')
            if len(hashes) > 1:
                scores = []
                for h_idx, hash in enumerate(hashes):
                    i = h_idx + 1
                    while i < len(hashes):
                        ham_dist = 1
                        for h1 in hash:
                            for h2 in hashes[i]:
                                ham_dist = min(ham_dist, (sum(bool(ord(ch1) - ord(ch2)) for ch1, ch2 in zip(h1, h2)))/len(h1))
                        scores.append(ham_dist)
                        i = i + 1
                X = np.array(scores)

                linkage = fastcluster.single(X)
                w_hash_idx = [el_idx for el_idx, el in enumerate(filelist) if el[3] > 0]
                w_hash = [filelist[i] for i in w_hash_idx]

                test=ct(linkage,[el[2] for el in w_hash],.2)
                for i, prfx in enumerate(test):
                    curfilename = w_hash[i][2]

                    mo = groupRegex.search(curfilename)
                    newfilename = None

                    if prfx is not None:
                        if mo is not None:
                            newfilename = re.sub(groupRegex, '[' + prfx + ']_', curfilename)
                        else:
                            newfilename = '[' + prfx + ']_' + curfilename
                    else:
                        if mo is not None:
                            newfilename = re.sub(groupRegex, '', curfilename)

                #    if newfilename is not None:
                    filelist[w_hash_idx[i]][2] = newfilename if newfilename is not None else curfilename

                #fig = plt.figure(figsize=(25,25))
                #plt.title(root)
                #plt.xlabel('perpetual hash hamming distance')

                #plt.axvline(x=.15,c='red',linestyle='--')
                #dg = dendrogram(linkage, labels=[el[2] for el in w_hash], orientation='right', show_leaf_counts=True)
                #ax = fig.gca()
                #ax.set_xlim(-.02,ax.get_xlim()[1])
                #plt.show
                #plt.savefig(os.path.join(root,'dendrogram.pdf'), bbox_inches='tight', dpi=100)
                w_hash.clear()
                w_hash_idx.clear()
            print('******** Renameing file if applicable... ************')
            for fr in filelist:
                if fr[1] != fr[2]:
                    #print(fr[1] + ' -- ' + fr[2])
                    path = fr[0]
                    os.rename(os.path.join(path,fr[1]),os.path.join(path,fr[2]))


            filelist.clear()

    duration = datetime.now() - start_time
    days    = divmod(duration.total_seconds(), 86400)        # Get days (without [0]!)
    hours   = divmod(days[1], 3600)               # Use remainder of days to calc hours
    minutes = divmod(hours[1], 60)                # Use remainder of hours to calc minutes
    seconds = divmod(minutes[1], 1)               # Use remainder of minutes to calc seconds
    print("Time to complete: %d days, %d:%d:%d" % (days[0], hours[0], minutes[0], seconds[0]))