I'm looking for ways to find duplicate images by fingerprinting. I understand that this is done by applying hash functions to images, so that each distinct image gets its own hash value.
I am fairly new to image processing and don't know much about hashing. How exactly am I supposed to apply hash functions and generate hash values?
Thanks in advance
You need to be careful with hashing. Some image formats, such as JPEG and PNG, store dates/times and other information within the file, and that will make two identical images appear to be different to normal tools such as md5 and cksum.
Here is an example. At the command line in Terminal, use ImageMagick to make two identical 128x128 red squares:
convert -size 128x128 xc:red a.png
convert -size 128x128 xc:red b.png
Now check their MD5 sums:
md5 [ab].png
MD5 (a.png) = b4b82ba217f0b36e6d3ba1722f883e59
MD5 (b.png) = 6aa398d3aaf026c597063c5b71b8bd1a
Or their checksums:
cksum [ab].png
4158429075 290 a.png
3657683960 290 b.png
Oops, they are different according to both md5 and cksum. Why? Because the dates stored in the two files are 1 second apart.
I would suggest you use ImageMagick to checksum "just the image data" and not the metadata - unless, of course, the date is important to you:
identify -format %# a.png
e74164f4bab2dd8f7f612f8d2d77df17106bac77b9566aa888d31499e9cf8564
identify -format %# b.png
e74164f4bab2dd8f7f612f8d2d77df17106bac77b9566aa888d31499e9cf8564
Now both signatures are the same, because the image data is the same - only the metadata differs.
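If you want to see the same effect without ImageMagick, here is a minimal Python sketch using only the standard library. It is not how ImageMagick computes its signature; it just builds two tiny red PNGs by hand that differ only in the seconds field of their tIME chunk, then compares a whole-file MD5 with a digest taken over the image data alone. The helper names (png_chunk, make_red_png, pixel_digest) are made up for this illustration. One simplification to be aware of: it hashes the decompressed IDAT stream, which still contains the per-row filter bytes, so for files written by different encoders you would want to decode to raw pixels with a proper imaging library first.

```python
import hashlib
import struct
import zlib


def png_chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type+data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_red_png(size: int, second: int) -> bytes:
    """Build a solid red RGB PNG whose tIME chunk has the given seconds value."""
    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)  # 8-bit RGB
    row = b"\x00" + b"\xff\x00\x00" * size  # filter byte 0, then red pixels
    idat = zlib.compress(row * size)
    time = struct.pack(">HBBBBB", 2015, 1, 1, 12, 0, second)  # arbitrary date
    return (
        b"\x89PNG\r\n\x1a\n"
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"tIME", time)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )


def pixel_digest(png: bytes) -> str:
    """SHA-256 over IHDR plus the decompressed IDAT data; other chunks are ignored."""
    digest = hashlib.sha256()
    compressed = b""
    pos = 8  # skip the 8-byte PNG signature
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if kind == b"IHDR":
            digest.update(data)  # dimensions and colour type
        elif kind == b"IDAT":
            compressed += data  # IDAT chunks form one zlib stream
        pos += 12 + length  # length + type + data + CRC
    digest.update(zlib.decompress(compressed))
    return digest.hexdigest()


a = make_red_png(128, second=0)
b = make_red_png(128, second=1)
print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())  # False
print(pixel_digest(a) == pixel_digest(b))  # True
```

The whole-file hashes disagree because one metadata byte differs, while the data-only digests agree.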
Of course, you may be more interested in "Perceptual Hashing" where you just get an idea if two images "look similar". If so, look here.
Or you may be interested in allowing slight differences in brightness, or orientation, or cropping - which is another topic altogether.
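To give a feel for what a perceptual hash does, here is a rough Python sketch of one simple variant, usually called an "average hash". It assumes you have already shrunk the image to a small grayscale grid (8x8 is common) with whatever imaging tool you prefer; that step is not shown, and the function names are just for illustration. Each pixel contributes one bit, set when the pixel is brighter than the mean, and two images are compared by counting the bits that differ.

```python
from typing import List


def average_hash(pixels: List[List[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the grid mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# A tiny 4x4 "image": dark on the left, bright on the right.
original = [[0, 50, 200, 250] for _ in range(4)]
brighter = [[value + 5 for value in row] for row in original]
inverted = [[255 - value for value in row] for row in original]

print(hamming_distance(average_hash(original), average_hash(brighter)))  # 0
print(hamming_distance(average_hash(original), average_hash(inverted)))  # 16
```

A small uniform brightness change leaves the hash untouched, while the inverted image differs in every bit. In practice you would treat images as "similar" when the distance falls below some threshold you pick by experiment.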