[SOLVED] Near-Duplicate Image Detection

What's a fast way to sort a given set of images by their similarity to each other?

At the moment I have a system that does histogram analysis between two images, but this is a very expensive operation and seems like overkill.

Ideally, I am looking for an algorithm that would give each image a score (for example an integer score, such as the RGB average) so that I can just sort by that score. Identical scores, or scores next to each other, are possible duplicates.

0299393
0599483
0499994 <- possible dupe
0499999 <- possible dupe
1002039
4995994
6004994
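To make the idea concrete, here is a minimal sketch of the sort-and-compare step, using the scores listed above. The file names, the function name possible_duplicates, and the tolerance value are hypothetical, chosen only for illustration.

```python
def possible_duplicates(scores, tolerance):
    """Sort images by score and return neighbouring pairs whose scores
    differ by at most `tolerance`."""
    ordered = sorted(scores.items(), key=lambda item: item[1])
    pairs = []
    # After sorting, only adjacent entries need to be compared.
    for (name_a, score_a), (name_b, score_b) in zip(ordered, ordered[1:]):
        if score_b - score_a <= tolerance:
            pairs.append((name_a, name_b))
    return pairs


# Scores from the list above; the file names are made up.
scores = {
    "a.png": 299393,
    "b.png": 599483,
    "c.png": 499994,
    "d.png": 499999,
    "e.png": 1002039,
    "f.png": 4995994,
    "g.png": 6004994,
}

print(possible_duplicates(scores, tolerance=10))
```

Sorting costs O(n log n) and the scan is linear, which is why a single sortable score is attractive compared to comparing every pair of images.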

RGB average per image works poorly. Is there something similar that does better?

Solution

There has been a lot of research on image searching and similarity measures. It's not an easy problem. In general, a single int won't be enough to determine if images are very similar. You'll have a high false-positive rate.
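A tiny illustration of why a single averaged value gives false positives: the two grayscale grids below look nothing alike, yet they have the same mean. The helper name mean_intensity and the pixel values are made up for this example.

```python
def mean_intensity(pixels):
    """Integer mean of a 2D grid of grayscale values."""
    flat = [value for row in pixels for value in row]
    return sum(flat) // len(flat)


# A high-contrast checkerboard and a uniform gray patch.
checkerboard = [[0, 254], [254, 0]]
flat_gray = [[127, 127], [127, 127]]

print(mean_intensity(checkerboard), mean_intensity(flat_gray))
```

Any score that collapses the whole image into one number discards the spatial layout, so unrelated images can land on the same value.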

However, since there has been a lot of research done, you might take a look at some of it. For example, this paper (PDF) gives a compact image fingerprinting algorithm that is suitable for finding duplicate images quickly and without storing much data. It seems like this is the right approach if you want something robust.
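As a rough idea of what a compact fingerprint can look like, here is a sketch of a simple "average hash". This is not the algorithm from the paper mentioned above; it is a much simpler, commonly used technique, swapped in for illustration. It assumes the image is already available as a 2D list of grayscale values at least 8 pixels on each side. The function names are my own.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a grayscale pixel grid.

    The grid is reduced to size x size cells by block averaging; each
    bit records whether a cell is brighter than the mean of all cells.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for i in range(size):
        r0, r1 = i * height // size, (i + 1) * height // size
        for j in range(size):
            c0, c1 = j * width // size, (j + 1) * width // size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Synthetic 16x16 test images: a horizontal gradient, a slightly
# brighter copy of it, and its inverse.
gradient = [[col * 16 for col in range(16)] for _ in range(16)]
brighter = [[min(255, value + 10) for value in row] for row in gradient]
inverted = [[255 - value for value in row] for row in gradient]

print(hamming_distance(average_hash(gradient), average_hash(brighter)))
print(hamming_distance(average_hash(gradient), average_hash(inverted)))
```

Note that fingerprints like this are compared by Hamming distance, not by sorting them as integers: a small distance suggests a near-duplicate, a large one suggests different images. In practice you would load and grayscale real images with an image library first; that step is left out here.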

If you're looking for something simpler, but definitely more ad-hoc, this SO question has a few decent ideas.