java image-processing duplicates cbir phash

Using pHash to search against a huge image database: what is the best approach?


I need to search a huge image database for possible duplicates using pHash. Assume every image record already has a hash code generated with pHash.

Now I have to compare a new image against the existing records, so I create a hash for it using pHash. But as I understand it, the hash comparison is NOT as straightforward as

hash1 - hash2 < threshold

It looks like I need to pass both hash codes into a pHash API to do the matching. So I would have to retrieve all hash codes from the DB in batches and compare them one by one using the pHash API.

But this does not look like the best approach if I have about 1000 images in the queue to be compared against millions of already existing images.

I need to know the following:

  1. Is my understanding of, and approach to, using pHash to compare against an existing image DB correct?
  2. Is there a better approach to handle this (without using CBIR libraries like LIRE)?
  3. I heard that there is an algorithm called dHash which can also be used for image comparison with hash codes. Are there any Java libraries for it, and can it be used together with pHash to optimize this kind of large-scale, repeated image processing task?

Thanks in advance.


Solution

  • I think part of this question has been discussed on the pHash support forum.

    You will need to use the mvptree storage mechanism. An MVP (multi-vantage-point) tree is a metric index, so a query hash does not have to be compared against every stored hash one by one.

    http://lists.phash.org/htdig.cgi/phash-support-phash.org/2011-May/000122.html and http://lists.phash.org/htdig.cgi/phash-support-phash.org/2010-October/000103.html
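
To illustrate the idea of a metric index in plain Java, here is a minimal sketch. It is not the pHash mvptree API. It swaps in a BK-tree, a simpler metric tree, and it assumes the stored hashes are 64-bit pHash DCT image hashes that are compared by Hamming distance (the number of differing bits), each held in a Java long. The class name HammingBKTree and its methods are hypothetical names chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a BK-tree over 64-bit hashes with Hamming distance as the metric. */
final class HammingBKTree {

    private static final class Node {
        final long hash;
        // Child subtrees keyed by their Hamming distance to this node's hash.
        final Map<Integer, Node> children = new HashMap<>();
        Node(long hash) { this.hash = hash; }
    }

    private Node root;

    /** Number of differing bits; works on the raw bits, so the sign of the long is irrelevant. */
    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Inserts a hash. Exact duplicates are skipped in this sketch. */
    void add(long hash) {
        if (root == null) {
            root = new Node(hash);
            return;
        }
        Node current = root;
        while (true) {
            int d = hamming(hash, current.hash);
            if (d == 0) {
                return; // already indexed
            }
            Node child = current.children.get(d);
            if (child == null) {
                current.children.put(d, new Node(hash));
                return;
            }
            current = child;
        }
    }

    /** Returns all stored hashes within 'threshold' differing bits of 'query'. */
    List<Long> search(long query, int threshold) {
        List<Long> matches = new ArrayList<>();
        if (root == null) {
            return matches;
        }
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = hamming(query, node.hash);
            if (d <= threshold) {
                matches.add(node.hash);
            }
            // Triangle inequality: only subtrees whose edge distance lies in
            // [d - threshold, d + threshold] can contain a match.
            for (int k = Math.max(1, d - threshold); k <= d + threshold; k++) {
                Node child = node.children.get(k);
                if (child != null) {
                    stack.push(child);
                }
            }
        }
        return matches;
    }
}
```

A real index would store the image record ID next to each hash instead of dropping exact duplicates. The threshold is something to tune on your own data. With millions of hashes the tree would normally be built once and kept in memory, so each of the 1000 queued images costs one tree search instead of a full scan of the table.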