c++hashlocality-sensitive-hashphash

Locality Sensitive Hash or pHash?


I'm trying to implement a general fingerprint memoizator: we have a file that can be expressed through an intelligent fingerprint (like pHash for images or chromaprint for audio) and if our desidered (expensive) function has already been computed on a similar file, then we return the same result (avoiding expensive computation).

Locality Sensitive Hash (LSH) is popular and well-performant solution for the Approximate nearest neighbor problem in an expensive multi-dimensional space.

pHash is a good library which implements perceptual hashing for images.

So pHash transform a multi-dimensional input (an image) to a one-dimensional object (an hash code), which is something different from LSH (again, multi-dimensional objects in LSH).

So I'm wondering how we could implement a mono-dimensional LSH for pHash hash values? Or in a few words: how we can group in bins similar pHash values? Could it be alternative to the classic LSH approach (and if not why)?


Solution

  • You could use n random projections to split pHash space into 2^n buckets, then similar images are most likely found from the same bucket. You could even XOR the hash with all 64 possible integers with Hamming weight 1 to check neighboring buckets conveniently and be sure you'd find all approximate matches.

    This is efficient only if you are interested on images with almost identical hashes (small Hamming distance). If you want to tolerate larger hamming distances (such as 8) then it gets tricky to find all matches efficiently and accurately. I got very good performance by scanning through the whole table by GPU, even my 3 year old laptop's GT 650M could check 700 million hashes / second!

    Edit 1: You can think 64-bit hash as a single corner on a 64-dimensional cube, math is easier if your normalize corner coordinates to -1 and 1 (this way its center is in the origin). You can express m images as a matrix M of size m x 64 (one row / image, one bit of hash / column).

    Simplest way to split this to 2^n distinct groups is to generate n 64-dimensional vectors v_0, v_1, ..., v_n (pick each vector element from normal distribution N(0,1)), this can be expressed as a matrix V of size 64 x n (one column / vector). There could be orthogonality enforcement as mentioned at Random projection but I'll skip it here.

    Now by calculating A = (M * V) > 0 you get m x n matrix (one image / row, one projection / column). Next convert each row's binary representation to a number, you get 2^n different possibilities and similar hashes are most likely to end up to the same bucket.

    This algorithm works for any orthogonal representation of data (such as SURF features), not just binary strings. I'm sure there are simpler (and computationally more efficient) algorithms for binary hashes but this is one way to implement random projections.

    I suggested XORring because if images don't have identical hashes then they aren't guaranteed to end up in the same bucket. By checking all possible small deviations from the original hash you can see which other bins are possible for likely matches.

    In a way this is similar how a computer game engine might split the 2D map into a grid of cells of size x, then to find all units within a radius x from a point you only need to check 9 cells (the one containing the point + 8 surrounding cells) to get a 100% accurate answer.