algorithmtime-complexitysimilaritymetric

fast similarity detection


I have a large collection of objects and I need to figure out the similarities between them.

To be exact: given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size).

I need the ability to quickly find, given an object, the set of objects similar to it.

To be exact: I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d, such that listing the objects in the set takes no more time than if they were in an array or linked list (and perhaps they actually are). Typically, the set will be very much smaller than the total number of objects, so it is really worthwhile to perform this computation. It's good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better.

Have you seen this problem before, or something similar to it? What is a good solution?

To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow - O(n2) where n is the number of objects. Is there a general solution with lower complexity?


Solution

  • Without knowing more details of the metric, it's hard to say. I don't have any ideas for eliminating the O(n^2) aspect, but there may be a way to reduce some of the constants involved. For example, if you had a Euclidean metric d(p,q) = sqrt( (p_1-q_1)^2 + ..+ (p_n-q_n)^2), you could square your distance d and compare it to the partial sums of (p_i-q_i)^2 and stop when you exceed d^2.

    Whether this will actually save you time depends on how expensive the compare is to just calculating the summands and how many summand calculations you could expect to avoid by doing this (obviously, the smaller d is, the better).