I am finding A[i..j]
that is the most similar to B.
Here calcSimilarity
is function that returns similarity of two arrays.
Similarity is calculated as
Not than brute force search, I want to know what kind of data structure and algorithm is efficient in range search.
SAMPLE input/output
input: A: [(10,1), (20,1), (-200,2), (33,1), (42,1), (58,1)] B:[(20,1), (30,1), (1000,2)]
output: most similar Range is [1, 3]
match [20, 33] => [20, 30]
This is brute force search code.
struct object{
int type, value;
}A[10000],B[100];
int N, M;
int calcSimilarity(object X[], n, object Y[], m){
if(n > m) return calcSimilarity(Y, m, X, n);
for(all possible match){//match is (i, link[i])
int minDif = 0x7ffff;
int count = 0;
for( i = 0; i< n; i++){
int j = link[i];
int similar = similar(X[i], Y[j]);
minDif = min(similar, minDif);
}
}
if(count == 0) return 0x7fffff;
return minDif/pow(count,3);
}
find_most_similar_range(){
int minSimilar = 0x7fffff, minI, minJ;
for( i = 0; i < N; i ++){
for(j = i+1; j < N; j ++){
int similarity = calcSimilarity(A + i, j-i, B, M);
if (similarity < minSimilar)
{
minSimilar = similarity;
minI= i;
minJ = j;
}
}
}
printf("most similar Range is [%d, %d]", minI, minJ);
}
it will take O((N^M) * (N^2)).
That looks like the Big-O of the find similarity is N^2. With the pairwise comparison of each element.
So it looks more like
The pairwise comparison is M*(M-1). Each list has to be tested against each other list or about M^2.
This is a problem which has been solved for clustering, and there are data structures (e.g. Metric Tree), which allow the distances between similar objects to be stored in a tree.
When looking for the N closest neighbours, the search of this tree limits the number of pairwise comparisons needed and results in a O( ln(M) ) form
The downside of this particular tree, is the similarity measure needs to be metric. Where the distance between A and B, and the distance between B and C allows inferences to be made about the distance range of A and C.
If your similarity measure is not metric, then this can't be done.
Jaccard distance is a metric of distance which allows it to be placed in a Metric tree.