Introduction
I would like to assess the similarity between two "bin counts" arrays (related to two histograms) by using the MATLAB pdist2 function:
% Input
bin_counts_a = [689 430 311 135 66 67 99 23 37 19 8 4 3 4 1 3 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1];
bin_counts_b = [569 402 200 166 262 90 50 16 33 12 6 35 49 4 12 8 8 2 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1];
% Visualize the two "bin counts" vectors as grouped bars (transpose so each bin gets one group of two bars):
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b]')
% Calculation of similarities
cosine_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
% Output
cosine_similarity =
0.95473215802008
jaccard_similarity =
0.0769230769230769
Question
If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the jaccard similarity be closer to 1 as well?
The 'jaccard' measure, according to the documentation, only considers the "percentage of nonzero coordinates that differ", but not by how much they differ.
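To see what that means concretely, here is a rough sketch that reproduces pdist2's 'jaccard' distance by hand for your two original vectors (the helper names union_nonzero and differing are mine, not part of pdist2):
% Sketch: pdist2's 'jaccard' distance computed by hand
union_nonzero = (bin_counts_a ~= 0) | (bin_counts_b ~= 0);    % coordinates where at least one vector is nonzero
differing = (bin_counts_a ~= bin_counts_b) & union_nonzero;   % of those, the coordinates whose values differ
jaccard_distance = sum(differing) / sum(union_nonzero);       % "percentage of nonzero coordinates that differ"
jaccard_similarity = 1 - jaccard_distance                     % reproduces the 0.0769... value reported above
Only 2 of the 26 nonzero coordinates are exactly equal, so the similarity is 2/26, no matter how close the other bin counts are.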
For instance, assume bin_counts_a as in your example and
bin_counts_b = bin_counts_a + 1;
Then
>> cosine_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
cosine_similarity =
0.999971577948095
is almost 1, as expected, because the bin counts are very similar. However,
>> jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
jaccard_similarity =
0
gives 0, because each entry in bin_counts_b is (slightly) different from the corresponding entry in bin_counts_a.
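In contrast, the 'cosine' distance compares the actual values: the similarity is the normalized dot product of the two vectors, so small per-bin differences hardly change it. A by-hand sketch, equivalent to 1 - pdist2(...,'cosine'):
% Sketch: cosine similarity computed by hand
cosine_similarity = dot(bin_counts_a,bin_counts_b) / (norm(bin_counts_a)*norm(bin_counts_b))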
For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions and is not computed by pdist2.
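If you want to experiment with the Kullback-Leibler divergence anyway, a minimal sketch would be to normalize the bin counts to probability distributions and add a small offset so that empty bins do not lead to log(0) (the smoothing constant eps_val below is my own arbitrary choice, not a standard value):
% Sketch: Kullback-Leibler divergence between the two histograms
eps_val = 1e-10;                               % small offset to avoid log(0) for empty bins
p = bin_counts_a + eps_val; p = p / sum(p);    % normalize to a probability distribution
q = bin_counts_b + eps_val; q = q / sum(q);
kl_pq = sum(p .* log(p ./ q))                  % D_KL(p||q); note D_KL(q||p) is generally different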