I want to calculate the mutual information between two numpy vectors:
>>> from sklearn.metrics.cluster import mutual_info_score
>>> import numpy as np
>>> a, b = np.random.rand(10), np.random.rand(10)
>>> mutual_info_score(a, b)
1.6094379124341005
>>> a, b = np.random.rand(10), np.random.rand(10)
>>> mutual_info_score(a, b)
1.6094379124341005
As you can see, although I updated a and b, it returned the same value. Then I tried another example:
>>> a = np.array([167.52523295, 73.2904335, 98.61953303, 152.17297007,
...               211.01341451, 327.72296346, 356.60500081, 43.9371432,
...               119.09474284, 125.20180842])
>>> b = np.array([280.9287028, 131.76304983, 176.0277832, 188.56630096,
...               229.09811401, 228.47200012, 617.67000122, 52.7211511,
...               125.95361582, 148.55247447])
>>> mutual_info_score(a, b)
2.302585092994046
>>> a = np.array([6.71381009, 1.43607653, 3.78729242, -4.75706796, -3.81281173,
...               3.23440092, 10.84495625, -0.19646145, 4.09724507, -0.13858104])
>>> b = np.array([4.25330873, 3.02197642, -3.2833848, 0.41855662, -3.74693531,
...               0.7674982, 11.36459148, 0.64636462, 0.51817262, 1.65318943])
>>> mutual_info_score(a, b)
2.302585092994046
Why? Look at how different those numbers are. Why does it return the same value? More importantly, how do I calculate the MI between two vectors?
Were the MI actually estimated from the values, you would obtain different numbers each time you run the cell. Here, however, you're using a method that is meant for measuring the quality of clustering results: mutual_info_score expects discrete cluster labels, and it treats every distinct float as its own label, so the result depends only on the pattern of unique values, not on the numbers themselves.
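A minimal sketch of why the value is constant (the arrays below are arbitrary; only the fact that every entry is unique matters): with all-distinct values, the contingency table that mutual_info_score builds is a permutation matrix, so the MI is always ln(n) for n samples, which is exactly the 2.302585... = ln(10) that your 10-element arrays produce:

import numpy as np
from sklearn.metrics.cluster import mutual_info_score

# Each distinct float becomes its own "cluster label", so any two
# all-unique arrays of length n yield MI = ln(n).
a = np.array([167.5, 73.3, 98.6, 152.2])
b = np.array([280.9, 131.8, 176.0, 188.6])
print(mutual_info_score(a, b))  # ln(4) ~ 1.3862943611198906
print(np.log(4))                # same value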
Now, to the main point: to estimate the mutual information (MI) between two vectors (or even several vectors), you can use the mutual_info_regression function (as described here):
In [1]: import numpy as np
   ...: from sklearn.feature_selection import mutual_info_regression

In [2]: a, target = np.random.rand(10, 3) + 300, np.random.rand(10)

In [3]: mi = mutual_info_regression(a, target)

In [4]: mi
Out[4]: array([0.18373016, 0.19396825, 0.09634921])
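A caveat about the session above: mutual_info_regression perturbs continuous features with a little random noise before its k-nearest-neighbors estimation, so the printed values vary slightly between runs. If you need reproducibility, fix the seed (a sketch reusing a and target from above; random_state=0 is an arbitrary choice):

mi = mutual_info_regression(a, target, random_state=0)  # reproducible across runs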
In the session above, I calculated the MI between each feature of a and the target. E.g., the MI between the first feature and the target is ~0.184. There are various ways to estimate MI between variables, e.g.:
estimate mutual information (MI) with histograms, e.g.:

import numpy as np
from sklearn.metrics import mutual_info_score

def MI(x, y, bins):
    # the joint histogram acts as the contingency table of the binned variables
    c_xy = np.histogram2d(x, y, bins)[0]
    # mutual_info_score can compute MI directly from a precomputed contingency matrix
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
The challenge is finding a suitable value for the number of bins here (see the sketch after this list). [1]
estimate entropy from k-nearest-neighbor distances (mutual_info_regression is based on this approach)
etc.
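As a quick illustration of that bin-sensitivity (a sketch with made-up data; MI is the helper defined above), the histogram estimate inflates as the number of bins grows, even on the same sample:

import numpy as np
from sklearn.metrics import mutual_info_score

def MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    return mutual_info_score(None, None, contingency=c_xy)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(scale=0.5, size=1000)  # y is strongly correlated with x

for bins in (5, 10, 50, 200):
    # finite-sample bias: more bins typically means a larger MI estimate
    print(bins, MI(x, y, bins))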
P.S. Reading this document is worthwhile.