I am trying to use Expectation Maximization (EM) to classify my data, but due to my limited experience with Python and algorithms, when I write the E-step of the algorithm, the method multivariate_normal.pdf() returns zeros. I have no idea why this happens or how to fix it.
My parameter setup:
import numpy as np
from scipy.stats import multivariate_normal

n_clusters = 2
n_points = len(X)
m1 = X[:,0].mean()
m2 = X[:,1].mean()
Mu = [[m1, m2*1.1], [m1, m2*0.9]]
Var = [[1, 1], [1, 1]]
Pi = [1 / n_clusters] * 2
W = np.ones((n_points, n_clusters)) / n_clusters
Pi = W.sum(axis=0) / W.sum()
Function for the E-step:
def update_W(X, Mu, Var, Pi):
    n_points, n_clusters = len(X), len(Pi)
    pdfs = np.zeros((n_points, n_clusters))
    for i in range(n_clusters):
        pdfs[:, i] = Pi[i] * multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))
    W = pdfs / pdfs.sum(axis=1).reshape(-1, 1)
    return W
The problem is that the call
multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))
returns all zeros, which breaks the algorithm: the normalization in update_W then divides 0 by 0.
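To reproduce the symptom in isolation (a minimal sketch with made-up numbers on the same scale as my data, not the full dataset), note that with unit variance a point thousands of units from the mean has a density below the smallest representable positive float:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Unit-variance Gaussian centered roughly where Mu[0] lands for my data
mean = [5.0, 38591.0]
cov = np.diag([1, 1])

# A point ~4000 units (= ~4000 standard deviations) away in the second
# dimension: the density underflows to exactly 0.0
p = multivariate_normal.pdf([5.0, 34591.0], mean, cov)
print(p)  # 0.0
```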
My data is here:
X = np.array([[5.04729, 37744.57848238945],
[5.0008, 35899.206357479095],
[4.98011, 33056.83955574036],
[4.9931, 33704.46286725998],
[4.96448, 32990.76251792908],
[4.99008, 32283.805765628815],
[5.04159, 34409.80894064903],
[5.00075, 35270.9743578434],
[5.00899, 36284.97508478165],
[4.97215, 35458.088497161865],
[4.9763, 35349.08898703754],
[5.02148, 38928.51481676102],
[4.96585, 33278.32470035553],
[5.02545, 41023.1229429245],
[5.00402, 39417.85598278046],
[5.02445, 38158.13270044327],
[5.02238, 36516.55680179596],
[4.98606, 34646.19884157181],
[5.0033, 50677.906705379486],
[5.02733, 49949.25080680847],
[5.02716, 39155.43921852112],
[5.04183, 39191.40325546265],
[4.96896, 38266.83190345764],
[4.99652, 36774.488584041595],
[5.03478, 33207.51669681072],
[5.03832, 34724.067808151245],
[5.02347, 32981.52464199066],
[4.96275, 33158.869076013565],
[4.97127, 33810.3498980999],
[5.01482, 35844.073690891266],
[4.97365, 36869.271273851395],
[5.00685, 34672.39159619808],
[4.98033, 34688.13522028923],
[5.00265, 32825.98570096493],
[5.00083, 34160.511184334755],
[5.01027, 33884.14221894741],
[4.95678, 38457.50146174431],
[4.96867, 36320.28945875168],
[5.00721, 33037.772767663],
[5.04357, 33117.379173755646],
[5.0004, 34557.091692984104],
[5.02337, 34421.044409394264],
[5.01924, 36974.69707453251],
[5.04621, 38812.16114796698],
[4.96225, 33006.868394851685],
[5.03166, 33958.025827884674],
[4.95124, 34349.11486387253],
[5.02567, 34259.3874104023],
[5.04868, 41895.62332487106],
[5.00072, 34009.417558074],
[5.04818, 35927.81018912792],
[4.97368, 33997.19724833965],
[4.97671, 33999.519283890724],
[4.95128, 31926.064946889877],
[4.95742, 33177.98467195034],
[4.9803, 36231.32753157616],
[4.99018, 35190.21499049879],
[5.00053, 34551.94658563426],
[4.96793, 33994.06255364418],
[4.98345, 32120.49098587036],
[5.04564, 30271.260227680206],
[4.95421, 30653.037763834],
[5.03442, 31966.675320148468],
[4.9722, 33757.07542133331],
[5.01377, 33405.48862147331],
[5.03417, 33558.4126021862],
[5.02044, 30911.697856664658],
[4.98318, 33883.324236392975],
[4.99876, 36436.7568192482],
[4.98475, 34489.98227977753],
[5.00355, 30205.769625544548],
[5.03646, 33422.97079265118],
[5.04884, 34117.2653195858],
[5.02259, 34993.6039069891],
[4.9737, 47981.5806927681],
[5.00593, 56685.62618494034],
[5.00434, 54395.169895286555],
[4.95768, 35611.35310435295],
[5.02978, 37536.10489606857],
[4.95214, 35577.900700092316],
[4.99925, 33014.68128633499],
[5.03029, 29865.81333065033],
[4.95128, 28760.52792918682],
[5.02051, 37038.269605994225],
[5.04974, 36642.584582567215],
[4.96896, 29041.81211209297],
[4.97972, 37114.91027057171],
[4.99165, 36424.69217658043],
[4.95779, 33859.12481832504],
[5.01495, 32957.517973423004],
[4.95671, 32964.28135430813],
[5.00732, 34051.169529914856],
[4.98396, 32881.329072237015],
[5.02704, 34020.69495886564],
[4.98323, 34851.629052996635],
[5.02463, 34762.618599653244],
[4.95826, 31249.48672771454],
[5.02317, 35223.82992994785],
[4.98121, 31181.02133178711],
[5.00584, 28375.87954646349],
[4.98506, 31617.50852251053],
[5.02145, 31114.64755296707],
[5.03362, 31182.910351395607],
[5.03503, 30093.33342540264],
[4.95797, 36398.88142120838],
[4.97227, 35468.531022787094],
[5.00365, 35569.601973861456],
[4.96697, 33233.918072104454],
[5.01004, 31831.104349255562],
[5.01838, 36176.26664984226],
[5.01747, 35596.35544002056],
[5.0358, 32386.7973613739],
[4.973, 33052.33617353439],
[5.03017, 32395.227126598358],
[4.98371, 34437.29500854015],
[5.02143, 31013.464814782143],
[5.02339, 36368.26537466049],
[5.02942, 36371.447618722916],
[5.04456, 38037.0499355793],
[5.02219, 40177.06857061386],
[4.95168, 37964.482201099396],
[4.99589, 38579.15203022957],
[4.99336, 34731.92454767227],
[4.98333, 34161.41907787323],
[4.95841, 30232.777291665203],
[4.95133, 30595.70202767849],
[5.0252, 30708.32647585869],
[5.04505, 35107.191153526306],
[4.98019, 35513.3441824913],
[4.97158, 30879.74526977539],
[4.96096, 33175.533081412315]])
Thanks in advance!
Let's have a look at some actual numbers:
>>> Mu[0]
[5.000155496183207, 38591.28312042943]
>>> Var[0]
[1, 1]
>>> X[:5]
array([[5.04729, 37744.5785],
[5.00080, 35899.2064],
[4.98011, 33056.8396],
[4.99310, 33704.4629],
[4.96448, 32990.7625]])
Now, let's try the following value and see what we get:
>>> multivariate_normal.pdf([5, 38561.5785], Mu[0], np.diag(Var[0]))
3.970168119488602e-193
You get a really, really low density value, but at least it's not zero.
Let's ignore the first dimension for a second, as it isn't the problematic one: 38561 is about 30 standard deviations away from 38591. For a univariate normal distribution, the probability of being more than 3 standard deviations away from the mean is only about 0.3%, and it falls off very fast beyond that.
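You can confirm that 0.3% figure, and see how fast the tail vanishes, with scipy's univariate normal (a quick check, not part of the EM code):

```python
from scipy.stats import norm

# Two-sided probability of landing more than 3 standard deviations out
p_tail = 2 * norm.sf(3)
print(p_tail)  # ~0.0027, i.e. about 0.3%

# At 30 standard deviations the tail probability is astronomically small
print(norm.sf(30))  # positive, but far below 1e-190
```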
Look at the values stored in X - they are much, much further away from Mu[0] than 30 standard deviations. These values are so unlikely to come from this distribution that, as far as Python is concerned, the probability of that happening is exactly 0.0: the density underflows to zero.
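If you do need to work with densities this small, one common workaround (my suggestion, not part of the original question) is to do the E-step in log space with multivariate_normal.logpdf and scipy's logsumexp, so nothing underflows before normalization:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def update_W_log(X, Mu, Var, Pi):
    """E-step computed in log space: immune to pdf underflow."""
    n_points, n_clusters = len(X), len(Pi)
    log_p = np.zeros((n_points, n_clusters))
    for i in range(n_clusters):
        log_p[:, i] = np.log(Pi[i]) + multivariate_normal.logpdf(
            X, Mu[i], np.diag(Var[i]))
    # Normalize the responsibilities without leaving log space
    return np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
```

Even when every raw density would round to 0.0, the log densities stay finite and each row of responsibilities still sums to 1.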
A better idea would be to start with the actual covariance of your data:
Var = np.cov(X.T) + 1
Var = [Var, Var]
Then suddenly you get decent probabilities :)
P.S. I am adding 1 to the covariance matrix to avoid singularity. Note that Var[i] is now a full 2x2 matrix, so in update_W you should pass it to multivariate_normal.pdf as-is; wrapping it in np.diag would keep only the diagonal and discard the off-diagonal covariance.