pythonscipyexpectation-maximization

multivariate_normal.pdf() returns zeros?


I am trying to use Expectation Maximization (EM) to classify my data. But due to my limited experience with Python and algorithms, when I writing the E-step of the algorithm, the method multivariate_normal.pdf() returns zeros and I have no idea why does this happen and how do I fix it.

My parameters set:

n_clusters = 2
n_points = len(X)
m1 = X[:,0].mean() 
m2 = X[:,1].mean() 
Mu = [[m1, m2*1.1], [m1, m2*0.9]]
Var = [[1, 1], [1, 1]]
Pi = [1 / n_clusters] * 2
W = np.ones((n_points, n_clusters)) / n_clusters 
Pi = W.sum(axis=0) / W.sum()

E-step: enter image description here

Function for E-step:

def update_W(X, Mu, Var, Pi):
    n_points, n_clusters = len(X), len(Pi)
    pdfs = np.zeros(((n_points, n_clusters)))
    for i in range(n_clusters):
        pdfs[:, i] = Pi[i] * multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))
    W = pdfs / pdfs.sum(axis=1).reshape(-1, 1)
    return W

The problem is that the code

multivariate_normal.pdf(X, Mu[i], np.diag(Var[i]))

returns zeros and it does not make sense to the algorithm.

My data is here:

X = np.array([[5.04729, 37744.57848238945],
 [5.0008, 35899.206357479095],
 [4.98011, 33056.83955574036],
 [4.9931, 33704.46286725998],
 [4.96448, 32990.76251792908],
 [4.99008, 32283.805765628815],
 [5.04159, 34409.80894064903],
 [5.00075, 35270.9743578434],
 [5.00899, 36284.97508478165],
 [4.97215, 35458.088497161865],
 [4.9763, 35349.08898703754],
 [5.02148, 38928.51481676102],
 [4.96585, 33278.32470035553],
 [5.02545, 41023.1229429245],
 [5.00402, 39417.85598278046],
 [5.02445, 38158.13270044327],
 [5.02238, 36516.55680179596],
 [4.98606, 34646.19884157181],
 [5.0033, 50677.906705379486],
 [5.02733, 49949.25080680847],
 [5.02716, 39155.43921852112],
 [5.04183, 39191.40325546265],
 [4.96896, 38266.83190345764],
 [4.99652, 36774.488584041595],
 [5.03478, 33207.51669681072],
 [5.03832, 34724.067808151245],
 [5.02347, 32981.52464199066],
 [4.96275, 33158.869076013565],
 [4.97127, 33810.3498980999],
 [5.01482, 35844.073690891266],
 [4.97365, 36869.271273851395],
 [5.00685, 34672.39159619808],
 [4.98033, 34688.13522028923],
 [5.00265, 32825.98570096493],
 [5.00083, 34160.511184334755],
 [5.01027, 33884.14221894741],
 [4.95678, 38457.50146174431],
 [4.96867, 36320.28945875168],
 [5.00721, 33037.772767663],
 [5.04357, 33117.379173755646],
 [5.0004, 34557.091692984104],
 [5.02337, 34421.044409394264],
 [5.01924, 36974.69707453251],
 [5.04621, 38812.16114796698],
 [4.96225, 33006.868394851685],
 [5.03166, 33958.025827884674],
 [4.95124, 34349.11486387253],
 [5.02567, 34259.3874104023],
 [5.04868, 41895.62332487106],
 [5.00072, 34009.417558074],
 [5.04818, 35927.81018912792],
 [4.97368, 33997.19724833965],
 [4.97671, 33999.519283890724],
 [4.95128, 31926.064946889877],
 [4.95742, 33177.98467195034],
 [4.9803, 36231.32753157616],
 [4.99018, 35190.21499049879],
 [5.00053, 34551.94658563426],
 [4.96793, 33994.06255364418],
 [4.98345, 32120.49098587036],
 [5.04564, 30271.260227680206],
 [4.95421, 30653.037763834],
 [5.03442, 31966.675320148468],
 [4.9722, 33757.07542133331],
 [5.01377, 33405.48862147331],
 [5.03417, 33558.4126021862],
 [5.02044, 30911.697856664658],
 [4.98318, 33883.324236392975],
 [4.99876, 36436.7568192482],
 [4.98475, 34489.98227977753],
 [5.00355, 30205.769625544548],
 [5.03646, 33422.97079265118],
 [5.04884, 34117.2653195858],
 [5.02259, 34993.6039069891],
 [4.9737, 47981.5806927681],
 [5.00593, 56685.62618494034],
 [5.00434, 54395.169895286555],
 [4.95768, 35611.35310435295],
 [5.02978, 37536.10489606857],
 [4.95214, 35577.900700092316],
 [4.99925, 33014.68128633499],
 [5.03029, 29865.81333065033],
 [4.95128, 28760.52792918682],
 [5.02051, 37038.269605994225],
 [5.04974, 36642.584582567215],
 [4.96896, 29041.81211209297],
 [4.97972, 37114.91027057171],
 [4.99165, 36424.69217658043],
 [4.95779, 33859.12481832504],
 [5.01495, 32957.517973423004],
 [4.95671, 32964.28135430813],
 [5.00732, 34051.169529914856],
 [4.98396, 32881.329072237015],
 [5.02704, 34020.69495886564],
 [4.98323, 34851.629052996635],
 [5.02463, 34762.618599653244],
 [4.95826, 31249.48672771454],
 [5.02317, 35223.82992994785],
 [4.98121, 31181.02133178711],
 [5.00584, 28375.87954646349],
 [4.98506, 31617.50852251053],
 [5.02145, 31114.64755296707],
 [5.03362, 31182.910351395607],
 [5.03503, 30093.33342540264],
 [4.95797, 36398.88142120838],
 [4.97227, 35468.531022787094],
 [5.00365, 35569.601973861456],
 [4.96697, 33233.918072104454],
 [5.01004, 31831.104349255562],
 [5.01838, 36176.26664984226],
 [5.01747, 35596.35544002056],
 [5.0358, 32386.7973613739],
 [4.973, 33052.33617353439],
 [5.03017, 32395.227126598358],
 [4.98371, 34437.29500854015],
 [5.02143, 31013.464814782143],
 [5.02339, 36368.26537466049],
 [5.02942, 36371.447618722916],
 [5.04456, 38037.0499355793],
 [5.02219, 40177.06857061386],
 [4.95168, 37964.482201099396],
 [4.99589, 38579.15203022957],
 [4.99336, 34731.92454767227],
 [4.98333, 34161.41907787323],
 [4.95841, 30232.777291665203],
 [4.95133, 30595.70202767849],
 [5.0252, 30708.32647585869],
 [5.04505, 35107.191153526306],
 [4.98019, 35513.3441824913],
 [4.97158, 30879.74526977539],
 [4.96096, 33175.533081412315]])

Thanks in advance!


Solution

  • Let's have a look at some actual numbers:

    >>> Mu[0]
    [5.000155496183207, 38591.28312042943]
    >>> Var[0]
    [1, 1]
    >>> X[:5]
    array([[5.04729, 37744.5785],
           [5.00080, 35899.2064],
           [4.98011, 33056.8396],
           [4.99310, 33704.4629],
           [4.96448, 32990.7625]])
    

    Now, let's try the following value and see what we get:

    >>> multivariate_normal.pdf([5, 38561.5785], Mu[0], np.diag(Var[0]))
    3.970168119488602e-193
    

    You get a really really low probability value but at least it's not zero.

    Let's ignore the first dimension for a second as it isn't the problematic one: 38561 is about 30 standard deviations away from 38591.

    For a univariate normal distribution we know that the probability of being more than 3 standard deviations away from the mean is just 0.3% and it decreases very very fast:

    enter image description here

    Look at the values stored in X - they are much much further away from Mu[0] than 30 standard deviations. These values are so unlikely to come from this distribution that as far as Python is concerned, the probability of that happening is 0.0.

    A better idea would be to start with the actual covariance of your data:

    Var = np.cov(X.T) + 1
    Var = [Var, Var]
    

    Then suddenly you get decent probabilities :)

    P.S. I am adding 1 to the covariance matrix to avoid singularity.