python, pymc3, mixture-model, gamma-distribution, hierarchical-bayesian

PyMC3 mixture model: help understanding a model with multiple variables


Let's say I have a dataframe with 4 variables. I want to see if I can generate a posterior over a mixture of gammas across all the variables, with the goal of finding a cluster for each observation. I'm guessing I will need some sort of multivariate gamma distribution? But how would I go about this?

Here is some pymc3 code as an example with a single variable, looking for a mixture of two gammas (I have chosen arbitrary prior parameters):

import numpy as np
import pymc3 as pm

with pm.Model() as m:
    p = pm.Dirichlet('p', a=np.ones(2))

    # priors on the shape and rate parameters of the two components
    alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=2)
    beta = pm.Gamma('beta', alpha=1, beta=1, shape=2)

    comp_dist = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(2,))
    like = pm.Mixture('y', w=p, comp_dists=comp_dist, observed=data)

    trace = pm.sample(1000)

So my question is: how would I extend this basic example to multiple variables? I assume I need to define relationships between the variables somehow to encode them in the model? I feel that I understand the basics of mixture modelling, but at the same time I suspect I am missing something fairly fundamental.


Solution

  • Here's how the multidimensional case should work:

    J = 4 # num dimensions
    K = 2 # num clusters
    
    with pm.Model() as m:
        p = pm.Dirichlet('p', a=np.ones(K))
    
        alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=(J,K))
        beta  = pm.Gamma('beta',  alpha=1, beta=1, shape=(J,K))
        gamma = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(J,K))
    
        like = pm.Mixture('y', w=p, comp_dists=gamma, observed=X, shape=J)
    
        trace = pm.sample(1000)
    

    where X.shape should be (N,J).
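
    As a quick way to sanity-check the shapes, here is a minimal sketch that simulates an X of the right shape; the component parameters below are purely illustrative, not taken from the question:

    import numpy as np

    N, J, K = 500, 4, 2        # observations, dimensions, clusters
    rng = np.random.default_rng(0)

    # pick a cluster label for each observation
    labels = rng.integers(0, K, size=N)

    # arbitrary gamma parameters per (dimension, cluster), for illustration only
    alpha_true = np.array([[2.0, 9.0]] * J)    # shape (J, K)
    beta_true  = np.array([[1.0, 0.5]] * J)    # shape (J, K), rates

    # numpy's gamma takes a scale parameter, i.e. 1/rate
    X = rng.gamma(alpha_true[:, labels].T, 1.0 / beta_true[:, labels].T)    # shape (N, J)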


    Note on Symmetry Breaking

    The difficult part is going to be resolving identifiability issues, but I think that's beyond the scope of the question. Maybe have a look at how the GMM tutorial breaks symmetry using the pm.Potential function. I expect highly-correlated parameterizations of the likelihood function(s), like alpha and beta, would exacerbate the issue, so perhaps consider switching to the mu and sigma parameterization.
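
    For reference, here is a rough sketch of that kind of ordering constraint, adapted from the pm.Potential trick used in the GMM tutorial and combined with the mu/sigma parameterization suggested above. J, K, and X are assumed to be defined as before, and ordering the component means along every dimension is just one possible way to break the symmetry:

    import numpy as np
    import pymc3 as pm
    import theano.tensor as tt

    with pm.Model() as m:
        p = pm.Dirichlet('p', a=np.ones(K))

        # mu/sigma parameterization of the gamma components
        mu    = pm.Gamma('mu',    alpha=1, beta=1, shape=(J, K))
        sigma = pm.Gamma('sigma', alpha=1, beta=1, shape=(J, K))
        gamma = pm.Gamma.dist(mu=mu, sigma=sigma, shape=(J, K))

        # break label-switching symmetry: give -inf log-probability to any
        # configuration where cluster 1's means do not all exceed cluster 0's
        order = pm.Potential('order',
                             tt.switch(tt.all(mu[:, 1] > mu[:, 0]), 0., -np.inf))

        like = pm.Mixture('y', w=p, comp_dists=gamma, observed=X, shape=J)

        trace = pm.sample(1000)

    Constraining every dimension is more restrictive than necessary; ordering along a single dimension (or ordering the mixture weights) may already be enough in practice.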