pythonscikit-learngmmmixture

Can I fix the mean of one component of a Gaussian Mixture Model in python before fitting?


I am interested in fitting a 2-component Gaussian Mixture Model to the data shown below.log-transformed count ratio data, can't go above 0 However, since what I am plotting here are log-transformed counts normalized to be between 0-1, the maximum value my data will ever take is 0. When I try a naive fit using sklearn.mixture.GaussianMixture (code below), I get the resulting fit, which is obviously not what I want.

from sklearn.mixture import GaussianMixture
import numpy as np

# start with some count data in (0,1]
logged_counts = np.log(counts)
model = GaussianMixture(2).fit(logged_counts.reshape(-1,1))

# plot resulting fit
x_range = np.linspace(np.min(logged_counts), 0, 1000)
pdf = np.exp(model.score_samples(x_range.reshape(-1, 1)))
responsibilities = model.predict_proba(x_range.reshape(-1, 1))
pdf_individual = responsibilities * pdf[:, np.newaxis]

plt.hist(logged_counts, bins='auto', density=True, histtype='stepfilled', alpha=0.5)
plt.plot(x_range, pdf, '-k', label='Mixture')
plt.plot(x_range, pdf_individual, '--k', label='Components')
plt.legend()
plt.show()

fit using two-component GMM from sklearn I would love it if I could fix the mean of the top component at 0 and only optimize the other mean, the two variances, and the mixing fractions. (Additionally I would love to be able to use a half-normal for the component on the right.) Is there a simple way to do this with built-in functions in python/sklearn, or will I have to build that model myself using some probabilistic programming language?


Solution

  • Afaik, you cannot do exactly what you want in sklearn.

    Imho, basically there are multiple strategies: (i) implement GMM yourself, (ii) switch to another language/framework, (iii) adapt GMM code, or (iv) adapt.


    (i) You probably do not want to do this unless you want to learn for yourself.


    (ii) You could use stan and adapt the code in the last paragraph to have a fixed component of your choice (distribution type and parameters)


    (iii) You could do (i) but slightly adapt the sklearn code or simply use the methods for estimation but with you own slight modifications.


    (iv)


    Hope this helps :-)