pythonopenturns

How to estimate the parameters of a mixture model in OpenTURNS?


I would like to estimate the parameters of a mixture model of normal distributions in OpenTURNS (that is, the distribution of a weighted sum of Gaussian random variables). OpenTURNS can create such a mixture, but it cannot estimate its parameters. Moreover, I need to create the mixture as an OpenTURNS distribution in order to propagate uncertainty through a function.

For example, I know how to create a mixture of two normal distributions:

import openturns as ot
mu1 = 1.0
sigma1 = 0.5
mu2 = 3.0
sigma2 = 2.0
weights = [0.3, 0.7]
n1 = ot.Normal(mu1, sigma1)
n2 = ot.Normal(mu2, sigma2)
m = ot.Mixture([n1, n2], weights)

In this example, I would like to estimate mu1, sigma1, mu2, sigma2 on a given sample. In order to create a working example, it is easy to generate a sample by simulation.

s = m.getSample(100)

Solution

  • You can rely on scikit-learn's GaussianMixture to estimate the parameters and then use them to define a Mixture model in OpenTURNS.

    The script hereafter contains a Python class MixtureFactory that estimates the parameters of a scikitlearn GaussianMixture and outputs an OpenTURNS Mixture distribution:

    from sklearn.mixture import GaussianMixture
    from sklearn.utils.validation import check_is_fitted
    import openturns as ot
    import numpy as np
    
    class MixtureFactory(GaussianMixture):
        """
        Representation of a Gaussian mixture model probability distribution.
    
        This class allows to estimate the parameters of a Gaussian mixture
        distribution using scikit algorithms & provides openturns Mixture object.
        
        Read more in scikit learn user guide & openturns theory.
    
        Parameters:
        -----------
        n_components : int, defaults to 1.
            The number of mixture components.
    
        covariance_type : {'full' (default), 'tied', 'diag', 'spherical'}
            String describing the type of covariance parameters to use.
            Must be one of:
    
            'full'
                each component has its own general covariance matrix
            'tied'
                all components share the same general covariance matrix
            'diag'
                each component has its own diagonal covariance matrix
            'spherical'
                each component has its own single variance
    
        tol : float, defaults to 1e-3.
            The convergence threshold. EM iterations will stop when the
            lower bound average gain is below this threshold.
    
        reg_covar : float, defaults to 1e-6.
            Non-negative regularization added to the diagonal of covariance.
            Allows to assure that the covariance matrices are all positive.
    
        max_iter : int, defaults to 100.
            The number of EM iterations to perform.
    
        n_init : int, defaults to 1.
            The number of initializations to perform. The best results are kept.
    
        init_params : {'kmeans', 'random'}, defaults to 'kmeans'.
            The method used to initialize the weights, the means and the
            precisions.
            Must be one of::
    
                'kmeans' : responsibilities are initialized using kmeans.
                'random' : responsibilities are initialized randomly.
    
        weights_init : array-like, shape (n_components, ), optional
            The user-provided initial weights, defaults to None.
            If it None, weights are initialized using the `init_params` method.
    
        means_init : array-like, shape (n_components, n_features), optional
            The user-provided initial means, defaults to None,
            If it None, means are initialized using the `init_params` method.
    
        precisions_init : array-like, optional.
            The user-provided initial precisions (inverse of the covariance
            matrices), defaults to None.
            If it None, precisions are initialized using the 'init_params' method.
            The shape depends on 'covariance_type'::
    
                (n_components,)                        if 'spherical',
                (n_features, n_features)               if 'tied',
                (n_components, n_features)             if 'diag',
                (n_components, n_features, n_features) if 'full'
    
        random_state : int, RandomState instance or None, optional (default=None)
            If int, random_state is the seed used by the random number generator;
            If RandomState instance, random_state is the random number generator;
            If None, the random number generator is the RandomState instance used
            by `np.random`.
    
        warm_start : bool, default to False.
            If 'warm_start' is True, the solution of the last fitting is used as
            initialization for the next call of fit(). This can speed up
            convergence when fit is called several times on similar problems.
            In that case, 'n_init' is ignored and only a single initialization
            occurs upon the first call.
            See :term:`the Glossary <warm_start>`.
    
        verbose : int, default to 0.
            Enable verbose output. If 1 then it prints the current
            initialization and each iteration step. If greater than 1 then
            it prints also the log probability and the time needed
            for each step.
    
        verbose_interval : int, default to 10.
            Number of iteration done before the next print.
        """
        def __init__(self, n_components=2, covariance_type='full', tol=1e-6,
                     reg_covar=1e-6, max_iter=1000, n_init=1, init_params='kmeans',
                     weights_init=None, means_init=None, precisions_init=None,
                     random_state=41, warm_start=False,
                     verbose=0, verbose_interval=10):
            super().__init__(n_components, covariance_type, tol, reg_covar,
                             max_iter, n_init, init_params, weights_init, means_init,
                             precisions_init, random_state, warm_start, verbose, verbose_interval)
        def fit(self, X):
            """
            Fit the mixture model parameters.
    
            EM algorithm is applied here to estimate the model parameters and build a
            Mixture distribution (see openturns mixture). 
            The method fits the model ``n_init`` times and sets the parameters with
            which the model has the largest likelihood or lower bound. Within each
            trial, the method iterates between E-step and M-step for ``max_iter``
            times until the change of likelihood or lower bound is less than
            ``tol``, otherwise, a ``ConvergenceWarning`` is raised.
            If ``warm_start`` is ``True``, then ``n_init`` is ignored and a single
            initialization is performed upon the first call. Upon consecutive
            calls, training starts where it left off.
    
            Parameters
            ----------
            X : array-like, shape (n_samples, n_features)
                List of n_features-dimensional data points. Each row
                corresponds to a single data point.
    
            Returns
            -------
            """
            data = np.array(X)
            # Evaluate the model parameters.
            super().fit(data)
            # openturns mixture
        
            # n_components ==> weight of size n_components
            weights = self.weights_
            n_components = len(weights)
            # Create ot distribution
            collection = n_components * [0]
            # Covariance matrices
            cov = self.covariances_
            mu = self.means_
            # means : n_components x n_features
            n_components, n_features = mu.shape
    
            # Following the type of covariance, we define the collection of gaussians
    
            # Spherical : C_k = Identity * sigma_k
            if self.covariance_type is 'spherical':
                c = ot.CorrelationMatrix(n_features)
                for l in range(n_components):
                    sigma = np.sqrt(cov[l])
                    collection[l] = ot.Normal(list(mu[l]), [ sigma ] * n_features  , c)
    
            elif self.covariance_type is 'diag' :
                for l in range(n_components):
                    c = ot.CovarianceMatrix(n_features)
                    for i in range(n_features):
                        c[i,i] = cov[l, i]
                    collection[l] = ot.Normal(list(mu[l]), c)
    
            elif self.covariance_type == 'tied':
                # Same covariance for all clusters
                c = ot.CovarianceMatrix(n_features)
                for i in range(n_features):
                    for j in range(0, i+1):
                        c[i,j] = cov[i,j]
                # Define the collection with the same covariance
                for l in range(n_components):
                    collection[l] = ot.Normal(list(mu[l]), c)
                
            else:
                n_features = cov.shape[1]
                for l in range(n_components):
                    c = ot.CovarianceMatrix(n_features)
                    for i in range(n_features):
                        for j in range(0, i+1):
                            c[i,j] = cov[l][i,j]
                    collection[l] = ot.Normal(list(mu[l]), c)
    
            self._mixture = ot.Mixture(collection, weights)
            return self
    
        def get_mixture(self):
            """
            Returns the Mixture object
            """
            check_is_fitted(self)
            return self._mixture
    
    if __name__ == "__main__":
        mu1 = 1.0
        sigma1 = 0.5
        mu2 = 3.0
        sigma2 = 2.0
        weights = [0.3, 0.7]
        n1 = ot.Normal(mu1, sigma1)
        n2 = ot.Normal(mu2, sigma2)
        m = ot.Mixture([n1, n2], weights)
        x = m.getSample(1000)
        est_dist = MixtureFactory(random_state=1)
        est_dist.fit(x)
        print(est_dist.get_mixture())