Tags: machine-learning, scikit-learn, expectation-maximization, yellowbrick

The supplied model is not a clustering estimator in YellowBrick


I am trying to draw an elbow plot for my data using Yellowbrick's KElbowVisualizer with scikit-learn's expectation-maximization estimator, GaussianMixture.

When I run the code below, I get the error in the title. (I have also tried ClassificationReport, but that fails as well.)

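from sklearn.mixture import GaussianMixture
from yellowbrick.cluster import KElbowVisualizer

model = GaussianMixture()

# get_data is a project-specific helper (not shown)
data = get_data(data_name, preprocessor_name, train_split=0.75)
X, y, x_test, y_test = data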

visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

I cannot find anything in Yellowbrick that helps me estimate the number of components for expectation maximization.


Solution

  • Yellowbrick uses scikit-learn's estimator type checks to decide whether a model is suited to a visualizer. You can pass force_model=True to bypass the type check (though it seems the KElbow documentation needs to be updated to mention this).

    However, getting past the YellowbrickTypeError with force_model=True does not mean that GaussianMixture works with KElbow. The elbow visualizer is built around the centroidal clustering API and requires both an n_clusters hyperparameter and a labels_ learned attribute. Expectation-maximization models such as GaussianMixture do not support this API: the hyperparameter is called n_components, and cluster assignments come from predict() rather than a labels_ attribute.
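The mismatch can be seen by inspecting the two kinds of estimator directly. The sketch below is illustrative only: KMeans as the stand-in centroidal clusterer and the synthetic make_blobs data are assumptions, not part of the original question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data, assumed here purely for illustration.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# The hyperparameter names differ: n_clusters vs. n_components.
kmeans_has_n_clusters = "n_clusters" in kmeans.get_params()
gmm_has_n_clusters = "n_clusters" in gmm.get_params()
gmm_has_n_components = "n_components" in gmm.get_params()

# After fitting, KMeans stores labels_; GaussianMixture only offers predict().
kmeans_has_labels = hasattr(kmeans, "labels_")
gmm_has_labels = hasattr(gmm, "labels_")
gmm_labels = gmm.predict(X)

print(kmeans_has_n_clusters, gmm_has_n_clusters, gmm_has_n_components)
print(kmeans_has_labels, gmm_has_labels, gmm_labels.shape)
```

Any wrapper therefore has to supply both missing pieces: an n_clusters parameter and a labels_ attribute set during fit.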

    However, it is possible to create a wrapper around the Gaussian mixture model that will allow it to work with the elbow visualizer (and a similar method could be used with the classification report as well).

    from sklearn.base import ClusterMixin
    from sklearn.mixture import GaussianMixture
    from yellowbrick.cluster import KElbow
    from yellowbrick.datasets import load_nfl
    
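    class GMClusters(GaussianMixture, ClusterMixin):
        """GaussianMixture exposing the n_clusters / labels_ API that KElbow expects."""
    
        def __init__(self, n_clusters=1, **kwargs):
            kwargs["n_components"] = n_clusters
            super().__init__(**kwargs)
            # Store n_clusters so get_params()/set_params() can read and update it.
            self.n_clusters = n_clusters
    
        def fit(self, X, y=None):
            # KElbow changes n_clusters through set_params(); mirror it onto n_components.
            self.n_components = self.n_clusters
            super().fit(X, y)
            # Hard cluster assignments, as centroidal clusterers provide.
            self.labels_ = self.predict(X)
            return self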
    
    
    X, _ = load_nfl()
    oz = KElbow(GMClusters(), k=(4,12), force_model=True)
    oz.fit(X)
    oz.show()
    

    This does produce a KElbow plot (though not a great one for this particular dataset):

    [Figure: KElbow with distortion score]

    Another answer mentioned Calinski-Harabasz scores, which you can use in the KElbow visualizer as follows:

    oz = KElbow(GMClusters(), k=(4,12), metric='calinski_harabasz', force_model=True)
    oz.fit(X)
    oz.show()
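For reference, a similar Calinski-Harabasz curve can be computed without the visualizer. This is a minimal sketch, assuming synthetic make_blobs data and scikit-learn's calinski_harabasz_score. It is not claimed to reproduce Yellowbrick's internals, and treating k=(4,12) as range(4, 12) is an assumption about how the tuple is expanded.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption); substitute your own feature matrix.
X, _ = make_blobs(n_samples=500, centers=6, random_state=0)

scores = {}
for k in range(4, 12):
    gmm = GaussianMixture(n_components=k, random_state=0)
    # fit_predict fits the mixture and returns hard component assignments.
    labels = gmm.fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# Higher Calinski-Harabasz scores indicate denser, better-separated clusters.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: {score:.1f}")
print("highest score at k =", best_k)
```

Because this works on GaussianMixture directly, it needs neither the wrapper nor force_model, at the cost of drawing the plot yourself.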
    

    Writing a wrapper isn't ideal, but for model types that don't fit the standard sklearn classifier or clusterer APIs, wrappers are often necessary, and the technique is a good one to have in your back pocket for a number of ML tasks.