pyclustering: visualising xmeans clusters when the data has more than three dimensions


I'm trying to cluster and visualise some data with xmeans from the pyclustering library. I copied the code directly from the example in the documentation:

from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import SIMPLE_SAMPLES
sample = X # read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)    
# Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will
# start analysis.
amount_initial_centers = 2
initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize()
# Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum
# number of clusters that can be allocated is 20.
xmeans_instance = xmeans(sample, initial_centers, 20)
xmeans_instance.process()
# Extract clustering results: clusters and their centers
clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()
# Print total sum of metric errors
print("Total WCE:", xmeans_instance.get_total_wce())
# Visualize clustering results
visualizer = cluster_visualizer()
visualizer.append_clusters(clusters, sample)
visualizer.append_cluster(centers, None, marker='*', markersize=10)
visualizer.show()

The only difference is that I assign my matrix X to sample instead of loading a sample dataset.

When I try to visualise the clustering result I get this error:

Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.

My X matrix is generated in this way:

features = ["I", "Iu", ...]  # plus 7 other column names
data = df[features]
...
X = scaler.fit_transform(data)

Is there a way to visualise the clusters by plotting only two or three features at a time?

I can't find anything in the documentation.

I tried this:

visualizer.append_clusters(clusters, sample[:,[0,1]])

in order to visualise only the first two features, but got this error:

Only clusters with the same dimension of objects can be displayed on canvas.

EDIT:

I updated the code as suggested in the answer by annoviko, but now I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-69-6fd7d2ce5fcd> in <module>
     20 visualizer.append_clusters(clusters, X)
     21 visualizer.append_cluster(centers, None, marker='*', markersize=10)
---> 22 visualizer.show(pair_filter=[[0, 1], [0, 2]])

/usr/local/lib/python3.8/site-packages/pyclustering/cluster/__init__.py in show(self, pair_filter, **kwargs)
    224             raise ValueError("There is no non-empty clusters for visualization.")
    225 
--> 226         cluster_data = self.__clusters[0].data or self.__clusters[0].cluster
    227         dimension = len(cluster_data[0])
    228 

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

It is raised by visualizer.show(), and it happens even if I remove the pair_filter argument from the call.


Solution

  • As the error message you got says:

    Only objects with size dimension 1 (1D plot), 2 (2D plot) or 3 (3D plot) can be displayed. For multi-dimensional data use 'cluster_visualizer_multidim'.

    You have to use cluster_visualizer_multidim, as the message suggests. The pyclustering 0.10.1 documentation includes an example: https://pyclustering.github.io/docs/0.10.1/html/dc/d6b/classpyclustering_1_1cluster_1_1cluster__visualizer__multidim.html

    For example, if your data has more than three dimensions (D > 3) and you want to display the projections (x0, x1) and (x0, x2), you can do it like this:

    from pyclustering.cluster import cluster_visualizer_multidim

    visualizer = cluster_visualizer_multidim()
    visualizer.append_clusters(clusters, sample_4d)
    visualizer.show(pair_filter=[[0, 1], [0, 2]])
    

    Here pair_filter specifies which pairs of features are shown. In the example above, only (x0, x1), given as [0, 1], and (x0, x2), given as [0, 2], are displayed.

    So, in your particular case, where you want to display only the first two features, it should be:

    visualizer = cluster_visualizer_multidim()
    visualizer.append_clusters(clusters, sample)
    visualizer.show(pair_filter=[[0, 1]])
    

    I think I should make the error more readable and suggest the other class in its first sentence. Let me know if this helps (if it is still relevant for you).
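
Regarding the ValueError from the question's EDIT: the traceback points at the expression self.__clusters[0].data or self.__clusters[0].cluster, which asks NumPy for the truth value of the whole matrix X. A possible workaround, sketched below, is to pass the data as plain nested lists instead of a NumPy array. The helper names to_plain_lists and show_feature_pairs are made up for this illustration and are not part of pyclustering. Whether releases other than the one in the traceback still need the conversion is an assumption that has not been checked here.

```python
def to_plain_lists(matrix):
    """Return a 2-D array-like (NumPy array, list of tuples, ...) as nested lists of floats."""
    return [[float(value) for value in row] for row in matrix]


def show_feature_pairs(clusters, data, pairs, centers=None):
    """Plot only the given feature pairs, e.g. pairs=[[0, 1], [0, 2]].

    Hypothetical helper: pyclustering is imported lazily so that
    to_plain_lists stays usable without it.
    """
    from pyclustering.cluster import cluster_visualizer_multidim

    visualizer = cluster_visualizer_multidim()
    # Lists have an unambiguous truth value, unlike multi-element NumPy arrays.
    visualizer.append_clusters(clusters, to_plain_lists(data))
    if centers is not None:
        # Centers are passed as their own "cluster" of points, so no data argument.
        visualizer.append_cluster(to_plain_lists(centers), marker='*', markersize=10)
    visualizer.show(pair_filter=pairs)
```

With the variables from the question, this would be called as show_feature_pairs(clusters, X, [[0, 1], [0, 2]], centers).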