Tags: python, scikit-learn, cluster-analysis, hierarchical-clustering

How to use the Agglomerative Clustering algorithm from scikit-learn python library with a declared number of objects in the cluster?


I use AgglomerativeClustering from the scikit-learn Python library to automatically cluster points and place a new, larger point at the center of each cluster. I have a set of several thousand points with X and Y coordinates stored in a DataFrame. When setting the parameters, I can only use n_clusters to set the resulting number of clusters, or distance_threshold to set the linkage distance above which clusters are no longer merged.

I would like to set the target number of points in each cluster instead, e.g. 200, so that each resulting cluster has 200 points. Some tolerance would be acceptable, i.e. clusters could have from 170 to 230 points. Is there a parameter that would help me? Or should I write a function that merges clusters that are too small into others and splits clusters that are too large (or places two centers in them)? Or should I use a different clustering algorithm?

I generate a dataframe with 5400 points and then try to cluster them. However, I am not sure if I am using Agglomerative Clustering correctly and I do not know how to solve the problem of not being able to set the target number of points in the clusters. Below is my code and the current clustering result:

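import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

np.random.seed(0)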
n_points = 5400
df = pd.DataFrame({
    'X': np.random.randn(n_points),
    'Y': np.random.randn(n_points),
})

model = AgglomerativeClustering(
    n_clusters=27,
    distance_threshold=None,
    linkage="ward"
)

clustered = model.fit_predict(df)
df['clustered'] = clustered

centroids = df.groupby('clustered').agg({'X': 'mean', 'Y': 'mean'}).reset_index()
centroids_df = pd.DataFrame({
    'X': centroids['X'],
    'Y': centroids['Y'],
    'Clustered': centroids['clustered']
})

plt.scatter(df['X'], df['Y'], c=clustered, s=1)
plt.scatter(centroids_df['X'], centroids_df['Y'], c='black', s=200)
plt.show()



Solution

  • AgglomerativeClustering cannot guarantee equal-sized clusters. The merging process depends only on the linkage distances between clusters, and there is no built-in parameter or mechanism to constrain how many points each cluster ends up with.

    If you want equal-sized clusters, try another clustering algorithm, such as this one: https://github.com/anamabo/Equal-Size-Spectral-Clustering
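
If an extra dependency is not wanted, one possible alternative is to run ordinary k-means and then reassign the points to the resulting centers under a size constraint. The sketch below is not agglomerative clustering and is not the method used by the linked repository. It is a swapped-in technique: k-means centers plus an optimal assignment via scipy.optimize.linear_sum_assignment. The function name equal_size_kmeans and its parameters are made up for this illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans


def equal_size_kmeans(points, n_clusters, random_state=0):
    """Hypothetical helper: label points so that cluster sizes differ by at most 1."""
    points = np.asarray(points, dtype=float)
    n_points = len(points)

    # Step 1: ordinary k-means, used only to obtain reasonable centers.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    centers = km.fit(points).cluster_centers_

    # Step 2: give every cluster floor(n/k) or ceil(n/k) "slots".
    sizes = np.full(n_clusters, n_points // n_clusters)
    sizes[: n_points % n_clusters] += 1
    slot_to_cluster = np.repeat(np.arange(n_clusters), sizes)

    # Step 3: assign each point to exactly one slot, minimizing the total
    # squared distance between points and the centers owning their slots.
    cost = cdist(points, centers[slot_to_cluster], metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)

    labels = np.empty(n_points, dtype=int)
    labels[rows] = slot_to_cluster[cols]
    return labels
```

With the DataFrame from the question, this could be called as df['clustered'] = equal_size_kmeans(df[['X', 'Y']].to_numpy(), n_clusters=27), which would give 27 clusters of 200 points each. Note that the cost matrix has one row and one column per point, so for 5400 points it holds about 29 million entries (roughly 230 MB as float64) and the assignment step can take a while. The centers are not recomputed after the reassignment, so the grouped means of the final clusters will differ slightly from the k-means centers.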