I have a pandas dataframe defined as
           Alejandro     Ana  Beatriz    Jose    Juan     Luz   Maria   Ruben
Alejandro        0.0     0.0   1000.0     0.0  1037.0  1014.0   100.0     0.0
Ana              0.0     0.0     15.0     0.0   100.0     0.0    16.0  1100.0
Beatriz       1000.0    15.0      0.0   100.0  1000.0  1100.0    15.0     0.0
Jose             0.0     0.0    100.0     0.0     0.0   100.0  1000.0    14.0
Juan          1037.0   100.0   1000.0     0.0     0.0  1014.0     0.0   100.0
Luz           1014.0     0.0   1100.0   100.0  1014.0     0.0     0.0     0.0
Maria          100.0    16.0     15.0  1000.0     0.0     0.0     0.0     0.0
Ruben            0.0  1100.0      0.0    14.0   100.0     0.0     0.0     0.0
This dataframe contains compatibility measurements. I want to group these people into groups of two or three people, for example [Alejandro, Ana, Beatriz], [Jose, Juan, Luz], [Maria, Ruben].
To do that, I need to maximise the compatibility within each group.
Can someone please point me towards a methodology?
It looks like you are starting off with a distance matrix rather than the original sample values. AgglomerativeClustering can work with the distance matrix to group the samples into however many clusters you specify (other algorithms that directly accept a precomputed distance matrix include DBSCAN, HDBSCAN, SpectralClustering, and OPTICS).
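As an aside, here is a minimal sketch of what passing the same precomputed matrix to one of those alternatives, DBSCAN, could look like. It assumes the df distance matrix built in the next code block, and the eps threshold is an illustrative guess, not a tuned value:
from sklearn.cluster import DBSCAN

#Hypothetical: cluster directly on the precomputed distances
db = DBSCAN(eps=120.0, min_samples=2, metric='precomputed')
db_labels = db.fit_predict(df)  # -1 marks samples DBSCAN treats as noise
Note that DBSCAN does not let you fix the number of clusters, which is why the rest of this answer sticks with AgglomerativeClustering.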
In the code below, I ran AgglomerativeClustering on the data to get the cluster assigned to each name. Then, for visualisation, I represented the original distance matrix in 2D and coloured the points by their assigned cluster.
The data:
import pandas as pd
import matplotlib.pyplot as plt
#Data: distance matrix
data = {
    'Alejandro': [0.0, 0.0, 1000.0, 0.0, 1037.0, 1014.0, 100.0, 0.0],
    'Ana': [0.0, 0.0, 15.0, 0.0, 100.0, 0.0, 16.0, 1100.0],
    'Beatriz': [1000.0, 15.0, 0.0, 100.0, 1000.0, 1100.0, 15.0, 0.0],
    'Jose': [0.0, 0.0, 100.0, 0.0, 0.0, 100.0, 1000.0, 14.0],
    'Juan': [1037.0, 100.0, 1000.0, 0.0, 0.0, 1014.0, 0.0, 100.0],
    'Luz': [1014.0, 0.0, 1100.0, 100.0, 1014.0, 0.0, 0.0, 0.0],
    'Maria': [100.0, 16.0, 15.0, 1000.0, 0.0, 0.0, 0.0, 0.0],
    'Ruben': [0.0, 1100.0, 0.0, 14.0, 100.0, 0.0, 0.0, 0.0]
}
df = pd.DataFrame(
    data,
    index=['Alejandro', 'Ana', 'Beatriz', 'Jose', 'Juan', 'Luz', 'Maria', 'Ruben']
)
df
Perform the clustering:
#
#Cluster based on distances
#
from sklearn.cluster import AgglomerativeClustering
n_clusters = 3
clusterer = AgglomerativeClustering(
    n_clusters=n_clusters, metric='precomputed', linkage='average'
)
label_foreach_name = clusterer.fit(df).labels_
#View results as a table
cluster_info = pd.DataFrame(
    {'name': df.columns,
     'cluster': label_foreach_name}
).set_index('cluster').sort_index()
display(cluster_info)
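If you would rather see each cluster as a plain list of names than as a table, a small pandas sketch (groups is just an illustrative variable name):
#Collect the member names of each cluster into a list
groups = (
    pd.Series(df.columns, index=label_foreach_name)
      .groupby(level=0)
      .apply(list)
)
print(groups)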
Visualise in a 2D coordinate space:
#
# Visualisation
# Represent in 2D, coloured by cluster
#
#Invert distance matrix to 2D coordinate space
from sklearn.manifold import MDS
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords_2d = mds.fit_transform(df)
#Plot points, coloured by assigned cluster
import matplotlib.pyplot as plt
cluster_cmap = plt.get_cmap('Set1', n_clusters)
f, ax = plt.subplots(figsize=(4, 4))
ax.scatter(coords_2d[:, 0], coords_2d[:, 1],
           c=cluster_cmap(label_foreach_name),
           marker='s', s=100)
ax.set_xlabel('x')
ax.set_ylabel('y')
f.suptitle('Inversion of the distance matrix to 2D', fontsize=11)
ax.set_title('Samples coloured by cluster', fontsize=10)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
#Add text annotations
for row_num, name in enumerate(df.columns):
    ax.annotate(name, xy=coords_2d[row_num], xytext=(10, -5), textcoords='offset pixels')
#Add legend
for colour, label in zip(cluster_cmap(range(n_clusters)),
                         [f'cluster {i}' for i in range(n_clusters)]):
    ax.scatter([], [], marker='s', s=50, c=colour, label=label)
f.legend(bbox_to_anchor=(1.3, 1))
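Since AgglomerativeClustering is hierarchical, a dendrogram is another useful view of the result. Here is a sketch using scipy, where squareform converts the square distance matrix into the condensed form that linkage expects, and 'average' matches the linkage used above:
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

condensed = squareform(df.values)         #condensed distance vector
Z = linkage(condensed, method='average')  #same linkage as the clusterer
fig, ax = plt.subplots(figsize=(6, 3))
dendrogram(Z, labels=df.columns.tolist(), ax=ax)
ax.set_ylabel('distance')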
Updated solution
The OP requires that cluster sizes are limited to 2 or 3, i.e. bounded between user-defined values. Initially I tried HDBSCAN as it accepts a min and max specification for cluster sizes, but it failed with this small dataset (more info at the bottom).
My attempt below runs lots of KMeans trials to find a clustering that is suitable. It stops when it finds a clustering that contains no bad clusters (a bad cluster is one whose size doesn't meet the user-defined spec). A downside of this approach is that the quality of the clustering might be poor or variable, as we are relying on random initialisations of KMeans.
from sklearn.cluster import KMeans
import numpy as np
#Specify the constraints
min_cluster_size = 2
max_cluster_size = 3
max_tries = 400
#Use constraints to determine acceptable cluster sizes
acceptable_cluster_sizes = np.arange(min_cluster_size, max_cluster_size + 1)
nclusters_to_try = np.arange(2, len(df) + 1)
# nclusters_to_try = np.arange(len(df) // min_cluster_size, 2, step=-1)
rand_state = np.random.RandomState(0)
labels = None
for n_clusters in nclusters_to_try:
    if labels is not None:
        break
    for attempt in range(max_tries):
        clusterer = KMeans(
            n_clusters=n_clusters,
            init='random',
            n_init=1,
            max_iter=300,
            random_state=rand_state
        ).fit(df)
        clusters, counts = np.unique(clusterer.labels_, return_counts=True)
        n_good_clusters = np.isin(counts, acceptable_cluster_sizes).sum()
        n_bad_clusters = n_clusters - n_good_clusters
        print(
            f'n_clusters: {n_clusters:>2d}',
            f' attempt {attempt + 1:>4d}/{max_tries:>4d}',
            f'{n_bad_clusters:>2d} bad clusters',
            end='\r' if attempt + 1 < max_tries else '\n'
        )
        if n_bad_clusters == 0:
            labels = clusterer.labels_
            break
if labels is None:
    raise TimeoutError(f'No solution found when max_tries={max_tries}')
#View results as a table
cluster_info = pd.DataFrame(
    {'name': df.columns,
     'cluster': labels}
).set_index('cluster').sort_index()
display(cluster_info)
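Because the accepted clustering depends on random initialisations, you may want to sanity-check a solution before keeping it. One simple score is the mean pairwise distance within each cluster, read straight off the original matrix (lower is better for a distance matrix; mean_within_cluster_distance is an illustrative helper, not a library function):
#Hypothetical quality check: mean within-cluster pairwise distance
def mean_within_cluster_distance(dist_df, labels):
    scores = {}
    for cluster in np.unique(labels):
        members = dist_df.index[labels == cluster]
        sub = dist_df.loc[members, members].values
        n = len(members)
        #Diagonal is zero, so average over the n*(n-1) off-diagonal entries
        scores[cluster] = sub.sum() / (n * (n - 1))
    return scores

print(mean_within_cluster_distance(df, labels))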
Failed, but worth trying on more data
Initially I tried HDBSCAN as it takes both a min_cluster_size= and a max_cluster_size= parameter. However, it was flagging all the samples as anomalies. You might have better luck with a larger dataset:
#Using HDBSCAN
from sklearn.cluster import HDBSCAN
min_cluster_size = 2
max_cluster_size = 3
clusterer = HDBSCAN(
    min_cluster_size=min_cluster_size,
    max_cluster_size=max_cluster_size,
    metric='precomputed',
)
label_foreach_name = clusterer.fit(df).labels_
#View results as a table
cluster_info = pd.DataFrame(
    {'name': df.columns,
     'cluster': label_foreach_name}
).set_index('cluster').sort_index()
cluster_info
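Note that HDBSCAN labels the samples it cannot place in any cluster as -1, which is how the all-anomalies outcome shows up in the table above. A quick check:
#Count how many samples HDBSCAN flagged as noise (label -1)
n_noise = (label_foreach_name == -1).sum()
print(f'{n_noise} of {len(label_foreach_name)} samples flagged as noise')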