pythonpandasoptimizationcluster-analysisheuristics

Looking for ways of clustering data acording to distance


I have a pandas dataframe defined as

               Alejandro Ana  Beatriz    Jose    Juan     Luz   Maria   Ruben                                                          
Alejandro        0.0     0.0   1000.0     0.0  1037.0  1014.0   100.0     0.0
Ana              0.0     0.0     15.0     0.0   100.0     0.0    16.0  1100.0
Beatriz       1000.0    15.0      0.0   100.0  1000.0  1100.0    15.0     0.0
Jose             0.0     0.0    100.0     0.0     0.0   100.0  1000.0    14.0
Juan          1037.0   100.0   1000.0     0.0     0.0  1014.0     0.0   100.0
Luz           1014.0     0.0   1100.0   100.0  1014.0     0.0     0.0     0.0
Maria          100.0    16.0     15.0  1000.0     0.0     0.0     0.0     0.0
Ruben            0.0  1100.0      0.0    14.0   100.0     0.0     0.0     0.0

This dataframe contains compatibility measurements I want to group these people in groups of two or three people. This could be [Alejandro, Ana, Beatriz], [Jose, Juan, Luz], [Maria, Ruben].

To do that, I have to maximize the compatibility inside each group.

Can someone please point me towards a methodology?


Solution

  • It looks like you are starting off with a distance matrix rather than the original sample values. AgglomerativeClustering can work with the distance matrix to group the samples into however many clusters you specify (other algorithms directly accepting a precomputed distance matrix include DBSCAN, HDBSCAN, SpectralClustering, and OPTICS).

    In the code below, I ran AgglomerativeClustering on the data to get the cluster assigned to each name. Then, for visualisation, I represented the original distance matrix in 2D, and coloured the points by their assigned cluster.

    enter image description here

    The data:

    import pandas as pd
    import matplotlib.pyplot as plt
    
    #Data: distance matrix
    data = {
        'Alejandro': [0.0, 0.0, 1000.0, 0.0, 1037.0, 1014.0, 100.0, 0.0],
        'Ana': [0.0, 0.0, 15.0, 0.0, 100.0, 0.0, 16.0, 1100.0],
        'Beatriz': [1000.0, 15.0, 0.0, 100.0, 1000.0, 1100.0, 15.0, 0.0],
        'Jose': [0.0, 0.0, 100.0, 0.0, 0.0, 100.0, 1000.0, 14.0],
        'Juan': [1037.0, 100.0, 1000.0, 0.0, 0.0, 1014.0, 0.0, 100.0],
        'Luz': [1014.0, 0.0, 1100.0, 100.0, 1014.0, 0.0, 0.0, 0.0],
        'Maria': [100.0, 16.0, 15.0, 1000.0, 0.0, 0.0, 0.0, 0.0],
        'Ruben': [0.0, 1100.0, 0.0, 14.0, 100.0, 0.0, 0.0, 0.0]
    }
    
    df = pd.DataFrame(
        data,
        index=['Alejandro', 'Ana', 'Beatriz', 'Jose', 'Juan', 'Luz', 'Maria', 'Ruben']
    )
    
    df
    

    Perform the clustering:

    #
    #Cluster based on distances
    #
    from sklearn.cluster import AgglomerativeClustering
    
    n_clusters = 3
    clusterer = AgglomerativeClustering(
        n_clusters=n_clusters, metric='precomputed', linkage='average'
    )
    label_foreach_name = clusterer.fit(df).labels_
    
    #View results as a table
    cluster_info = pd.DataFrame(
        {'name': df.columns,
         'cluster': label_foreach_name}
    ).set_index('cluster').sort_index()
    
    display(
        cluster_info
    )
    
    

    Visualise in a 2D coordinate space:

    #
    # Visualisation
    # Represent in 2D, coloured by cluster
    #
    
    #Invert distance matrix to 2D coordinate space
    from sklearn.manifold import MDS
    
    mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
    coords_2d = mds.fit_transform(df)
    
    #Plot points, coloured by assigned cluster
    import matplotlib.pyplot as plt
    cluster_cmap = plt.get_cmap('Set1', n_clusters)
    
    f, ax = plt.subplots(figsize=(4, 4))
    ax.scatter(coords_2d[:, 0], coords_2d[:, 1],
               c=cluster_cmap(label_foreach_name),
               marker='s', s=100)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    f.suptitle('Inversion of the distance matrix to 2D', fontsize=11)
    ax.set_title('Samples coloured by cluster', fontsize=10)
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    
    #Add text annotations
    for row_num, name in enumerate(df.columns):
        ax.annotate(name, xy=coords_2d[row_num], xytext=(10, -5), textcoords='offset pixels')
    
    #Add legend
    cluster_colours = plt.get_cmap
    for colour, label in zip(cluster_cmap(range(n_clusters)),
                             [f'cluster {i}' for i in range(n_clusters)]):
        ax.scatter([], [], marker='s', s=50, c=colour, label=label)
    f.legend(bbox_to_anchor=(1.3, 1))
    

    Updated solution OP requires that cluster sizes are limited to 2 or 3, i.e. bounded between user-defined values. Initially I tried HDBSCAN as it accepts a min and max specification for cluster sizes, but it failed with this small dataset (more info at the bottom).

    My attempt below runs lots of KMeans trials to find a clustering that is suitable. It stops when it finds a clustering that contains no bad clusters (a bad cluster is where the size doesn't meet the user-defined spec). A downside of this approach is that the quality of the clustering might be poor or variable as we are relying on random initialisations of KMeans.

    enter image description here

    from sklearn.cluster import AgglomerativeClustering, KMeans
    import numpy as np
    
    #Specify the constraints
    min_cluster_size = 2
    max_cluster_size = 3
    max_tries = 400
    
    #Use constraints to determine acceptable cluster sizes
    acceptable_cluster_sizes = np.arange(min_cluster_size, max_cluster_size + 1)
    
    nclusters_to_try = np.arange(2, len(df) + 1)
    # nclusters_to_try = np.arange(len(df) // min_cluster_size, 2, step=-1)
    
    rand_state = np.random.RandomState(0)
    
    labels = None
    for n_clusters in nclusters_to_try:
        if labels is not None:
            break
        
        for attempt in range(max_tries):
            clusterer = KMeans(
                n_clusters=n_clusters,
                init='random',
                n_init=1,
                max_iter=300,
                random_state=rand_state
            ).fit(df)
            
            clusters, counts = np.unique(clusterer.labels_, return_counts=True)
            n_good_clusters = np.isin(counts, acceptable_cluster_sizes).sum()
            n_bad_clusters = n_clusters - n_good_clusters
            
            print(
                f'n_clusters: {n_clusters:>2d}',
                f'  attempt {attempt + 1:>4d}/{max_tries:>4d}',
                f'{n_bad_clusters:>2d} bad clusters',
                end='\r' if attempt + 1 < max_tries else '\n'
            )
            
            if n_bad_clusters == 0:
                labels = clusterer.labels_
                break
    
    if labels is None:
        raise TimeoutError(f'No solution found when max_tries={max_tries}')
    
    #View results as a table
    cluster_info = pd.DataFrame(
        {'name': df.columns,
         'cluster': labels}
    ).set_index('cluster').sort_index()
    
    display(
        cluster_info
    )
    
    
    

    Failed, but worth trying on more data: Initially I tried HDBSCAN as it takes both a min_cluster_size= and max_cluster_size= parameter. However, it was flagging all the samples as anomalies. You might have better luck with a larger dataset:

    #Using HDBSCAN
    from sklearn.cluster import HDBSCAN
    
    min_cluster_size = 2
    max_cluster_size = 3
    
    clusterer = HDBSCAN(
        min_cluster_size=min_cluster_size,
        max_cluster_size=max_cluster_size,
        metric='precomputed',
    )
    
    label_foreach_name = clusterer.fit(df).labels_
    
    #View results as a table
    cluster_info = pd.DataFrame(
        {'name': df.columns,
         'cluster': label_foreach_name}
    ).set_index('cluster').sort_index()
    
    cluster_info