Tags: python, matplotlib, scipy, hierarchical-clustering, dendrogram

Retrieve leaf colors from a SciPy dendrogram


I cannot get the leaf colors from the SciPy dendrogram dictionary. As stated in the documentation and in this GitHub issue, the color_list key in the dendrogram dictionary refers to the links, not the leaves. It would be nice to have another key referring to the leaves, since the leaf colors are sometimes needed for coloring other types of plots, such as the scatter plot in the example below.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# DATA EXAMPLE
x = np.array([[ 5, 3],
              [10,15],
              [15,12],
              [24,10],
              [30,30],
              [85,70],
              [71,80]])

# DENDROGRAM
plt.figure()
plt.subplot(121)
z = linkage(x, 'single')
d = dendrogram(z)

# COLORED PLOT
# This is what I would like to achieve. Colors are assigned manually by looking
# at the dendrogram, because I failed to get them from d['color_list'] (it refers
# to links, not observations).
plt.subplot(122)
points = d['leaves']
colors = ['r','r','g','g','g','g','g']
for point, color in zip(points, colors):
    plt.plot(x[point, 0], x[point, 1], 'o', color=color)

Output from code example above

Manual color assignment seems easy in this example, but I'm dealing with huge datasets. Until this feature (leaf colors) is added to the dictionary, I'm trying to infer the leaf colors from the information the dictionary currently contains, but I'm out of ideas so far. Can anyone help me?

Thanks.


Solution

  • The following approach seems to work. The dictionary returned by dendrogram contains 'color_list' with the colors of the links, and 'icoord' and 'dcoord' with the x and y plot coordinates of these links, respectively. The x-positions are 5, 15, 25, ... when a link starts at a leaf, and the corresponding y-position is then 0. So, testing the x-positions (and, to be safe, that the y-position is 0, since an internal node can also land on such an x-value) leads back from the link to the corresponding leaf, and allows assigning the color of the link to that point.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    
    # DATA EXAMPLE
    x = np.random.uniform(0, 10, (20, 2))
    
    # DENDROGRAM
    plt.figure()
    plt.subplot(121)
    z = linkage(x, 'single')
    d = dendrogram(z)
    plt.yticks([])
    
    # COLORED PLOT
    plt.subplot(122)
    points = d['leaves']
    colors = ['none'] * len(points)
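    for xs, ys, c in zip(d['icoord'], d['dcoord'], d['color_list']):
        for xi, yi in zip(xs, ys):
            # A leaf end of a link sits at height 0 with x in 5, 15, 25, ...
            # The height test skips internal nodes whose x is also 5 modulo 10
            # (e.g. the midpoint of 20 and 50 is 35).
            if yi == 0 and xi % 10 == 5:
                colors[(int(xi) - 5) // 10] = c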
    for point, color in zip(points, colors):
        plt.plot(x[point, 0], x[point, 1], 'o', color=color)
        plt.text(x[point, 0], x[point, 1], f' {point}')
    plt.show()
    

    example plot
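
The same idea can be wrapped in a small helper. The sketch below is only an illustration: `leaf_colors` and `color_of_point` are hypothetical names, not part of SciPy. It also assumes that more recent SciPy releases (1.7 and later, if I recall correctly) add a 'leaves_color_list' key to the dictionary, and uses that key when present before falling back to the coordinate test.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram


def leaf_colors(d):
    """Return one color per entry of d['leaves'], in the same order.

    Hypothetical helper, not a SciPy function.
    """
    # Assumption: newer SciPy versions provide the leaf colors directly.
    if 'leaves_color_list' in d:
        return list(d['leaves_color_list'])
    # Fallback: infer each leaf's color from the link that touches it.
    colors = [None] * len(d['leaves'])
    for xs, ys, c in zip(d['icoord'], d['dcoord'], d['color_list']):
        for xi, yi in zip(xs, ys):
            # Leaf ends of a link sit at height 0 with x = 5, 15, 25, ...
            if yi == 0 and xi % 10 == 5:
                colors[(int(xi) - 5) // 10] = c
    return colors


rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (20, 2))
z = linkage(x, 'single')
d = dendrogram(z, no_plot=True)  # no_plot=True only computes the dictionary
colors = leaf_colors(d)

# Reorder so that color_of_point[i] is the color of observation x[i].
color_of_point = [None] * len(colors)
for leaf, color in zip(d['leaves'], colors):
    color_of_point[leaf] = color
```

`color_of_point[i]` can then be used as the color when plotting `x[i]`, as in the loop above. Note that the fallback assumes no two observations are merged at distance 0, because then an internal node would also sit at height 0.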

    PS: This post about matching points with their clusters might also be relevant.
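
One possible way to match points with their clusters, which may or may not be what the linked post does, is `fcluster`. The sketch below assumes the dendrogram was drawn with the default `color_threshold`, which the SciPy documentation gives as 0.7 times the largest merge distance, and cuts the tree at that same height. The variable names are only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (20, 2))
z = linkage(x, 'single')

# Default dendrogram color threshold: 0.7 * largest merge distance in z.
threshold = 0.7 * z[:, 2].max()

# labels[i] is the flat cluster (numbered from 1) of observation x[i].
labels = fcluster(z, t=threshold, criterion='distance')
```

The labels are integers rather than color names, so they still need to be mapped to colors (for instance through a colormap). They also do not have to agree exactly with the dendrogram colors: a leaf that joins the tree only above the threshold gets its own label here, while the dendrogram draws it in the above-threshold color.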