I am working in Python. I have a binary DataFrame that holds a set of 0 and 1 values for different users at different times.
I can perform hierarchical clustering directly on the DataFrame like this:
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

metodo = 'average'
clusters = linkage(user_df, method=metodo, metric='hamming')
# Create a dendrogram
plt.figure(figsize=(10, 7))
dendrogram(clusters, labels=user_df.index, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('User')
plt.ylabel('Distance')
# Save the figure
plt.savefig(f'dendrogram_{metodo}_entero.png')
plt.show()
However, I want to separate the distance-matrix calculation from the clustering. To do that, I computed the distance matrix first and then passed it to the clustering function:
import pandas as pd
from scipy.spatial.distance import pdist, squareform

dist_matrix = pdist(user_df.values, metric='hamming')
# Convert the distance matrix to a square form
dist_matrix_square = squareform(dist_matrix)
# Create a DataFrame from the distance matrix
dist_df = pd.DataFrame(dist_matrix_square, index=user_df.index, columns=user_df.index)
clusters = linkage(dist_df, method=metodo)
Unfortunately, the two approaches give different results. As far as I know, the first one is correct.
So I don't know whether I can compute the distance matrix separately and then pass it to the clustering step.
pdist returns a NumPy array that is the condensed distance matrix. You can pass this form of the distance matrix directly to linkage. Don't convert it to a Pandas DataFrame.
So your code could be as simple as:
dist_matrix = pdist(user_df.values, metric='hamming')
clusters = linkage(dist_matrix, method=metodo)
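The reason the square form gives different results is that linkage treats any 2-D input as a table of observations and computes Euclidean distances between its rows by default, so the distances end up being measured between rows of the distance matrix rather than between users. The sketch below illustrates this on a small random binary table. The table is a hypothetical stand-in for user_df (its size, seed and index labels are assumptions), so treat it as a check of the behaviour rather than a reproduction of your data. It also shows that squareform can turn an existing square matrix back into the condensed form.

```python
import warnings

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for user_df: 6 users x 8 binary time slots.
rng = np.random.default_rng(0)
user_df = pd.DataFrame(
    rng.integers(0, 2, size=(6, 8)),
    index=[f"user_{i}" for i in range(6)],
)
metodo = "average"

# 1) Clustering directly on the observations.
direct = linkage(user_df.values, method=metodo, metric="hamming")

# 2) Clustering on the condensed distance matrix (1-D, length n*(n-1)/2).
condensed = pdist(user_df.values, metric="hamming")
from_condensed = linkage(condensed, method=metodo)
print(np.allclose(direct, from_condensed))

# 3) Clustering on the square matrix: linkage sees a 2-D array and treats
#    each row as an observation, using Euclidean distance by default.
square = squareform(condensed)
dist_df = pd.DataFrame(square, index=user_df.index, columns=user_df.index)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # SciPy may warn that this looks like a distance matrix
    from_square = linkage(dist_df.values, method=metodo)

# Same as clustering the rows of the square matrix with Euclidean distances.
rows_as_points = linkage(pdist(square, metric="euclidean"), method=metodo)
print(np.allclose(from_square, rows_as_points))

# If only the square DataFrame is available, convert it back before clustering.
recovered = squareform(dist_df.values)
print(np.allclose(recovered, condensed))
```

If you want to keep the labelled square DataFrame for inspection, you can do so and still pass squareform(dist_df.values) to linkage.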