I would like to know if it's possible to do hierarchical clustering on samples of different sizes in Python. More precisely, with Ward's minimum variance method.
For instance, I have 5 lists of integers, A, B, C, D, E, of different lengths. What I want to do is group these 5 lists into 3 groups according to Ward's method (based on the change in variance for the clusters being merged).
Does anyone know how to do this?
We can consider these 5 lists to be the samples you want to cluster into 3 groups. Hierarchical clustering, as you may know, can take a distance matrix as input. A distance matrix holds some sort of pairwise distance (or dissimilarity) between your samples.
You have to construct this 5x5 matrix by choosing a meaningful distance function. This greatly depends on what your samples/integers represent. Because your samples do not all have the same length, you can't directly compute metrics such as Euclidean distance on them.
For example, if the integers in your lists can be interpreted as classes, you could compute the Jaccard index to express some sort of dissimilarity.
[1 2 3 4 5] and [1 3 4] share 3 of the 5 distinct values, so they have a Jaccard similarity index of 3/5 (or a dissimilarity of 2/5).
For the similarity index, 0 means the two sets have nothing in common and 1 means they are identical.
https://en.wikipedia.org/wiki/Jaccard_index
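As a minimal sketch of that computation, assuming each list is treated as a set (so duplicates and ordering are ignored), the helper below returns the Jaccard dissimilarity. The function name jaccard_dissimilarity is made up for illustration, and returning 0 for two empty lists is an arbitrary convention.

```python
def jaccard_dissimilarity(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty samples count as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# The example from the text: similarity 3/5, so dissimilarity 2/5.
d = jaccard_dissimilarity([1, 2, 3, 4, 5], [1, 3, 4])
print(d)
```

The lists can have any lengths here, which is what makes this kind of measure usable when Euclidean distance is not.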
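Once your dissimilarity matrix is computed (in fact it contains only 5 choose 2 = 10 distinct values, as the matrix is symmetric with a zero diagonal), you can apply hierarchical clustering to it. Keep in mind that Ward's criterion is strictly defined for Euclidean distances, so with a dissimilarity such as Jaccard a linkage like average or complete is the safer choice.
The important part is finding a distance function that suits your problem.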
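Here is a rough sketch of the whole pipeline with SciPy, under several assumptions: the five lists are made-up data, Jaccard is used only as an example dissimilarity, and the linkage is "average" rather than "ward", because SciPy's documentation states that the ward method is correctly defined only for Euclidean distances. Passing method="ward" would probably still run on this matrix, but the result would not be a proper minimum variance clustering.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up samples of different lengths, for illustration only.
samples = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, 3, 4],
    "C": [7, 8, 9, 10],
    "D": [8, 9],
    "E": [20, 21, 22, 23, 24, 25],
}


def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B|, treating each list as a set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return 1.0 - len(set_a & set_b) / len(union) if union else 0.0


names = list(samples)
n = len(names)

# Fill the symmetric 5x5 matrix from the 10 unique pairs.
dist = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    d = jaccard_dissimilarity(samples[names[i]], samples[names[j]])
    dist[i, j] = dist[j, i] = d

# linkage() expects the condensed form (upper triangle as a 1-D vector).
condensed = squareform(dist)

# Average linkage on the precomputed dissimilarities.
Z = linkage(condensed, method="average")

# Cut the tree so that at most 3 flat clusters are formed.
labels = fcluster(Z, t=3, criterion="maxclust")

for name, label in zip(names, labels):
    print(name, label)
```

If your data can instead be summarised as fixed-length feature vectors (for example the mean and variance of each list), Euclidean distance becomes available and method="ward" can be used as intended.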