r cluster-analysis k-means weighting

Weighting k Means Clustering by number of observations


I would like to cluster the following data using k-means in R.

ADP  NS  CNTR  PP2V  EML  PP1V  ADDPS  FB  PP1D  ADR  ISV  PP2D  ADSEM  SUMALL  CONV
  2   0     0     1    0     0      0   0     0   12    0    12      0      53     0
  2   0     0     1    0     0      0   0     0   14    0    25      0      53     0
  2   0     0     1    0     0      0   0     0   15    0     0      0      53     0
  2   0     0     1    0     0      0   0     0   15    0     4      0      53     0
  2   0     0     1    0     0      0   0     0   17    0     0      0      53     0
  2   0     0     1    0     0      0   0     0   18    0     0      0     106     0
  2   0     0     1    0     0      0   0     0   23    0    10      0      53     0
  2   0     0     1    0     0      1   0     0    0    0     1      0     106     0
  2   0     0     1    0     0      3   0     0    0    0     0      0      53     0
  2   0     0     2    0     0      0   0     0    0    0     0      0    3922     0
  2   0     0     2    0     0      0   0     0    0    0     1      0     530     0
  2   0     0     2    0     0      0   0     0    0    0     2      0     954     0
  2   0     0     2    0     0      0   0     0    0    0     3      0     477     0
  2   0     0     2    0     0      0   0     0    0    0     4      0     265     0
  2   0     0     2    0     0      0   0     0    0    0     5      0     742     0
  2   0     0     2    0     0      0   0     0    0    0     6      0     265     0
  2   0     0     2    0     0      0   0     0    0    0     7      0     265     0

The column "SUMALL" is the number of times that a particular combination of variables is observed in the data.

When running k-means, I would like to use this column as a weight for each combination, so that frequent combinations carry more importance and the cluster centers come out as weighted averages.

I can't see a simple way to do this in the standard cluster package. Can anyone advise on whether there is one?


Solution

  • Since SUMALL is the number of times a particular observation occurred, you could create a new dataset where each row is replicated the correct number of times, and then do your clustering with that new dataset.

    Here's a simple example of expanding a dataset by replicating rows. Start with a small data frame:

    df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(7, 8, 9, 9), SUMALL = c(2, 6, 4, 1))
    df  # print the data frame
      a b c SUMALL
    1 1 4 7      2
    2 2 5 8      6
    3 3 6 9      4
    4 4 7 9      1
    

    Then expand df by replicating each row SUMALL times:

    # repeat each row index SUMALL times, then index df with the result
    df_expanded <- df[rep(seq_len(nrow(df)), df$SUMALL), ]
    df_expanded

        a b c SUMALL
    1   1 4 7      2
    1.1 1 4 7      2
    2   2 5 8      6
    2.1 2 5 8      6
    2.2 2 5 8      6
    2.3 2 5 8      6
    2.4 2 5 8      6
    2.5 2 5 8      6
    3   3 6 9      4
    3.1 3 6 9      4
    3.2 3 6 9      4
    3.3 3 6 9      4
    4   4 7 9      1
    

    Then use the expanded data with your favorite clustering method. Drop the SUMALL column first so that it is not treated as a feature.
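
    As a minimal sketch of that last step, assuming base R's kmeans() (from the stats package) and an arbitrary choice of 2 clusters for the toy data above, the whole workflow might look like the following. The names features and first_copy are just illustrative helpers, not part of any package.

```r
# Toy data from the example above; SUMALL is the replication count.
df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7),
                 c = c(7, 8, 9, 9), SUMALL = c(2, 6, 4, 1))

# Replicate each row SUMALL times and keep only the feature columns,
# so SUMALL itself does not influence the distances.
features <- setdiff(names(df), "SUMALL")
row_index <- rep(seq_len(nrow(df)), df$SUMALL)
df_expanded <- df[row_index, features]

set.seed(42)  # k-means starts from randomly chosen centers
fit <- kmeans(df_expanded, centers = 2, nstart = 10)  # 2 is an assumption

# Map cluster labels back to the original rows. Identical rows always
# receive the same label, so the first copy of each row is enough.
first_copy <- !duplicated(row_index)
df$cluster <- unname(fit$cluster[first_copy])

# Each center is the SUMALL-weighted mean of the rows in that cluster.
fit$centers
```

    Because every replicated copy counts as its own observation, the centers are the SUMALL-weighted averages asked for in the question. The expanded data has sum(df$SUMALL) rows, so with very large counts this can get memory-hungry.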