I would like to cluster some data using k-means in R. The data looks as follows.
ADP NS CNTR PP2V EML PP1V ADDPS FB PP1D ADR ISV PP2D ADSEM SUMALL CONV
2 0 0 1 0 0 0 0 0 12 0 12 0 53 0
2 0 0 1 0 0 0 0 0 14 0 25 0 53 0
2 0 0 1 0 0 0 0 0 15 0 0 0 53 0
2 0 0 1 0 0 0 0 0 15 0 4 0 53 0
2 0 0 1 0 0 0 0 0 17 0 0 0 53 0
2 0 0 1 0 0 0 0 0 18 0 0 0 106 0
2 0 0 1 0 0 0 0 0 23 0 10 0 53 0
2 0 0 1 0 0 1 0 0 0 0 1 0 106 0
2 0 0 1 0 0 3 0 0 0 0 0 0 53 0
2 0 0 2 0 0 0 0 0 0 0 0 0 3922 0
2 0 0 2 0 0 0 0 0 0 0 1 0 530 0
2 0 0 2 0 0 0 0 0 0 0 2 0 954 0
2 0 0 2 0 0 0 0 0 0 0 3 0 477 0
2 0 0 2 0 0 0 0 0 0 0 4 0 265 0
2 0 0 2 0 0 0 0 0 0 0 5 0 742 0
2 0 0 2 0 0 0 0 0 0 0 6 0 265 0
2 0 0 2 0 0 0 0 0 0 0 7 0 265 0
The column "SUMALL" is the number of times that a particular combination of variables is observed in the data.
When running k-means, I would like to use this column as a weight for each combination, so that frequent combinations carry more importance and the cluster centers come out as weighted averages.
I can't see a simple way to do this in the standard cluster package. Can anyone advise on whether there is one?
Since SUMALL is the number of times a particular observation occurred, you could create a new dataset where each row is replicated the correct number of times, and then do your clustering with that new dataset.
Here's a simple example of expanding a dataset by replicating rows:
df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(7, 8, 9, 9), SUMALL = c(2, 6, 4, 1))
a b c SUMALL
1 1 4 7 2
2 2 5 8 6
3 3 6 9 4
4 4 7 9 1
Then we need to expand df by replicating rows according to SUMALL:
df_expanded <- df[rep(seq_len(nrow(df)), df$SUMALL), ]
a b c SUMALL
1 1 4 7 2
1.1 1 4 7 2
2 2 5 8 6
2.1 2 5 8 6
2.2 2 5 8 6
2.3 2 5 8 6
2.4 2 5 8 6
2.5 2 5 8 6
3 3 6 9 4
3.1 3 6 9 4
3.2 3 6 9 4
3.3 3 6 9 4
4 4 7 9 1
Then use df_expanded with your favorite clustering method, leaving out the SUMALL column, since it is a count rather than a feature.
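As a rough sketch of that last step, here is one way the expanded data could be fed to base R's kmeans(). The choice of two clusters, the seed, nstart = 10 and the names features and fit are assumptions made for illustration, not something taken from the question.

```r
# Toy data from the example above; SUMALL is the observation count per row.
df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), c = c(7, 8, 9, 9),
                 SUMALL = c(2, 6, 4, 1))

# Replicate each row SUMALL times.
df_expanded <- df[rep(seq_len(nrow(df)), df$SUMALL), ]

# Cluster on the feature columns only; SUMALL is a count, not a feature.
features <- df_expanded[, c("a", "b", "c")]

set.seed(42)  # k-means starts from randomly chosen centers
fit <- kmeans(features, centers = 2, nstart = 10)  # 2 clusters is an arbitrary choice

# Each center is a plain mean over the replicated rows, which is the same
# as a SUMALL-weighted mean over the original rows.
fit$centers

# Map the cluster labels back onto the original (unexpanded) rows.
df$cluster <- fit$cluster[match(rownames(df), rownames(df_expanded))]
df
```

Because every copy of a row has identical coordinates, all copies land in the same cluster, so the label of the first copy is enough to label the original row. Note that the expanded data frame has sum(SUMALL) rows, so with counts in the thousands it can get much larger than the original.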