I have a data set as follows: (i took a simple example but the real data set is much bigger)
V1 V2 V3 V4
1 1 0 0 1
2 0 1 1 0
3 0 0 1 0
4 1 1 1 1
5 0 1 1 0
6 1 0 0 1
7 0 0 0 1
8 0 1 1 1
9 1 0 1 0
10 0 1 1 0
...
where V1, V2,V3...Vn are items and 1,2,3,4...1000 are transactions. I want to partition these items into k clusters such that in each cluster i have the items that appear the most frequently together in the same transactions. To determine the number times each couple of items appear together i tried crosstable, i got the following results:
V1 V2 V3 V4
V1 4 1 2 3
V2 1 5 5 2
V3 2 5 7 2
V4 3 2 2 5
For this small example if i want to create 2 clusters (k=2) such that a cluster must contain 2 items (to maintain the balance between clusters), i will get:
Cluster1={V1,V4}
Cluster2={V2,V3}
because:
1) V1 appears more frequently with V4 (V1,V4)=3 > (V1,V3) > (V1,V2) and same for V4.
2) V2 appears more frequently with V2 (V2,V3)=5 > (V2,V4) > (V2, V1) and same for V3.
How can i do this partition with R and for a bigger set of data ?
I think you are asking about clustering. It is not quite the same as what you are doing above, but you could use hclust
to look for similarity between variables with a suitable distance measure.
For example
plot(hclust(dist(t(df),method="binary")))
produces the following...
You should look at ?dist
to see if this distance measure is meaningful in your context, and ?hclust
for other things you can do once you have your dendrogram (such as identifying clusters).
Or you could use your crosstab as a distance matrix (perhaps take the reciprocal of the values, and then as.dist
).