I've got a documentTermMatrix that looks as follows:
      artikel naam product personeel loon verlof
doc 1       1    1       2         1    0      0
doc 2       1    1       1         0    0      0
doc 3       0    0       1         1    2      1
doc 4       0    0       0         1    1      1
In the package tm, it's possible to calculate the Hamming distance between two documents. But now I want to cluster all the documents that have a Hamming distance smaller than 3.
Here I would like cluster 1 to contain documents 1 and 2, and cluster 2 to contain documents 3 and 4. Is that possible?
I saved your table to myData:
myData
     artikel naam product personeel loon verlof
doc1       1    1       2         1    0      0
doc2       1    1       1         0    0      0
doc3       0    0       1         1    2      1
doc4       0    0       0         1    1      1
Then I used the hamming.distance() function from the e1071 library. You can also use your own distances, as long as they are in matrix form:
library(e1071)
distMat <- hamming.distance(myData)
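If you would rather supply your own distance, as mentioned above, a minimal base-R sketch might look like the following. It assumes the table is held in a numeric matrix with the document names as row names, and it counts a position as different whenever the raw counts differ (so 2 versus 1 is a difference). `hammingMatrix` is a hypothetical helper name, not a function from any package.

```r
# Rebuild the example table as a numeric matrix (assumed layout: one row per document).
myData <- matrix(
  c(1, 1, 2, 1, 0, 0,
    1, 1, 1, 0, 0, 0,
    0, 0, 1, 1, 2, 1,
    0, 0, 0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("artikel", "naam", "product", "personeel", "loon", "verlof"))
)

# Hypothetical helper: pairwise Hamming distance between the rows of a matrix,
# i.e. the number of columns in which two rows differ.
hammingMatrix <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(x[i, ] != x[j, ])
    }
  }
  d
}

distMat <- hammingMatrix(myData)
distMat
```

The result is a symmetric matrix with zeros on the diagonal, which is the form `as.dist()` expects in the next step.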
Next, I ran hierarchical clustering with the "complete" linkage method, so that the cut height chosen later bounds the maximum distance within a cluster:
dendrogram <- hclust(as.dist(distMat), method="complete")
Select the groups by the maximum distance allowed between points within a group (here, maximum = 5):
groups <- cutree(dendrogram, h=5)
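The cut height of 5 happens to separate the example into two groups, but the question asks for a Hamming distance smaller than 3. Since the distances are integers, one way to express that is to cut at `h = 2`. The sketch below is only an illustration under that assumption: the distance matrix is typed in by hand (counting the columns in which two rows of the table differ), and `maxWithin` is a hypothetical name for a sanity check on each cluster.

```r
# Hamming distances for the four example documents, computed by counting the
# columns in which two rows of the table differ.
docs <- paste0("doc", 1:4)
distMat <- matrix(c(0, 2, 5, 5,
                    2, 0, 5, 6,
                    5, 5, 0, 2,
                    5, 6, 2, 0),
                  nrow = 4, byrow = TRUE, dimnames = list(docs, docs))

dendrogram <- hclust(as.dist(distMat), method = "complete")

# Distances are integers, so "smaller than 3" is the same as "at most 2".
groups <- cutree(dendrogram, h = 2)

# Largest pairwise distance inside each cluster (0 for a single-document cluster).
maxWithin <- sapply(split(names(groups), groups), function(members) {
  max(distMat[members, members])
})
maxWithin
```

With complete linkage, every pair of documents in a cluster is at most the cut height apart, which is why the check on `maxWithin` should never exceed the value passed as `h`.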
Finally plot the results:
plot(dendrogram) # main plot
abline(h=5, col="red", lty=2) # add cutting line
rect.hclust(dendrogram, h=5, border=c(1:length(unique(groups)))+1) # draw rectangles
Another way to see the cluster membership for each document is with table:
table(groups, rownames(myData))
groups doc1 doc2 doc3 doc4
     1    1    1    0    0
     2    0    0    1    1
So the 1st and 2nd documents fall into one group, while the 3rd and 4th fall into another.
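If a plain list of documents per cluster is easier to read than the contingency table, `split()` can group the document names by cluster label. The sketch below assumes `groups` is the named integer vector returned by `cutree()`; it is written out by hand here only so that the snippet runs on its own.

```r
# Cluster labels as returned by cutree() for the example (hand-copied assumption).
groups <- c(doc1 = 1L, doc2 = 1L, doc3 = 2L, doc4 = 2L)

# Group the document names by their cluster label.
clusters <- split(names(groups), groups)
clusters
```

Each element of `clusters` is a character vector of document names, so `clusters[["1"]]` gives the members of the first cluster.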