I am a beginner in clustering, and I have a binary matrix that records which sessions each student is enrolled in. I want to cluster students who have the same sessions.
There are many clustering methods, and the right choice depends on the dataset.
For example, k-means is not appropriate because the data is binary, and the standard "mean" operation does not make much sense for binary data.
I'm open to any suggestions.
Here's an example:
+----------+----------+----------+----------+
| session1 | session2 | session3 | session4 |
+----------+----------+----------+----------+
| 1        | 0        | 1        | 0        |
| 0        | 1        | 0        | 1        |
| 1        | 0        | 1        | 0        |
| 0        | 1        | 0        | 1        |
+----------+----------+----------+----------+
Expected result:
clusterA = [user1,user3]
clusterB = [user2,user4]
You could use the Jaccard distance for each pair of points.
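To make the distance concrete, here is a minimal sketch in R of what the Jaccard distance measures for two students: one minus the number of shared sessions divided by the number of sessions either student takes. The helper `jaccard_dist` and the extra student `u5` are hypothetical names introduced only for this illustration, and treating two students with no sessions at all as distance 0 is an assumption.

```r
# Hypothetical helper (not a base R function): Jaccard distance
# between two logical enrollment vectors.
jaccard_dist <- function(a, b) {
  union_size <- sum(a | b)            # sessions taken by at least one student
  if (union_size == 0) return(0)      # assumption: two empty rows count as identical
  1 - sum(a & b) / union_size         # 1 - shared / union
}

u1 <- c(TRUE, FALSE, TRUE, FALSE)     # user1 from the question
u2 <- c(FALSE, TRUE, FALSE, TRUE)     # user2 from the question
u5 <- c(TRUE, FALSE, TRUE, TRUE)      # hypothetical student, partial overlap with user1

d12 <- jaccard_dist(u1, u2)           # 1: no shared sessions
d11 <- jaccard_dist(u1, u1)           # 0: identical sessions
d15 <- jaccard_dist(u1, u5)           # 1 - 2/3: two shared sessions out of three
```

The partial-overlap pair shows that the distance is fractional in general, which the toy data in the question never exercises.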
In R:
# create the data table
mat <- data.frame(s1 = c(TRUE, FALSE, TRUE, FALSE), s2 = c(FALSE, TRUE, FALSE, TRUE),
                  s3 = c(TRUE, FALSE, TRUE, FALSE), s4 = c(FALSE, TRUE, FALSE, TRUE))
Result:
s1 s2 s3 s4
1 TRUE FALSE TRUE FALSE
2 FALSE TRUE FALSE TRUE
3 TRUE FALSE TRUE FALSE
4 FALSE TRUE FALSE TRUE
dist(mat, method="binary") # jaccard distance
Result:
1 2 3
2 1
3 0 1
4 1 0 1
Row 3 has a distance of 1 from row 4 and a distance of 0 from row 1. The distances happen to be exactly 0 and 1 here only because the toy dataset is so simple; in general they are floating-point values between 0 and 1.
Cluster them:
hclust(dist(mat, method="binary"))
Result (not very informative):
Call:
hclust(d = dist(mat, method = "binary"))
Cluster method : complete
Distance : binary
Number of objects: 4
Create a dendrogram plot:
plot(hclust(dist(mat, method="binary")))
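The dendrogram only visualizes the tree. To get actual cluster assignments like the ones asked for, one option is `cutree`, sketched below. Choosing `k = 2` is an assumption that fits this toy data; on real data you would pick `k` (or a cut height via the `h` argument) after inspecting the dendrogram. The `user1` ... `user4` labels are built here only to mirror the names in the question.

```r
# Rebuild the example data so this snippet runs on its own
mat <- data.frame(s1 = c(TRUE, FALSE, TRUE, FALSE), s2 = c(FALSE, TRUE, FALSE, TRUE),
                  s3 = c(TRUE, FALSE, TRUE, FALSE), s4 = c(FALSE, TRUE, FALSE, TRUE))

# Hierarchical clustering on the Jaccard ("binary") distances
hc <- hclust(dist(mat, method = "binary"))

# Cut the tree into k groups; k = 2 is an assumption for this toy data
groups <- cutree(hc, k = 2)

# Collect the user labels belonging to each cluster
users <- paste0("user", seq_along(groups))
clusters <- split(users, groups)
print(clusters)
```

With this data, the first group contains user1 and user3 and the second contains user2 and user4.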