
What is an appropriate method to cluster a binary matrix?


I am a beginner in clustering, and I have a binary matrix in which each row is a student and each column indicates whether that student is enrolled in a session. I want to cluster students who share the same sessions.

There are many clustering methods, and the right choice depends on the dataset.

For example, k-means is not appropriate because the data is binary, and the standard "mean" operation does not make much sense for binary values.

I'm open to any suggestions.

Here's an example (rows are user1 to user4, in order):

+------------+---------+--------+--------+
|  session1  | session2|session3|session4|
+------------+---------+--------+--------+
|     1      |    0    |   1    |    0   |
|     0      |    1    |   0    |    1   |
|     1      |    0    |   1    |    0   | 
|     0      |    1    |   0    |    1   |
+------------+---------+--------+--------+

Result:

clusterA = [user1,user3]

clusterB = [user2,user4]


Solution

  • You could compute the Jaccard distance between each pair of rows (students) and cluster on that.

    In R:

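    # create the data table (one row per student, one column per session)
    mat <- data.frame(s1 = c(TRUE, FALSE, TRUE, FALSE), s2 = c(FALSE, TRUE, FALSE, TRUE),
                      s3 = c(TRUE, FALSE, TRUE, FALSE), s4 = c(FALSE, TRUE, FALSE, TRUE))
    mat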
    

    Result:

         s1    s2    s3    s4
    1  TRUE FALSE  TRUE FALSE
    2 FALSE  TRUE FALSE  TRUE
    3  TRUE FALSE  TRUE FALSE
    4 FALSE  TRUE FALSE  TRUE
    
    dist(mat, method = "binary")  # method "binary" is the Jaccard distance
    

    Result:

      1 2 3
    2 1    
    3 0 1  
    4 1 0 1
    

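    For example, row 3 has a distance of 1 from row 4 (no sessions in common) and a distance of 0 from row 1 (identical sessions). The distances are all exactly 0 or 1 here only by chance, because the toy dataset is so simple. In general the Jaccard distance is a real number between 0 and 1.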
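
    To see that the distances are not always 0 or 1, here is a small sketch that computes the Jaccard distance by hand and compares it with dist(). The two vectors a and b are made-up students (not from the question), and jaccard_dist is just a helper name I picked for illustration. It assumes at least one of the two vectors contains a TRUE.

    ```r
    # Two hypothetical students who share some, but not all, sessions
    a <- c(TRUE, TRUE, FALSE, TRUE)
    b <- c(TRUE, FALSE, FALSE, TRUE)

    # Jaccard distance = 1 - |intersection| / |union|
    # (assumes the union is not empty, otherwise this divides by zero)
    jaccard_dist <- function(x, y) 1 - sum(x & y) / sum(x | y)

    d_manual  <- jaccard_dist(a, b)   # 2 shared sessions out of 3 in the union
    d_builtin <- as.numeric(dist(rbind(a, b), method = "binary"))

    print(c(manual = d_manual, builtin = d_builtin))
    ```

    Both values should agree, at 1 - 2/3.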

    Cluster them:

    hclust(dist(mat, method="binary"))
    

    Result (not very informative):

    Call:
    hclust(d = dist(mat, method = "binary"))
    
    Cluster method   : complete 
    Distance         : binary 
    Number of objects: 4 
    

    Create a dendrogram plot:

    plot(hclust(dist(mat, method="binary")))
    

    [Figure: dendrogram]
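
    If you want the actual cluster memberships (like clusterA and clusterB in the question) rather than a plot, one option is to cut the tree with cutree(). This is only a sketch: the row names user1 to user4 are an assumption borrowed from the question's expected result, and k = 2 is chosen by eye for this toy example.

    ```r
    # Same toy data as above
    mat <- data.frame(s1 = c(TRUE, FALSE, TRUE, FALSE), s2 = c(FALSE, TRUE, FALSE, TRUE),
                      s3 = c(TRUE, FALSE, TRUE, FALSE), s4 = c(FALSE, TRUE, FALSE, TRUE))
    rownames(mat) <- paste0("user", 1:4)  # assumed labels, in row order

    # Hierarchical clustering on the Jaccard distances
    hc <- hclust(dist(mat, method = "binary"))

    # Cut the tree into k groups; returns a named vector of cluster numbers
    groups <- cutree(hc, k = 2)

    # Collect the member names of each cluster into a list
    clusters <- split(names(groups), groups)
    print(clusters)
    ```

    For the toy data this puts user1 and user3 in one group and user2 and user4 in the other. On real data you would probably not know k in advance; cutting at a height instead, e.g. cutree(hc, h = 0.5), groups together students whose distance stays below that threshold.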