rgroupingidentifierlinkage

identify groups of linked episodes which chain together


Take this simple data frame of linked ids:

test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))

> test
  id1 id2
1  10   1
2  10  36
3   1  24
4   1  45
5  24 300
6   8  11

I now want to group together all the ids which link. By 'link', I mean follow through the chain of links so that all ids in one group are labelled together. A kind of branching structure. i.e:

Group 1
10 --> 1,   1 --> (24,45)
                   24 --> 300
                          300 --> NULL
                   45 --> NULL
10 --> 36, 36 --> NULL,
Final group members: 10,1,24,36,45,300

Group 2
8 --> 11
      11 --> NULL
Final group members: 8,11

Now I roughly know the logic I would want, but don't know how I would implement it elegantly. I am thinking of a recursive use of match or %in% to go down each branch, but am truly stumped this time.

The final result I would be chasing is:

result <- data.frame(group=c(1,1,1,1,1,1,2,2),id=c(10,1,24,36,45,300,8,11))

> result
  group  id
1     1  10
2     1   1
3     1  24
4     1  36
5     1  45
6     1 300
7     2   8
8     2  11

Solution

  • The Bioconductor package RBGL (an R interface to the BOOST graph library) contains a function, connectedComp(), which identifies the connected components in a graph -- just what you are wanting.

    (To use the function, you will first need to install the graph and RBGL packages, available here and here.)

    library(RBGL)
    test <- data.frame(id1=c(10,10,1,1,24,8),id2=c(1,36,24,45,300,11))
    
    ## Convert your 'from-to' data to a 'node and edge-list' representation  
    ## used by the 'graph' & 'RBGL' packages 
    g <- ftM2graphNEL(as.matrix(test))
    
    ## Extract the connected components
    cc <- connectedComp(g)
    
    ## Massage results into the format you're after 
    ld <- lapply(seq_along(cc), 
                 function(i) data.frame(group = names(cc)[i], id = cc[[i]]))
    do.call(rbind, ld)
    #   group  id
    # 1     1  10
    # 2     1   1
    # 3     1  24
    # 4     1  36
    # 5     1  45
    # 6     1 300
    # 7     2   8
    # 8     2  11