rlistalgorithmcluster-analysisreduce

Collapse List into Unique Combinations Using Index Values


Firstly, I wish I could think of a better way to describe my problem. I'm trying to take the following example list:

lst<- list(c(2), c(1,3), c(2), c(4), c(6), c(5, 7), c(6,8), c(7))

And turn it into this:

res<- list(c(1,2,3), c(1,2,3), c(1,2,3), c(4), c(5,6,7,8), c(5,6,7,8),c(5,6,7,8), c(5,6,7,8))

OR even this:

also_good_answer<- list(c(1, 2, 3), logical(0), logical(0), 4, c(5, 6, 7, 8), logical(0), 
    logical(0), logical(0))

I've been using a couple while statements in combination with setdiff to do this, but I'm wondering if there is an eloquent and faster way to do this over a large list?

As always, thank you in advance. -nate

My method:

lst<- list(c(2), c(1,3), c(2), c(4), c(6), c(5, 7), c(6,8), c(7)) # Original List
to_sequence<- 1:length(lst) 
res<- lapply(1:length(lst), function(x) {return(NA)})  # Building Result Object

while(length(to_sequence) > 0){
  tres<- c()
  idx<- to_sequence[1]
  next_idxs<- setdiff(lst[[idx]], NA)
  tres<- c(tres, next_idxs)
  while(length(next_idxs) >=1){
    next_idxs<- lapply(seq_along(next_idxs), function(x){
      lst[[next_idxs[x]]]
    } ) %>% unlist() %>% setdiff(., unlist(tres)) # uses slow setdiff
    tres<-c(tres, next_idxs)
  }
  res[[idx]]<- tres
  to_sequence<- setdiff(to_sequence, tres) # Another slow setdiff
  cat("Length to_sequence:", base::prettyNum(length(to_sequence),big.mark = ","), "\n") 
}

res<- lapply(res, sort)

EDITS
Updated lst: lst<- list(c(2), c(1,3), c(2,8), c(4), c(6), c(5, 7), c(6,9), c(3),c(7))

Giving Desired Output: list(c(1, 2, 3, 8), logical(0), logical(0), 4, c(5, 6, 7, 9), logical(0), logical(0), logical(0), logical(0))


Solution

  • Update

    According to the updated example and expected output, we use replace and duplicated to nullify the cluster duplicates (see the last line out <- ...)

    library(igraph)
    g <- graph_from_data_frame(stack(setNames(lst, seq_along(lst))))
    clst <- membership(components(g))
    nms <- as.integer(names(clst))
    res <- unname(by(nms, clst, sort))[clst[order(nms)]]
    out <- replace(res, duplicated(res), list(logical(0)))
    

    and we can obtain

    > out
    [[1]]
    [1] 1 2 3 8
    
    [[2]]
    logical(0)
    
    [[3]]
    logical(0)
    
    [[4]]
    [1] 4
    
    [[5]]
    [1] 5 6 7 9
    
    [[6]]
    logical(0)
    
    [[7]]
    logical(0)
    
    [[8]]
    logical(0)
    
    [[9]]
    logical(0)
    

    Previous Answer

    You can use membership from igraph to detect the clusters

    library(igraph)
    g <- graph_from_data_frame(stack(setNames(lst, seq_along(lst))))
    clst <- membership(components(g))
    nms <- as.integer(names(clst))
    res <- unname(by(nms, clst, sort))[clst[order(nms)]]
    

    and you will obtain

    > res
    [[1]]
    [1] 1 2 3
    
    [[2]]
    [1] 1 2 3
    
    [[3]]
    [1] 1 2 3
    
    [[4]]
    [1] 4
    
    [[5]]
    [1] 5 6 7 8
    
    [[6]]
    [1] 5 6 7 8
    
    [[7]]
    [1] 5 6 7 8
    
    [[8]]
    [1] 5 6 7 8
    

    Graph Interpretation

    If you are interested in how the problem is interpreted into a graph, you can use plot(as.undirected(g)), which shows

    enter image description here