r list algorithm cluster-analysis reduce

Collapse List into Unique Combinations Using Index Values

Firstly, I wish I could think of a better way to describe my problem. I'm trying to take the following example list:

lst<- list(c(2), c(1,3), c(2), c(4), c(6), c(5, 7), c(6,8), c(7))

And turn it into this:

res<- list(c(1,2,3), c(1,2,3), c(1,2,3), c(4), c(5,6,7,8), c(5,6,7,8),c(5,6,7,8), c(5,6,7,8))

OR even this:

also_good_answer<- list(c(1, 2, 3), logical(0), logical(0), 4, c(5, 6, 7, 8), logical(0), 
    logical(0), logical(0))

I've been using a couple while statements in combination with setdiff to do this, but I'm wondering if there is an eloquent and faster way to do this over a large list?

As always, thank you in advance. -nate

My method:

lst<- list(c(2), c(1,3), c(2), c(4), c(6), c(5, 7), c(6,8), c(7)) # Original List
to_sequence<- 1:length(lst) 
res<- lapply(1:length(lst), function(x) {return(NA)})  # Building Result Object

while(length(to_sequence) > 0){
  tres<- c()
  idx<- to_sequence[1]
  next_idxs<- setdiff(lst[[idx]], NA)
  tres<- c(tres, next_idxs)
  while(length(next_idxs) >=1){
    next_idxs<- lapply(seq_along(next_idxs), function(x){
      lst[[next_idxs[x]]]
    } ) %>% unlist() %>% setdiff(., unlist(tres)) # uses slow setdiff
    tres<-c(tres, next_idxs)
  }
  res[[idx]]<- tres
  to_sequence<- setdiff(to_sequence, tres) # Another slow setdiff
  cat("Length to_sequence:", base::prettyNum(length(to_sequence),big.mark = ","), "\n") 
}

res<- lapply(res, sort)

EDITS
Updated lst: lst<- list(c(2), c(1,3), c(2,8), c(4), c(6), c(5, 7), c(6,9), c(3),c(7))

Giving Desired Output: list(c(1, 2, 3, 8), logical(0), logical(0), 4, c(5, 6, 7, 9), logical(0), logical(0), logical(0), logical(0))

Solution

Update

According to the updated example and expected output, we use replace and duplicated to nullify the cluster duplicates (see the last line out <- ...)

library(igraph)
g <- graph_from_data_frame(stack(setNames(lst, seq_along(lst))))
clst <- membership(components(g))
nms <- as.integer(names(clst))
res <- unname(by(nms, clst, sort))[clst[order(nms)]]
out <- replace(res, duplicated(res), list(logical(0)))

and we can obtain

> out
[[1]]
[1] 1 2 3 8

[[2]]
logical(0)

[[3]]
logical(0)

[[4]]
[1] 4

[[5]]
[1] 5 6 7 9

[[6]]
logical(0)

[[7]]
logical(0)

[[8]]
logical(0)

[[9]]
logical(0)

Previous Answer

You can use membership from igraph to detect the clusters

library(igraph)
g <- graph_from_data_frame(stack(setNames(lst, seq_along(lst))))
clst <- membership(components(g))
nms <- as.integer(names(clst))
res <- unname(by(nms, clst, sort))[clst[order(nms)]]

and you will obtain

> res
[[1]]
[1] 1 2 3

[[2]]
[1] 1 2 3

[[3]]
[1] 1 2 3

[[4]]
[1] 4

[[5]]
[1] 5 6 7 8

[[6]]
[1] 5 6 7 8

[[7]]
[1] 5 6 7 8

[[8]]
[1] 5 6 7 8

Graph Interpretation

If you are interested in how the problem is interpreted into a graph, you can use plot(as.undirected(g)), which shows