r list subset

Subsetting lists in R to get longest list


I have a list whose items have varying numbers of elements, and many of the items are subsets of each other. I would like to remove every item whose complete set of elements exists in another item.

So for instance for:


listy<-list("item1"=c(10,210,300,400,500,600), "item2"=c(10,210,300), "item3"=c(500,600), "item4"=c(210, 300), "item5"=c(700,800,900))

Items 2, 3, and 4 are subsets of item1, so my desired outcome is:

listy2<-list("item1"=c(10,210,300,400,500,600), "item5"=c(700,800,900))

So far I have tried converting it to a tibble, sorting by the first column, and then removing duplicates of the first column. But this is super inefficient, and it only removes items whose first column matches, not ones that match on later columns (e.g. item3 and item4 vs. item1 here). Alternatively, I could loop over the list and run an all-vs-all grepl, building a search string from each item. Something like this, though it doesn't actually work because of the search string:

for(ity in 1:length(listy)){
    if(grepl(paste(unlist(listy[[ity]]), sep = "|"), listy[[c(1:3)[-ity]]])){ 
print(ity)
}}

But again, this would be super inefficient (my actual lists have 100,000 items or more, with up to 20,000 elements each). I am sure there is some super simple function I am missing, and any help would be greatly appreciated.


Solution

  • idx <- sapply(seq_along(listy), \(i) any(sapply(listy[-i], \(j) all(listy[[i]] %in% j))))
    listy[!idx]
    # $item1
    # [1]  10 210 300 400 500 600
    # 
    # $item5
    # [1] 700 800 900
    

    How it works

    This is essentially a double loop. The outer sapply iterates over an index (i) of all the items in your list. The inner sapply iterates over every item except the one at index i and checks whether all elements of item i appear in it. You need to exclude index i; otherwise, every item would match itself.

    Here is a version of what is happening to help you visualize:

    for (i in seq_along(listy)) {
      for (j in listy[-i]) {
        lgl <- all(listy[[i]] %in% j)
        cat("All c(", toString(listy[[i]]), ") in c(", toString(j), ")? --> ", lgl, "\n", sep = "")
      }
    }
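
One caveat with the all-vs-all check: if two items contain exactly the same elements, each counts as a subset of the other, so both get dropped. Below is a minimal sketch of a variant that keeps one copy in that case and only compares each item against the items already kept. The function name drop_subsets is hypothetical (it is not part of base R or of the answer above), and it assumes items should be compared as sets, ignoring order and repeated values. It has not been benchmarked against lists of the size mentioned in the question.

```r
# Hypothetical helper (an illustration, not from the original answer):
# drop every item whose elements all appear in an item that is kept.
drop_subsets <- function(x) {
  # Count distinct values per item so repeated values do not inflate the size.
  n_unique <- lengths(lapply(x, unique))
  # Visit the largest items first: a superset is always seen before
  # (or, for identical sets, alongside) its subsets.
  ord <- order(n_unique, decreasing = TRUE)
  keep <- integer(0)
  for (i in ord) {
    is_subset <- FALSE
    for (k in keep) {
      if (all(x[[i]] %in% x[[k]])) {
        is_subset <- TRUE
        break
      }
    }
    if (!is_subset) keep <- c(keep, i)
  }
  # Return the surviving items in their original order.
  x[sort(keep)]
}

listy <- list("item1" = c(10, 210, 300, 400, 500, 600),
              "item2" = c(10, 210, 300),
              "item3" = c(500, 600),
              "item4" = c(210, 300),
              "item5" = c(700, 800, 900))

result <- drop_subsets(listy)
names(result)
# [1] "item1" "item5"
```

Checking only against kept items is enough because subset relations are transitive: if an item sits inside a dropped item, it also sits inside whatever kept item caused that drop.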