rif-statementdplyrphylogenyape

Identifying list elements contained in another list that are both elements of a data frame


I have two data frames, DF1, DF2, each with two columns (a, b). One column (a) is a unique identifier the other is a column (b) with elements that contain a list. The list contains label names. I would like to search DF2$b elements to see if they are contained inside DF1$b, if so I'd like to create a new column, DF2$c, which takes the identifier in DF1a. The tricky part, is that I'd only like to take the unique identifier if it's the smallest union that exists in the data frame. As some background, this data is from a phylogenetic tree. The DF2 is a subsample of DF1. All tips in DF2 are contained in DF1. I want to compare the nodes of DF2 to DF1 (the node names are different), but I can identify the nodes from the tips that are descendents from each.

It would be easier if I explain with an example:

df1 <- data.frame(a = c(1486, 1485, 1484, 1483, 1482, 1481, 1480, 1479))
df1$b = list(c("KC792204", "KF150733", "KC792205"), c("KC792204", "KF150733", "KC792205", "JX987740", "KX148108", "JX987724"), c("KC792204", "KF150733", "KC792205", "KC791848"), c("KJ201900", "KJ201899", "KF535207"), c("KJ201900", "KJ201899", "KF535207", "AB817119", "AB817100"), c("GU731662", "GU731661", "KP319229", "KY428876"), c("GU731662", "GU731661", "MT826960"), c("GU731662", "GU731661", "MT826960", "AM689535", "GU731663"))

df2 <- data.frame(a = c(8645, 1247, 5879, 1548, 2487, 1245, 1247, 3695))
df2$b = list(c("KC792204", "KF150733"), c("KC792204", "KC792205", "KC791848"), c("KJ201900", "KF535207"), c("KC792204", "JX987740", "KX148108", "JX987724"), c("GU731662", "GU731661", "MT826960", "GU731663"), c("KJ201900", "KJ201899", "AB817119", "AB817100"), c("GU731661", "KP319229", "KY428876"), c("GU731662", "MT826960"))

I'd like to create a new column in df2, df2$c, which identifies the smallest list (or node) in df1 that contains df2$b. This new column is made by df1$a (the unique identifier). In the example, df2$c (in order would be)

c("1486,1484,1483,1485,1479,1482,1481,1480")

To take the first two as an example:

df2$a is c("KC792204", "KF150733")

This can be found in df1$b[1], df1$b[2], df1$b[3], or 1486, 1485, or 1484. Since I am looking for the smallest length list, the result is 1486. 1486 is the smallest length list that contains all labels that are searched. The next list in df2$b is c("KC792204", "KF150733", "KC791848"). This result is 1484, since only list 1484 in df$1b contains those three labels.

I have tried:

df2$c <- ifelse(df2$b %in% df1$b, df1$a, 'other')

But I am instead comparing the lists as a whole rather than the elements inside each list. I also need to find the smallest of the lists that contain the searched labels.


Solution

  • Here is an approach, using data.table, and a helper function

    library(data.table)
    setDT(df1)[, l:=sapply(b,length)]
    f <- function(k) df1[sapply(df1$b,\(i) all(k %chin% i))][l==min(l),a]
    setDT(df2)[, c:=sapply(b,f)]
    

    Output (df2)

           a                                   b     c
       <num>                              <list> <num>
    1:  8645                   KC792204,KF150733  1486
    2:  1247          KC792204,KC792205,KC791848  1484
    3:  5879                   KJ201900,KF535207  1483
    4:  1548 KC792204,JX987740,KX148108,JX987724  1485
    5:  2487 GU731662,GU731661,MT826960,GU731663  1479
    6:  1245 KJ201900,KJ201899,AB817119,AB817100  1482
    7:  1247          GU731661,KP319229,KY428876  1481
    8:  3695                   GU731662,MT826960  1480
    

    Explanation:

    Update Aug 4 2022:

    The OP has asked if the helper function above might be adjusted so that the a value is selected from among the rows in df1 where more than p elements of df2$b are found in df1$b.. Here is one possible adjustment to the helper function:

    f <- function(k, p=1) {
      df1[sapply(df1$b,\(i) mean(k %chin% i)>=p)][l==min(l),a]
    } 
    

    Note that since k %chin% i returns a logical vector, the mean of this is simply the proportion of elements in k (i.e. df2$b) that are in i i.e. df1%b.. sapply() still returns a logical vector, because we then ask if that proportion equals or exceeds p. Note that since the default of p is 1, using this function without specifying p will return the same result as the original helper function.

    To select the a value from among the rows whether 95% of the elements match, call the modified helper function, f, as before, but set p to 0.95

    setDT(df2)[, c:=sapply(b,f, p=0.95)]