r tidyverse

R: Assign column value based on range of values found in columns in another dataframe


For simplicity, in dataframe 1 I have 3 genes (a, b, and c) and their positions in the genome. For instance, gene "a" starts at position 1 (min) and ends at 2 (max). In dataframe 2, I have the position of each mutation, which may fall within a gene (between df1$min and df1$max) and impact it:

df1 = data.frame("gene" = c("a","b","c"), "min" = c(1,3,5), "max"=c(2,4,6))
df2 = data.frame("position" = c(1.5,3.5,5.5),"impact" = c("low","low","high"))

I would like to make a dataframe which shows the mutation position, the gene it is in, and its impact. Like so:

position  gene     impact
1.5       a        low
3.5       b        low
5.5       c        high

Thank you.


Solution

  • Here is a base R option

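    transform(
      df2,
      gene = with(
        df1,
        {
          # d[i, j] is TRUE when position i lies within [min, max] of gene j
          d <- outer(position, min, ">=") & outer(position, max, "<=")
          # row sums of the matching column indices: 0 means no gene matched,
          # so the index 1 + 0 selects the leading NA
          c(NA, gene)[1 + rowSums(d * col(d))]
        }
      )
    )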
    

    which gives

      position impact gene
    1      1.5    low    a
    2      3.5    low    b
    3      5.5   high    c
    

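    where c(NA, gene)[1 + rowSums(d * col(d))] returns NA when no matching gene is found: rowSums(d * col(d)) is 0 for such a row, so the index 1 selects the leading NA. This assumes the gene ranges do not overlap; a position inside two genes would have their column indices summed.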
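
To illustrate the no-match case, here is a minimal sketch that reuses the same approach on slightly altered data. The position 2.5 is a hypothetical value added for illustration (it is not in the question) and lies between genes a and b; stringsAsFactors = FALSE is set explicitly on the assumption that the code may also be run on R versions older than 4.0.

```r
df1 <- data.frame(gene = c("a", "b", "c"), min = c(1, 3, 5), max = c(2, 4, 6),
                  stringsAsFactors = FALSE)
# Hypothetical data: 2.5 falls in the gap between gene a (1-2) and gene b (3-4)
df2 <- data.frame(position = c(1.5, 2.5, 5.5), impact = c("low", "low", "high"),
                  stringsAsFactors = FALSE)

res <- transform(
  df2,
  gene = with(df1, {
    # logical matrix: rows are positions, columns are genes
    d <- outer(position, min, ">=") & outer(position, max, "<=")
    # index is 1 (the leading NA) when a row has no TRUE entry
    c(NA, gene)[1 + rowSums(d * col(d))]
  })
)
print(res)
```

Here the gene column comes out as a, NA, c, so the unmatched mutation is kept in the result rather than dropped.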