rdataframedata.tableintervalsiranges

Getting mean of multiple rows based on interval dataframe in R


Let's say I have the following dataframe

df1=read.table(text="ID POSITION S1 S2 
1   1 10 10
1   2 20 0
1   3 10 0
1   4 20 0
1   5 10 50
2   1 10 0
2   2 20 10
2   3 20 10
2   4 20 10
2   5 20 10", header=TRUE)

ID and POSITION are ordered by ID then POSITION

Then another dataframe with coordinates

df2=read.table(text="ID POSITION_START POSITION_END
1  1              3
1  4              5
2  1              5", header=TRUE)

How can I get the means of S1 and S2 for those specific intervals based on ID and position from df1?

Final dataframe would be

df3=read.table(text="ID POSITION_START POSITION_END S1 S2
1  1              3            13.33  3.3
1  4              5            15  25
2  1              5            18  8", header=TRUE)

Where S1 and S2 would be means for the intervals (first row is ID=1 rows 1 to 3) I think it might be important that positions are not always equal to row number

I've tried using the iRanges library to no avail


Solution

  • setDT(df1)
    setDT(df2)
    
    df1[
      df2,
      .(ID, POSITION_START, POSITION_END, S1, S2),
      on = .(ID, POSITION >= POSITION_START, POSITION <= POSITION_END)
    ][, lapply(.SD, mean), by = ID:POSITION_END]
    

    and you will obtain

          ID POSITION_START POSITION_END       S1        S2
       <int>          <int>        <int>    <num>     <num>
    1:     1              1            3 13.33333  3.333333
    2:     1              4            5 15.00000 25.000000
    3:     2              1            5 18.00000  8.000000
    

    aggregate(
      . ~ ID + POSITION_START + POSITION_END,
      subset(
        merge(df1, df2, all.x = TRUE),
        POSITION >= POSITION_START & POSITION <= POSITION_END,
        select = -POSITION
      ), mean
    )
    

    which gives

      ID POSITION_START POSITION_END       S1        S2
    1  1              1            3 13.33333  3.333333
    2  2              1            5 18.00000  8.000000
    3  1              4            5 15.00000 25.000000