rdataframesubsetnatrim

Trim a data frame from the top or bottom, according to a column with NAs


I have a data frame similar to final_df below, which I obtain like this:

datadf <- data.frame(time=1:100, A=rnorm(100), B=rnorm(100), C=rnorm(100))
datadf <- as.data.frame(tidyr::pivot_longer(datadf, -time, names_to = "letter", values_to = "value"))
calldf <- data.frame(call=sample(LETTERS[1:3], 10, replace=T),
                     time=seq(from=5, by=9, length=10),
                     callnum=1:10, callnum_rev=10:1)
final_df <- dplyr::left_join(datadf, calldf, by="time")

And it looks like this:

> final_df
    time letter        value call callnum callnum_rev
1      1      A  0.008684912 <NA>      NA          NA
2      1      B -0.229217147 <NA>      NA          NA
3      1      C  1.675954809 <NA>      NA          NA
4      2      A  0.011244567 <NA>      NA          NA
5      2      B  0.593363139 <NA>      NA          NA
6      2      C -1.124759422 <NA>      NA          NA
7      3      A  0.271118167 <NA>      NA          NA
8      3      B  1.768275352 <NA>      NA          NA
9      3      C  0.287273528 <NA>      NA          NA
10     4      A  1.083097563 <NA>      NA          NA
11     4      B  1.990573158 <NA>      NA          NA
12     4      C -0.953178506 <NA>      NA          NA
13     5      A -0.076200582    A       1          10
14     5      B  1.159319799    A       1          10
15     5      C -0.484462797    A       1          10
16     6      A  0.506393522 <NA>      NA          NA
17     6      B -0.847847907 <NA>      NA          NA
18     6      C -0.429095586 <NA>      NA          NA
19     7      A -0.591182884 <NA>      NA          NA
20     7      B  0.996488879 <NA>      NA          NA
...

The only thing I want to do is to be able to trim the n top rows and/or the n bottom rows according to the variables callnum (counts from the top) and/or callnum_rev (counts from the bottom).

Some examples:

What actually would be the absolute best, is to leave an amount of NA rows above the top row, and an amount of NA rows below the bottom row, so that amount is half the NA rows between that value and the previous/next.

In this case that is simple because I am always leaving 8 NA rows between two call values (seq by 9)... So in the example of a value for top=3 and a value for bottom=4, I would like my resulting data frame to go from row 55 (67-12, 12=8/23, as in 3 letters) to row 189 (177+12, 12=8/23, as in 3 letters).

This can get convoluted, because in my real data, the lines with callnum values can have a variable number of NA rows in between.

Any help?


Solution

  • head_tail <- function(data, top=NULL, bottom=NULL) {
        if(!is.null(top)) 
          s1 <- min(which(data[,'callnum']==top))
        else
          s1 <- 1
        
        if(!is.null(bottom))
          s2 <- max(which(data[,'callnum_rev']==bottom))
        else
          s2 <- nrow(data)
    
        return(data[seq(s1, s2),])
    }
    
    library(data.table)
    final_dt <- setDT(final_df)
    
    head_tail(final_dt, top=3)
    

          time letter       value   call callnum callnum_rev
         <num> <char>       <num> <char>   <int>       <int>
      1:    23      A -1.15224027      A       3           8
      2:    23      B -0.22485125      A       3           8
      3:    23      C -0.36342374      A       3           8
      4:    24      A  0.65140424   <NA>      NA          NA
      5:    24      B  0.07724828   <NA>      NA          NA
     ---                                                    
    230:    99      B -1.13818328   <NA>      NA          NA
    231:    99      C -0.00105447   <NA>      NA          NA
    232:   100      A  0.41389275   <NA>      NA          NA
    233:   100      B -2.48077559   <NA>      NA          NA
    234:   100      C  1.32573054   <NA>      NA          NA
    
    head_tail(final_dt, bottom=4)
    

          time letter      value   call callnum callnum_rev
         <num> <char>      <num> <char>   <int>       <int>
      1:     1      A -0.8785872   <NA>      NA          NA
      2:     1      B  0.6806222   <NA>      NA          NA
      3:     1      C  1.5115759   <NA>      NA          NA
      4:     2      A -0.9589247   <NA>      NA          NA
      5:     2      B -0.7238646   <NA>      NA          NA
     ---                                                   
    173:    58      B -0.3192178   <NA>      NA          NA
    174:    58      C  1.4266444   <NA>      NA          NA
    175:    59      A -2.1541144      A       7           4
    176:    59      B -0.4405924      A       7           4
    177:    59      C  0.2283863      A       7           4
    
    head_tail(final_dt, top=3, bottom=4)
    

          time letter       value   call callnum callnum_rev
         <num> <char>       <num> <char>   <int>       <int>
      1:    23      A -1.15224027      A       3           8
      2:    23      B -0.22485125      A       3           8
      3:    23      C -0.36342374      A       3           8
      4:    24      A  0.65140424   <NA>      NA          NA
      5:    24      B  0.07724828   <NA>      NA          NA
     ---                                                    
    107:    58      B -0.31921775   <NA>      NA          NA
    108:    58      C  1.42664445   <NA>      NA          NA
    109:    59      A -2.15411443      A       7           4
    110:    59      B -0.44059240      A       7           4
    111:    59      C  0.22838630      A       7           4