r dataframe subset

Find the first row in a data frame that satisfies a condition and delete everything above?


I have a dataframe in R:

  col1  col2 col3 col4  col5
1    1     a    x 10.5 FALSE
2    2     b    y 20.3  TRUE
3    3     c    z 30.7 FALSE
4    4 apple pie: w 40.1  TRUE
5    5     e    v 50.9 apple

sample_df <- structure(list(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("a", "b", "c", "apple pie:", "e"),
  col3 = c("x", "y", "z", "w", "v"),
  col4 = c(10.5, 20.3, 30.7, 40.1, 50.9),
  col5 = c(FALSE, TRUE, FALSE, TRUE, "apple")
), class = "data.frame", row.names = c(NA, -5L))

I want to find the first row where the word "apple" occurs and delete all rows above this.

I know how to do this in multiple steps:

first_apple_row <- min(which(apply(sample_df, 1, function(row) any(grepl("apple", row)))))
result_df <- sample_df[first_apple_row:nrow(sample_df),]

  col1       col2 col3 col4  col5
4    4 apple pie:    w 40.1  TRUE
5    5          e    v 50.9 apple

Is there a function in R that can accomplish this more directly?


Solution

  • For the question:

    Is there a function in R that can accomplish this more directly?

    The short answer is "no": no single function in base R (neither the base package nor any of the packages installed with it by default) does exactly this. That said, it is not hard to come up with a relatively simple expression that does what you need.


    If you don't mind an admittedly inefficient paste across the whole frame, this works:

    sample_df[ cumsum(grepl("apple", do.call(paste, sample_df))) > 0, ]
    #   col1       col2 col3 col4  col5
    # 4    4 apple pie:    w 40.1  TRUE
    # 5    5          e    v 50.9 apple
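
To make the one-liner above easier to follow, here is the same expression split into steps. This is only an unpacked sketch of that expression; the intermediate names (row_text, has_apple, keep) are made up for illustration and are not part of the original answer.

```r
sample_df <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("a", "b", "c", "apple pie:", "e"),
  col3 = c("x", "y", "z", "w", "v"),
  col4 = c(10.5, 20.3, 30.7, 40.1, 50.9),
  col5 = c(FALSE, TRUE, FALSE, TRUE, "apple"),  # coerced to character, as in the question
  stringsAsFactors = FALSE
)

# 1. Paste every row into one string, so there is one string per row,
#    e.g. row 4 becomes "4 apple pie: w 40.1 TRUE".
row_text <- do.call(paste, sample_df)

# 2. Flag the rows whose pasted text contains "apple".
has_apple <- grepl("apple", row_text)    # FALSE FALSE FALSE TRUE TRUE

# 3. The running count of hits is 0 before the first match and at least 1
#    from the first match onward, even if later rows do not match.
keep <- cumsum(has_apple) > 0            # FALSE FALSE FALSE TRUE TRUE

result_df <- sample_df[keep, ]

# Later non-matching rows are still kept once a match has been seen:
cumsum(c(FALSE, TRUE, FALSE, TRUE)) > 0  # FALSE TRUE TRUE TRUE
```

If nothing matches, keep is all FALSE and the result is a zero-row frame, whereas the min(which(...)) version in the question would fail on the subsequent first_apple_row:nrow(sample_df).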
    

    Updated benchmarks (thanks to jay.sf for the prod to make them meaningful), run on a randomly sampled 50K-row sample_df with the same possible values and classes:

      expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time            
      <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>          
    1 r2          28.69ms  29.09ms      34.2    6.79MB     5.70    12     2      351ms <df>   <Rprofmem> <bench_tm [14]> 
    2 r2_sb        8.61ms   9.14ms     109.     6.03MB     8.92    49     4      448ms <df>   <Rprofmem> <bench_tm [53]> 
    3 tic         27.47ms  28.11ms      35.0   13.09MB    10.8     13     4      371ms <df>   <Rprofmem> <bench_tm [17]> 
    4 fr2         27.27ms  27.79ms      35.8   12.33MB    10.2     14     4      391ms <df>   <Rprofmem> <bench_tm [18]> 
    5 fr3          2.06ms   2.37ms     427.     6.03MB    48.9    157    18      368ms <df>   <Rprofmem> <bench_tm [175]>
    6 jaysf        9.44ms   9.82ms     100.     4.51MB     8.92    45     4      448ms <df>   <Rprofmem> <bench_tm [49]> 
    7 s_b          3.92ms   4.33ms     230.     6.79MB    37.5     86    14      374ms <df>   <Rprofmem> <bench_tm [100]>
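
For anyone wanting to try something similar, below is a rough sketch of how a comparable 50K-row frame could be built. It is not the setup used for the table above: the seed, the name big_df, the wrapper paste_all, and the exact sampling are all assumptions. The table itself looks like bench::mark() output, but base system.time() is used here to avoid an extra dependency.

```r
set.seed(42)   # arbitrary seed, only so the sketch is reproducible
n <- 50000L

# Each column is sampled from the values that column has in the question's data.
big_df <- data.frame(
  col1 = sample(c(1, 2, 3, 4, 5), n, replace = TRUE),
  col2 = sample(c("a", "b", "c", "apple pie:", "e"), n, replace = TRUE),
  col3 = sample(c("x", "y", "z", "w", "v"), n, replace = TRUE),
  col4 = sample(c(10.5, 20.3, 30.7, 40.1, 50.9), n, replace = TRUE),
  col5 = sample(c("FALSE", "TRUE", "apple"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around the paste-based expression from this answer.
paste_all <- function(df) {
  df[cumsum(grepl("apple", do.call(paste, df))) > 0, ]
}

# Time a single run; absolute numbers will differ by machine.
timing <- system.time(res <- paste_all(big_df))
print(timing)
```

With data sampled like this, "apple" is likely to turn up within the first few rows, so any approach that can stop at the first match has very little work to do.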
    

    The "r2_sb" code combines my first block with s_baldur's recommendation to work only on character columns:

    sample_df[ cumsum(grepl("apple", do.call(paste, Filter(is.character, sample_df)))) > 0, ]
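
As a quick check of what Filter(is.character, ...) does here, the sketch below shows which columns survive the filter and confirms that, on the question's data, the result matches pasting every column. The names chr_cols, all_cols, and chr_only are made up for illustration.

```r
sample_df <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("a", "b", "c", "apple pie:", "e"),
  col3 = c("x", "y", "z", "w", "v"),
  col4 = c(10.5, 20.3, 30.7, 40.1, 50.9),
  col5 = c(FALSE, TRUE, FALSE, TRUE, "apple"),  # coerced to character, as in the question
  stringsAsFactors = FALSE
)

# Filter() on a data frame keeps the columns for which the predicate is TRUE.
chr_cols <- Filter(is.character, sample_df)
names(chr_cols)   # "col2" "col3" "col5"

# Searching all columns versus character columns only.
all_cols <- sample_df[cumsum(grepl("apple", do.call(paste, sample_df))) > 0, ]
chr_only <- sample_df[cumsum(grepl("apple", do.call(paste, chr_cols))) > 0, ]

identical(all_cols, chr_only)   # TRUE
```

Numeric columns cannot contain "apple", so skipping them only saves work. One caveat: factor columns are not character, so is.character would skip them too; if the text might live in a factor, the predicate would need adjusting.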
    

    The fastest by a clear margin is Friede's third code block ("fr3", the for loop), since it stops processing much sooner than most of the others. Honorable mention goes to s_baldur's answer ("s_b"), which still clearly beats everything other than "fr3".
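
To illustrate why stopping early helps, here is a minimal sketch of an early-exit loop. It is not Friede's actual code, which is not reproduced here; the function name drop_rows_before and its arguments are hypothetical, and it assumes the text of interest only appears in character columns.

```r
# Hypothetical helper: keep the first row matching `pattern` and all rows after it.
drop_rows_before <- function(df, pattern) {
  # Only character columns can contain the text (assumption, see above).
  chr_cols <- Filter(is.character, df)
  for (i in seq_len(nrow(df))) {
    # Collect the character values of row i as a plain vector.
    row_values <- unlist(chr_cols[i, , drop = FALSE], use.names = FALSE)
    if (any(grepl(pattern, row_values))) {
      # First match found: return immediately, without scanning later rows.
      return(df[i:nrow(df), , drop = FALSE])
    }
  }
  # No match anywhere: return a zero-row frame with the same columns.
  df[0, , drop = FALSE]
}

sample_df <- data.frame(
  col1 = c(1, 2, 3, 4, 5),
  col2 = c("a", "b", "c", "apple pie:", "e"),
  col3 = c("x", "y", "z", "w", "v"),
  col4 = c(10.5, 20.3, 30.7, 40.1, 50.9),
  col5 = c(FALSE, TRUE, FALSE, TRUE, "apple"),
  stringsAsFactors = FALSE
)

result_df <- drop_rows_before(sample_df, "apple")   # rows 4 and 5
no_match  <- drop_rows_before(sample_df, "banana")  # zero rows
```

The trade-off is that a loop like this pays R-level overhead per row, so it wins when the first match is near the top and can lose to the vectorized versions when the match is far down or absent.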