I have a dataframe in R:
  col1       col2 col3 col4  col5
1    1          a    x 10.5 FALSE
2    2          b    y 20.3  TRUE
3    3          c    z 30.7 FALSE
4    4 apple pie:    w 40.1  TRUE
5    5          e    v 50.9 apple
sample_df <- structure(list(
col1 = c(1, 2, 3, 4, 5),
col2 = c("a", "b", "c", "apple pie:", "e"),
col3 = c("x", "y", "z", "w", "v"),
col4 = c(10.5, 20.3, 30.7, 40.1, 50.9),
col5 = c(FALSE, TRUE, FALSE, TRUE, "apple")
), class = "data.frame", row.names = c(NA, -5L))
I want to find the first row where the word "apple" occurs and delete all rows above it.
I know how to do this in multiple steps:
first_apple_row <- min(which(apply(sample_df, 1, function(row) any(grepl("apple", row)))))
result_df <- sample_df[first_apple_row:nrow(sample_df),]
  col1       col2 col3 col4  col5
4    4 apple pie:    w 40.1  TRUE
5    5          e    v 50.9 apple
Is there a function in R that can accomplish this more directly?
For the question:

Is there a function in R that can accomplish this more directly?

The simple answer is "no": there is no single function in base R (whether the base package or one of its default-installed packages) that does exactly this task. It is not hard, however, to come up with a relatively simple expression that does what you need.
If you don't mind an admittedly inefficient paste across the whole frame:
sample_df[ cumsum(grepl("apple", do.call(paste, sample_df))) > 0, ]
#   col1       col2 col3 col4  col5
# 4    4 apple pie:    w 40.1  TRUE
# 5    5          e    v 50.9 apple
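To see what this one-liner is doing, here are its intermediate pieces for this sample_df: do.call(paste, ...) collapses each row into a single string, grepl flags the rows that contain "apple", and cumsum(...) > 0 stays TRUE from the first match onward.

do.call(paste, sample_df)
# [1] "1 a x 10.5 FALSE"         "2 b y 20.3 TRUE"          "3 c z 30.7 FALSE"
# [4] "4 apple pie: w 40.1 TRUE" "5 e v 50.9 apple"
grepl("apple", do.call(paste, sample_df))
# [1] FALSE FALSE FALSE  TRUE  TRUE
cumsum(grepl("apple", do.call(paste, sample_df))) > 0
# [1] FALSE FALSE FALSE  TRUE  TRUE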
Updating the benchmarks, with thanks to jay.sf for the prod to make them meaningful: the timings below use a randomly sampled 50K-row sample_df with the same possible values and classes (a sketch of a comparable setup follows the table):
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>
1 r2          28.69ms  29.09ms      34.2    6.79MB     5.70    12     2      351ms <df>   <Rprofmem> <bench_tm [14]>
2 r2_sb        8.61ms   9.14ms     109.     6.03MB     8.92    49     4      448ms <df>   <Rprofmem> <bench_tm [53]>
3 tic         27.47ms  28.11ms      35.0   13.09MB    10.8     13     4      371ms <df>   <Rprofmem> <bench_tm [17]>
4 fr2         27.27ms  27.79ms      35.8   12.33MB    10.2     14     4      391ms <df>   <Rprofmem> <bench_tm [18]>
5 fr3          2.06ms   2.37ms     427.     6.03MB    48.9    157    18      368ms <df>   <Rprofmem> <bench_tm [175]>
6 jaysf        9.44ms   9.82ms     100.     4.51MB     8.92    45     4      448ms <df>   <Rprofmem> <bench_tm [49]>
7 s_b          3.92ms   4.33ms     230.     6.79MB    37.5     86    14      374ms <df>   <Rprofmem> <bench_tm [100]>
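The full harness for that table isn't reproduced here, but a sketch of the general setup, comparing just the two expressions from this answer on an illustrative 50K-row frame built from the same possible values, would look something like this (big_df and the exact sampling are illustrative only):

library(bench)
set.seed(42)
n <- 50000
# illustrative 50K-row frame drawn from the same possible values/classes as sample_df
big_df <- data.frame(
  col1 = sample(c(1, 2, 3, 4, 5), n, replace = TRUE),
  col2 = sample(c("a", "b", "c", "apple pie:", "e"), n, replace = TRUE),
  col3 = sample(c("x", "y", "z", "w", "v"), n, replace = TRUE),
  col4 = sample(c(10.5, 20.3, 30.7, 40.1, 50.9), n, replace = TRUE),
  col5 = sample(c("FALSE", "TRUE", "apple"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# both expressions return the same subset, so the default check passes
bench::mark(
  r2    = big_df[ cumsum(grepl("apple", do.call(paste, big_df))) > 0, ],
  r2_sb = big_df[ cumsum(grepl("apple", do.call(paste, Filter(is.character, big_df)))) > 0, ]
)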
The "r2_sb" code is a combination of my first block with s_baldur's recommendation to only work on character columns,
sample_df[ cumsum(grepl("apple", do.call(paste, Filter(is.character, sample_df)))) > 0, ]
Clearly the fastest is Friede's third code block ("fr3", the for loop), since it stops processing much sooner than most of the others. Honorable mention (still clearly dominant over everything other than "fr3") goes to s_baldur's answer.
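For context on why the loop wins: it can break out as soon as the first matching row is found instead of testing every row of a 50K-row frame. Friede's actual code isn't reproduced here, but a minimal sketch of that early-exit idea (restricted to the character columns) might look like this:

# find the first row containing "apple", stopping as soon as it is found
# (a sketch of the early-exit idea only, not Friede's exact code)
chr_cols <- which(vapply(sample_df, is.character, logical(1)))
first_apple_row <- NA_integer_
for (i in seq_len(nrow(sample_df))) {
  if (any(grepl("apple", unlist(sample_df[i, chr_cols], use.names = FALSE)))) {
    first_apple_row <- i
    break                      # no need to scan any later rows
  }
}
result_df <- if (is.na(first_apple_row)) sample_df[0, ] else sample_df[first_apple_row:nrow(sample_df), ]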