Given the dataset below...
col1 col2 var1 var2 var3
1 1 NA NA NA
2 2 NA NA NA
3 3 NA 3 NA
4 4 4 4 4
5 5 5 5 5
6 6 6 NA 6
7 7 NA 7 7
8 8 NA NA NA
9 9 9 NA NA
10 10 NA NA NA
...how can I remove rows containing NAs for all select columns only at the start and end in an automated way (and preferably using tidyverse)?
col1 col2 var1 var2 var3
3 3 NA 3 NA
4 4 4 4 4
5 5 5 5 5
6 6 6 NA 6
7 7 NA 7 7
8 8 NA NA NA
9 9 9 NA NA
So in the example above, how can I remove those rows containing NAs for all of columns var1-var3 only at the start and end of the dataset? That is, only rows 1, 2 and 10 should disappear. (Although row 8 contains NAs for all of these select variables, it should not be removed as it is not at the 'start' or 'end' of the dataset.)
Here is my tidyverse solution:
library(tidyverse)
have <- tibble(col1 = 1:10, col2 = 1:10, var1 = 1:10, var2 = 1:10, var3 = 1:10)
have[c(1, 2, 10), 3:5] <- NA
have[3, c(3, 5)] <- NA
have[6, 4] <- NA
have[7, 3] <- NA
have[8, 3:5] <- NA
have[9, 4:5] <- NA
no_select_vars <- 3 # The number of select variables
want <- have |>
mutate(no_missing = rowSums(across(-c(col1, col2), ~ is.na(.x)))) |>
slice(first(which(no_missing < no_select_vars)):n()) |>
slice(1:last(which(no_missing < no_select_vars))) |>
select(-no_missing)
Is there an existing function for this purpose, or a more elegant solution than mine?
One approach is to create an indicator if all of the selected columns are missing (in this case, no_miss
if not all are missing).
Then, you can filter
with cumany
removing all rows until a TRUE is reached in no_miss
at beginning, and then same logic in rev
erse at the end.
library(tidyverse)
have %>%
mutate(no_miss = !if_all(contains("var"), is.na)) %>%
filter(cumany(no_miss) & rev(cumany(rev(no_miss))))
Output
col1 col2 var1 var2 var3 no_miss
<int> <int> <int> <int> <int> <lgl>
1 3 3 NA 3 NA TRUE
2 4 4 4 4 4 TRUE
3 5 5 5 5 5 TRUE
4 6 6 6 NA 6 TRUE
5 7 7 NA 7 7 TRUE
6 8 8 NA NA NA FALSE
7 9 9 9 NA NA TRUE