I have a balanced panel data set df with columns:
id, time, var1, var2, var3, var4....
Some values in var1
and var4
contain Na elements.
If they contain the NA element, I want to remove all the corresponding id rows that contain the NA elements so that my data remains balanced.
So if var1
contains Na that corresponds to id=28
, I want to remove all the id=28
rows.
Can I do that using Dplyr in R?
I did something like this:
df %>% select(c(var1,var4)) %>% group_by(id) %>% filter( !is.na(.))
But it doesn't work. Could you please provide some guidance?
Always provide sample data to get better and faster answers.
Try filter
with if_all
and starts_with
.
The code:
# Library(tidyverse)
set.seed(128)
aux <- tibble(
id = 1:10,
time = now() + seconds(sample(c(rep(NA, 1000), -3600:3600), 10, replace = TRUE)) %>% hms::hms(),
var1 = sample(c(NA, 1:10), 10, replace = TRUE),
var2 = sample(c(NA, 1:10), 10, replace = TRUE),
var3 = sample(c(NA, 1:10), 10, replace = TRUE),
var4 = sample(c(NA, 1:10), 10, replace = TRUE))
aux <- filter(aux, if_all(starts_with("var"), \(x) !is.na(x)))
Input:
> aux
# A tibble: 10 × 6
id time var1 var2 var3 var4
<int> <dttm> <int> <int> <int> <int>
1 1 2024-04-11 16:30:32 NA NA 3 1
2 2 2024-04-11 16:11:08 3 5 1 9
3 3 2024-04-11 14:47:42 5 2 9 4
4 4 2024-04-11 15:43:26 7 6 6 NA
5 5 NA 6 2 5 10
6 6 2024-04-11 15:18:34 5 3 4 NA
7 7 2024-04-11 15:54:08 9 NA 7 8
8 8 2024-04-11 16:19:12 1 10 7 9
9 9 2024-04-11 15:46:24 2 NA 2 NA
10 10 2024-04-11 15:32:55 1 1 9 6
Output:
# A tibble: 5 × 6
id time var1 var2 var3 var4
<int> <dttm> <int> <int> <int> <int>
1 2 2024-04-11 16:11:08 3 5 1 9
2 3 2024-04-11 14:47:42 5 2 9 4
3 5 NA 6 2 5 10
4 8 2024-04-11 16:19:12 1 10 7 9
5 10 2024-04-11 15:32:55 1 1 9 6