I have a column in a dataset looking like this:
cluster_id
1
1
1
1
NA
1
NA
NA
2
NA
2
NA
3
NA
NA
3
cluster_id <- c("1","1","1","1","NA","1","NA","NA","2","NA","2","NA","3","NA","NA","3")
The order is already pre-defined before using a time column. What I want is to substitute the NA's that are within each cluster ID, i.e. if there's a row with 2, then an NA, and then a 2 again, I want that NA to become 2. The NA's between numbers stay as NA's. Example:
cluster_id cluster_id_new
1 1
1 1
1 1
1 1
NA 1
1 1
NA NA
NA NA
2 2
NA 2
2 2
NA NA
3 3
NA 3
NA 3
3 3
I found the zoo::na.locf
function in this post, which seems to be close to what I want, but I also need to take in consideration the value after the NA.
Any thoughts?
Update: thanks to @Darren Tsai (who provided a better version):
library(dplyr)
library(tidyr)
tibble(cluster_id) %>%
mutate(down = cluster_id, up = cluster_id) %>%
fill(down, .direction = "down") %>%
fill(up, .direction = "up") %>%
mutate(cluster_id_new = if_else(down == up, down, NA)) %>%
select(-c(down, up))
Original answer:
Here is tidyverse
way using mainly fill
:
library(dplyr)
library(tidyr)
tibble(cluster_id) %>%
mutate(helper = row_number(),
filldown = fill(., cluster_id, .direction = "down")$cluster_id,
fillup = fill(., cluster_id, .direction = "up")$cluster_id,
cluster_id_new = ifelse(filldown == fillup, filldown, NA_real_)) %>%
select(cluster_id, cluster_id_new)
cluster_id cluster_id_new
<dbl> <dbl>
1 1 1
2 1 1
3 1 1
4 1 1
5 NA 1
6 1 1
7 NA NA
8 NA NA
9 2 2
10 NA 2
11 2 2
12 NA NA
13 3 3
14 NA 3
15 NA 3
16 3 3