I have a data frame in R with the columns ID, DAY, TIME and the amount of a compound (AMT). For every ID there should be two records per day, indicating two doses a day, typically in the morning (at around 8 am) and in the evening (at around 8 pm). Sometimes the DAY column contains "impute", which means the same dosing as before continues until there is an actual DAY value again. If this is the case and the column comment_yh indicates "blue", then I want to impute the missing days. In the end the data frame should contain the original TIME points (e.g. 8:05 or 19:53) and the imputed ones, which are always 8:00 and 20:00.
A minimal example could be:
df <- data.frame(
  ID = c(4, 4, 4, 4, 4, 4,
         5, 5, 5, 5,
         6, 6, 6, 6),
  DAY = c("14/02/2020", "14/02/2020", "15/02/2020", "impute", "18/02/2020", "18/02/2020",
          "13/02/2020", "impute", "15/02/2020", "15/02/2020",
          "13/02/2020", "impute", "15/02/2020", "15/02/2020"),
  TIME = c("8:05", "19:53", "7:45", "NA", "8:10", "20:01",
           "8:01", "NA", "8:00", "19:50",
           "8:02", "NA", "8:02", "20:06"),
  AMT = c(3, 3, 2, NA, 4, 5,
          3.5, NA, 3, 4,
          2, NA, 1, 2),
  comment_yh = c(NA, NA, NA, "blue", NA, NA,
                 NA, "blue", NA, NA,
                 NA, "red", NA, NA)
)
The resulting, imputed data frame should look like this:
df_final <- data.frame(
  ID = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         5, 5, 5, 5, 5, 5,
         6, 6, 6, 6),
  DAY = c("14/02/2020", "14/02/2020", "15/02/2020", "15/02/2020", "16/02/2020", "16/02/2020", "17/02/2020", "17/02/2020", "18/02/2020", "18/02/2020",
          "13/02/2020", "13/02/2020", "14/02/2020", "14/02/2020", "15/02/2020", "15/02/2020",
          "13/02/2020", "impute", "15/02/2020", "15/02/2020"),
  TIME = c("8:05", "19:53", "7:45", "20:00", "8:00", "20:00", "8:00", "20:00", "8:10", "20:01",
           "8:01", "20:00", "8:00", "20:00", "8:00", "19:50",
           "8:02", "NA", "8:02", "20:06"),
  AMT = c(3, 3, 2, 2, 2, 2, 2, 2, 4, 5,
          3.5, 3.5, 3.5, 3.5, 3, 4,
          2, NA, 1, 2)
)
Any suggestion is very welcome!
I already tried to loop over it, but I am not very proficient with R and am having problems with it.
To get your required output, you can do this:
library(dplyr)
library(tidyr)
df$DAY <- as.Date(df$DAY, "%d/%m/%Y")
result_df <- df # Create a copy to store results
for(i in 1:nrow(df)){
  if(!is.na(df$comment_yh[i]) && df$comment_yh[i] == "blue"){
    # Days strictly between the previous and the next recorded day
    date_seq <- seq(df$DAY[i-1] + 1, df$DAY[i+1] - 1, by = "days")
    n <- length(date_seq)
    if(n > 0){
      result_df <- rbind(result_df,
        data.frame( # Insert the new rows
          ID = rep(df$ID[i], n*2+1),
          # Evening dose on the previous day, then two doses per imputed day
          DAY = c(df$DAY[i-1], rep(date_seq, each = 2)),
          TIME = c("20:00", rep(c("8:00", "20:00"), n)),
          AMT = rep(df$AMT[i-1], n*2+1), # Carry the last recorded dose forward
          comment_yh = NA
        )
      )
    }
  }
}
result_df <- result_df %>%
  filter(is.na(comment_yh) | comment_yh=="red") %>%
  # Sort TIME as a clock time; sorted as text, "19:53" would come before "8:05"
  arrange(ID, DAY, as.numeric(as.difftime(TIME, format = "%H:%M", units = "mins"))) %>%
  select(-comment_yh) %>% # deselect comment_yh column
  drop_na() # drop NAs in red row
Note: I dropped the row with "red" as comment_yh. The loop also assumes that every "blue" row sits between two dated rows of the same ID, that the day before it has only a morning record, and that at least one full day is missing in between (otherwise seq() stops with a "wrong sign in 'by' argument" error).
ID | DAY | TIME | AMT |
---|---|---|---|
4 | 2020-02-14 | 8:05 | 3.0 |
4 | 2020-02-14 | 19:53 | 3.0 |
4 | 2020-02-15 | 7:45 | 2.0 |
4 | 2020-02-15 | 20:00 | 2.0 |
4 | 2020-02-16 | 8:00 | 2.0 |
4 | 2020-02-16 | 20:00 | 2.0 |
4 | 2020-02-17 | 8:00 | 2.0 |
4 | 2020-02-17 | 20:00 | 2.0 |
4 | 2020-02-18 | 8:10 | 4.0 |
4 | 2020-02-18 | 20:01 | 5.0 |
5 | 2020-02-13 | 8:01 | 3.5 |
5 | 2020-02-13 | 20:00 | 3.5 |
5 | 2020-02-14 | 8:00 | 3.5 |
5 | 2020-02-14 | 20:00 | 3.5 |
5 | 2020-02-15 | 8:00 | 3.0 |
5 | 2020-02-15 | 19:50 | 4.0 |
6 | 2020-02-13 | 8:02 | 2.0 |
6 | 2020-02-15 | 8:02 | 1.0 |
6 | 2020-02-15 | 20:06 | 2.0 |
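If the "red" row should stay in the result and DAY should keep its original dd/mm/yyyy text format, as in df_final from the question, a base R variant along the following lines might work. This is only a sketch: expand_impute and df_imputed are names I made up for illustration, and it assumes that every "blue" row sits between two dated rows of the same ID, that the day before it has only a morning record, and that the imputed dose is the last recorded AMT.

```r
# Example data copied from the question
df <- data.frame(
  ID = c(4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6),
  DAY = c("14/02/2020", "14/02/2020", "15/02/2020", "impute", "18/02/2020", "18/02/2020",
          "13/02/2020", "impute", "15/02/2020", "15/02/2020",
          "13/02/2020", "impute", "15/02/2020", "15/02/2020"),
  TIME = c("8:05", "19:53", "7:45", "NA", "8:10", "20:01",
           "8:01", "NA", "8:00", "19:50",
           "8:02", "NA", "8:02", "20:06"),
  AMT = c(3, 3, 2, NA, 4, 5, 3.5, NA, 3, 4, 2, NA, 1, 2),
  comment_yh = c(NA, NA, NA, "blue", NA, NA, NA, "blue", NA, NA, NA, "red", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the "impute" row at position i by dose rows.
# Assumes rows i - 1 and i + 1 are dated rows of the same ID and that the
# previous day has only its morning dose recorded.
expand_impute <- function(df, i) {
  prev_day <- as.Date(df$DAY[i - 1], "%d/%m/%Y")
  next_day <- as.Date(df$DAY[i + 1], "%d/%m/%Y")
  gap <- as.integer(next_day - prev_day)  # days from previous to next record
  # Two slots per day, starting on the previous day, ending before next_day
  days <- prev_day + rep(seq_len(gap) - 1, each = 2)
  times <- rep(c("8:00", "20:00"), times = gap)
  # Drop the first slot: the morning dose of the previous day already exists
  data.frame(
    ID = df$ID[i],
    DAY = format(days[-1], "%d/%m/%Y"),
    TIME = times[-1],
    AMT = df$AMT[i - 1],  # assumption: carry the last recorded dose forward
    comment_yh = NA_character_,
    stringsAsFactors = FALSE
  )
}

# Walk through the rows in order; only "impute" rows marked "blue" are expanded
pieces <- lapply(seq_len(nrow(df)), function(i) {
  is_blue <- df$DAY[i] == "impute" && identical(df$comment_yh[i], "blue")
  if (is_blue) expand_impute(df, i) else df[i, ]
})
df_imputed <- do.call(rbind, pieces)
rownames(df_imputed) <- NULL
df_imputed$comment_yh <- NULL
```

Because each "impute" row is replaced in place, the original row order is kept and no sorting on TIME is needed. The "red" row is passed through unchanged, including its "impute" value in DAY.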