I have a large xts object of 10-minute data that I am trying to average to hourly values. The data has some missing values. I want to average every hour that has >50% data completeness and replace every hour that has <50% data completeness with NA. How would I go about doing that?
Representative data:
library(lubridate)
library(xts)
set.seed(1001)
starttime <- ymd_hms('2001-01-01 10:00:00')
endtime <- ymd_hms('2001-01-01 12:50:00')
timevec <- seq(starttime, endtime, by = '10 min')
data <- data.frame(replicate(2, sample(0:10, length(timevec), replace = TRUE)))
datana <- as.matrix(data)
datana[sample(seq_along(datana),0.4*length(datana))] <- NA
dat_na <- as.data.frame(datana)|> type.convert(as.is=TRUE)
dat_natime <- cbind(time=timevec,dat_na)
# keep the time column out of the data so the xts stays numeric
xtsdatna <- xts(dat_na, order.by = timevec)
Expected result:
                 time   X1   X2
1 2001-01-01 10:00:00 4.50 6.20
2 2001-01-01 11:00:00   NA 8.00
3 2001-01-01 12:00:00 6.25 4.67
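You can write a function that checks whether the proportion of NA values exceeds a threshold, returning NA if so and the mean of the non-missing values otherwise. Note that with a threshold of 0.5, an hour that is exactly half missing is still averaged.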
mean_or_na <- function(x, na_threshold) {
  # mean(is.na(x)) is the fraction of missing values in x
  if (mean(is.na(x)) > na_threshold) {
    NA_real_
  } else {
    mean(x, na.rm = TRUE)
  }
}
I'm not familiar with xts, but using the data.frame format you could apply the function with:
suppressPackageStartupMessages({
  library(dplyr)
  library(lubridate)
})
dat_natime %>%
  group_by(time_hour = floor_date(time, 'hour')) %>%
  summarise(across(starts_with('X'), \(x) mean_or_na(x, 0.5)))
#> # A tibble: 3 × 3
#>   time_hour              X1    X2
#>   <dttm>              <dbl> <dbl>
#> 1 2001-01-01 10:00:00   7   NA
#> 2 2001-01-01 11:00:00   2.8  3
#> 3 2001-01-01 12:00:00   2.5  5.33
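If you would rather stay in xts, the same idea can probably be expressed with `endpoints()` and `period.apply()`. The following is only a rough sketch, assuming the xts object holds purely numeric columns (no time column in the data). The helper `col_mean_or_na` and the small example series are made up for illustration and are not part of the question's data.

```r
library(xts)

# Hypothetical helper: column-wise mean of one chunk of rows, set to NA
# for any column where more than `na_threshold` of the values are missing.
col_mean_or_na <- function(x, na_threshold = 0.5) {
  m <- coredata(x)                      # plain numeric matrix for this chunk
  out <- colMeans(m, na.rm = TRUE)
  out[colMeans(is.na(m)) > na_threshold] <- NA_real_
  out
}

# Made-up numeric xts: two hours of 10-minute observations
idx <- seq(as.POSIXct("2001-01-01 10:00:00", tz = "UTC"),
           by = "10 min", length.out = 12)
x <- xts(cbind(X1 = c(1, NA, 3, NA, 5, 6, NA, NA, NA, NA, 2, 4),
               X2 = c(2, 4, 6, 8, 10, 12, 1, NA, 3, NA, 5, NA)),
         order.by = idx)

# Row positions of the last observation in each hour
ep <- endpoints(x, on = "hours")

# Apply the helper to each hourly chunk
hourly <- period.apply(x, INDEX = ep, FUN = col_mean_or_na)

# period.apply stamps each row with the chunk's last timestamp;
# relabel with the start of the hour instead
index(hourly) <- as.POSIXct(trunc(index(hourly), units = "hours"))

hourly
```

With this made-up series, the 10:00 hour gives 3.75 for X1 (two of six values missing) and 7 for X2. In the 11:00 hour, X1 becomes NA because four of six values are missing, while X2 is exactly half missing and is therefore still averaged, giving 3. If exactly 50% completeness should also yield NA, change `>` to `>=` in the helper.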