[SOLVED] Period apply after filtering missing data

I have a large xts object of 10-minute data that I am trying to average to hourly values. Some of the observations are missing. I want to average every hour that has more than 50% data completeness and replace every hour that has less than 50% data completeness with NA. How would I go about doing that?

Representative data:

library(lubridate)
library(xts)
set.seed(1001)
# ymd_hms(), not ydm_hms(): the strings are written year-month-day
starttime <- ymd_hms('2001-01-01 10:00:00')
endtime <- ymd_hms('2001-01-01 12:50:00')
timevec <- seq(starttime, endtime, by = '10 min')
data <- data.frame(replicate(2, sample(0:10, length(timevec), replace = TRUE)))
# Set roughly 40% of the values to NA at random
datana <- as.matrix(data)
datana[sample(seq_along(datana), 0.4 * length(datana))] <- NA
dat_na <- as.data.frame(datana) |> type.convert(as.is = TRUE)
dat_natime <- cbind(time = timevec, dat_na)
# Keep the timestamps in the index only, so the xts data stays numeric
xtsdatna <- xts(dat_na, order.by = timevec)

Expected result:

                 time   X1   X2
1 2001-01-01 10:00:00 4.50 6.20
2 2001-01-01 11:00:00   NA 8.00
3 2001-01-01 12:00:00 6.25 4.67

Solution

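You can write a function that checks whether the proportion of NA values exceeds a threshold, returning NA if it does and the mean of the non-missing values otherwise. The comparison is strict, so a group with exactly the threshold proportion of NAs is still averaged.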

mean_or_na <- function(x, na_threshold){
  if (mean(is.na(x)) > na_threshold) {
    NA_real_
  } else {
    mean(x, na.rm = TRUE)
  }
}
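
As a quick sanity check of the threshold logic, here is a small sketch on made-up vectors (the values and the names mostly_present, mostly_missing and exactly_half are hypothetical, not taken from the question's data). Each vector stands in for one hour of six 10-minute readings.

```r
# Helper repeated from above so this snippet runs on its own
mean_or_na <- function(x, na_threshold) {
  if (mean(is.na(x)) > na_threshold) NA_real_ else mean(x, na.rm = TRUE)
}

# Hypothetical hourly blocks of six 10-minute readings
mostly_present <- c(4, NA, 5, 6, NA, 5)    # 2 of 6 missing
mostly_missing <- c(NA, NA, 3, NA, NA, 9)  # 4 of 6 missing
exactly_half   <- c(2, NA, 4, NA, 6, NA)   # 3 of 6 missing

res_present <- mean_or_na(mostly_present, 0.5)  # 5
res_missing <- mean_or_na(mostly_missing, 0.5)  # NA
res_half    <- mean_or_na(exactly_half, 0.5)    # 4, because 0.5 is not > 0.5
```

If hours with exactly half of their values missing should also become NA, changing `>` to `>=` in the helper would do that.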

I'm not familiar with xts, but with the data in data.frame form (dat_natime) you could apply the function to each hour with:

suppressPackageStartupMessages({
  library(dplyr)
  library(lubridate)
})

dat_natime %>% 
  group_by(time_hour = floor_date(time, 'hour')) %>% 
  summarise(across(starts_with('X'), \(x) mean_or_na(x, 0.5)))
#> # A tibble: 3 × 3
#>   time_hour              X1    X2
#>   <dttm>              <dbl> <dbl>
#> 1 2001-01-01 10:00:00   7   NA   
#> 2 2001-01-01 11:00:00   2.8  3   
#> 3 2001-01-01 12:00:00   2.5  5.33
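
Since the question asks about xts specifically, the same helper can probably be used there as well. The following is only a sketch, assuming the xts functions endpoints() and period.apply() and a numeric xts object. The data is a small hand-made example (hypothetical values, not the seeded data above), so the result is easy to check by hand.

```r
library(xts)

# Helper repeated from the answer so this snippet runs on its own
mean_or_na <- function(x, na_threshold) {
  if (mean(is.na(x)) > na_threshold) NA_real_ else mean(x, na.rm = TRUE)
}

# Hypothetical data: two hours of 10-minute readings in two columns
idx <- seq(as.POSIXct("2001-01-01 10:00:00", tz = "UTC"),
           by = "10 min", length.out = 12)
x <- xts(cbind(X1 = c(1, 2, 3, NA, NA, 6, NA, NA, NA, NA, 5, 7),
               X2 = c(NA, NA, NA, NA, 4, 8, 1, 2, 3, 4, 5, 6)),
         order.by = idx)

# endpoints() gives the row number of the last observation in each hour
ep <- endpoints(x, on = "hours")

# period.apply() passes each hourly block to FUN;
# apply(block, 2, ...) then runs the helper column by column
hourly <- period.apply(x, ep, function(block) {
  apply(block, 2, mean_or_na, na_threshold = 0.5)
})

hourly
# hour starting 10:00: X1 = 3,  X2 = NA  (4 of 6 X2 values missing)
# hour starting 11:00: X1 = NA, X2 = 3.5 (4 of 6 X1 values missing)
```

Note that period.apply() labels each row with the timestamp of the last observation in the block (10:50 and 11:50 here), not the start of the hour as floor_date() does in the dplyr version.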