r xts

Period apply after filtering missing data


I have a large xts object of 10-minute data that I am trying to average hourly. Some of the values are missing. I want to average every hour that has more than 50% data completeness and replace every hour that has less than 50% data completeness with NA. How would I go about doing that?

Representative data:

library(lubridate)
library(xts)
set.seed(1001)
starttime <- ymd_hms('2001-01-01 10:00:00')
endtime <- ymd_hms('2001-01-01 12:50:00')
timevec <- seq(starttime,endtime,by='10 min')
data <- data.frame(replicate(2,sample(0:10,length(timevec),rep=TRUE)))
datana <- as.matrix(data)
datana[sample(seq_along(datana),0.4*length(datana))] <- NA
dat_na <- as.data.frame(datana)|> type.convert(as.is=TRUE)
dat_natime <- cbind(time=timevec,dat_na)
# keep the time column out of the data so the xts matrix stays numeric
xtsdatna <- xts(dat_na, order.by = timevec)

Expected result

                 time   X1   X2
1 2001-01-01 10:00:00 4.50 6.20
2 2001-01-01 11:00:00   NA 8.00
3 2001-01-01 12:00:00 6.25 4.67

Solution

  • You can write a function that checks whether the proportion of NA values exceeds a threshold, returning NA if it does and the mean of the non-missing values otherwise.

    mean_or_na <- function(x, na_threshold){
      if (mean(is.na(x)) > na_threshold) {
        NA_real_
      } else {
        mean(x, na.rm = TRUE)
      }
    }
    

    I'm not familiar with xts, but using the data.frame format you could apply the function with:

    suppressPackageStartupMessages({
    library(dplyr)
    library(lubridate)})
    
    dat_natime %>% 
      group_by(time_hour = floor_date(time, 'hour')) %>% 
      summarise(across(starts_with('X'), \(x) mean_or_na(x, 0.5)))
    #> # A tibble: 3 × 3
    #>   time_hour              X1    X2
    #>   <dttm>              <dbl> <dbl>
    #> 1 2001-01-01 10:00:00   7   NA   
    #> 2 2001-01-01 11:00:00   2.8  3   
    #> 3 2001-01-01 12:00:00   2.5  5.33
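
For the xts side of the question, the same helper can probably be combined with `endpoints()` and `period.apply()` from the xts package. The following is only a minimal sketch on a small made-up series (the values in `vals` and the names `idx`, `x`, `ep`, and `hourly` are illustrative assumptions, not the asker's data), and it assumes the xts object holds numeric columns only:

```r
library(xts)

# Same helper as above: NA when too much of the block is missing.
mean_or_na <- function(x, na_threshold) {
  if (mean(is.na(x)) > na_threshold) NA_real_ else mean(x, na.rm = TRUE)
}

# Hypothetical toy data: two hours of 10-minute samples, two numeric columns.
idx <- seq(as.POSIXct("2001-01-01 10:00:00", tz = "UTC"),
           by = "10 min", length.out = 12)
vals <- cbind(X1 = c(1, NA, 3, NA, 5, 3,   2, 4, 6, NA, 8, 10),
              X2 = c(NA, NA, NA, NA, 5, 7, 1, 1, 1, 1, 1, 1))
x <- xts(vals, order.by = idx)

# Row positions where each hour ends, e.g. 0, 6, 12 for this series.
ep <- endpoints(x, on = "hours")

# Apply the helper column by column within each hourly block.
hourly <- period.apply(
  x, INDEX = ep,
  FUN = function(block) apply(block, 2, mean_or_na, na_threshold = 0.5)
)

# period.apply() stamps each row with the last observation of the hour;
# truncate the index so each row is labelled with the start of its hour.
index(hourly) <- as.POSIXct(trunc(index(hourly), units = "hours"))

print(hourly)
```

In this toy series the first hour of `X2` has four of six values missing, so it should come out as NA, while the other cells are averaged over whatever values are present. The threshold comparison uses `>`, so an hour with exactly 50% missing values is still averaged; switch to `>=` if such hours should become NA instead.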