rdataframetime-seriesna

Fill Gaps By Taking Column Daily Averages


I'm preparing to run Random Forest, but first I need to get my dataset in a form that would be functional for that. As such, I'm trying to fix a few problems with it. I have theoretical solutions, but I don't know how to execute them:

  1. There are NA gaps in certain columns that I need to fill.

Since the data is a time series that contains half-hours values for over a year, I'm planning on finding the daily average for each day and filling in the NAs with the average for the corresponding day (I realize that this isn't the most statistically sound method, but I just need code that runs at the moment)

  1. Not all of the variables are numeric (some are dates, some are character)

I plan on removing the character columns en masse (though I don't know what code would accomplish this). However, I need to keep the date columns, but I'm not sure how to maintain them when I'm taking averages of every column

  1. The dataset is very large (~200 variables, ~17,000 observations)

I'm hoping to use code that takes the averages of all of these columns at once, so that I don't have to manually go through every column.

I have encountered asks about averaging entire columns indiscriminately (instead of grouped by days) and asks about keeping only the numeric variables (but I need to keep the dates as well). These unfortunately won't work.

Below is example data (I know that the TIMESTAMP and dates might be interpreted as characters by R. I don't know how to fix this when giving sample data, but they read as dates in my dataframe):

TIMESTAMP <- c("2019-04-27 17:30:00", "2019-04-27 18:00:00", "2019-04-27 18:30:00", "2019-04-27 19:00:00", "2019-04-27 19:30:00", "2019-04-28 10:00:00", "2019-04-28 10:30:00", "2019-04-28 11:00:00", "2019-04-28 11:30:00", "2019-04-28 12:00:00")
dates<-c("2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-27", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28", "2019-04-28")
ch4_flux <- c(NA, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16,NA, 61.54,61.53)
distance <- c(1000,1000,NA,125.35,1000,NA,1000,5.50,NA,1000)
Tau <-c(0.0322000, 0.0495000, 0.1737616, 0.1772567, NA, 0.1246816, 0.1435230, 0.1098670, NA, NA)
filename<- c("2019-04-27T173000_AIU-2079.ghg","2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg","2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg",NA,  "2019-04-28T100000_AIU-2079.ghg","2019-04-28T103000_AIU-2079.ghg", "2019-04-28T110000_AIU-2079.ghg",NA)

dd<- data.frame(TIMESTAMP, dates, ch4_flux, distance,Tau, filename)

The head of the original raw dataset is below:

>dput(AMPdates2[1:10, 1:8])

structure(list(TIMESTAMP = structure(c(1556386200, 1556388000, 
1556389800, 1556391600, 1556393400, 1556395200, 1556397000, 1556398800, 
1556400600, 1556402400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Time_period_check = c(30, 30, 30, 30, 30, 30, 30, 30, 30, 
    30), filename = c("2019-04-27T170000_AIU-2079.ghg", "2019-04-27T173000_AIU-2079.ghg", 
    "2019-04-27T180000_AIU-2079.ghg", "2019-04-27T183000_AIU-2079.ghg", 
    "2019-04-27T190000_AIU-2079.ghg", "2019-04-27T193000_AIU-2079.ghg", 
    "2019-04-27T200000_AIU-2079.ghg", "2019-04-27T203000_AIU-2079.ghg", 
    "2019-04-27T210000_AIU-2079.ghg", "2019-04-27T213000_AIU-2079.ghg"
    ), date = structure(c(1556323200, 1556323200, 1556323200, 
    1556323200, 1556323200, 1556323200, 1556323200, 1556323200, 
    1556323200, 1556323200), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    time_adjusted = c("4/27/2019 17:30", "4/27/2019 18:00", "4/27/2019 18:30", 
    "4/27/2019 19:00", "4/27/2019 19:30", "4/27/2019 20:00", 
    "4/27/2019 20:30", "4/27/2019 21:00", "4/27/2019 21:30", 
    "4/27/2019 22:00"), time = c("17:40", "18:10", "18:40", "19:10", 
    "19:40", "20:10", "20:40", "21:10", "21:40", "22:10"), DOY = c(117.735, 
    117.756, 117.777, 117.798, 117.819, 117.84, 117.86, 117.881, 
    117.902, 117.923), daytime = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 
    0)), row.names = c(NA, 10L), class = "data.frame")

Solution

  • library(dplyr)
    
    dd %>% 
      mutate(DAY_grp = as.Date(TIMESTAMP)) %>% 
      mutate(across(where(~is.numeric(.)), ~zoo::na.aggregate(.x)), 
             .by = DAY_grp) %>% 
      select(-DAY_grp)
    
    #> # A tibble: 10 × 6
    #>    TIMESTAMP           dates      ch4_flux distance    Tau filename             
    #>    <chr>               <chr>         <dbl>    <dbl>  <dbl> <chr>                
    #>  1 2019-04-27 17:30:00 2019-04-27     64.9   1000   0.0322 2019-04-27T173000_AI…
    #>  2 2019-04-27 18:00:00 2019-04-27     66.4   1000   0.0495 2019-04-27T180000_AI…
    #>  3 2019-04-27 18:30:00 2019-04-27     65.4    781.  0.174  2019-04-27T183000_AI…
    #>  4 2019-04-27 19:00:00 2019-04-27     64.4    125.  0.177  2019-04-27T190000_AI…
    #>  5 2019-04-27 19:30:00 2019-04-27     63.5   1000   0.108  2019-04-27T193000_AI…
    #>  6 2019-04-28 10:00:00 2019-04-28     62.8    668.  0.125  <NA>                 
    #>  7 2019-04-28 10:30:00 2019-04-28     62.2   1000   0.144  2019-04-28T100000_AI…
    #>  8 2019-04-28 11:00:00 2019-04-28     62.0      5.5 0.110  2019-04-28T103000_AI…
    #>  9 2019-04-28 11:30:00 2019-04-28     61.5    668.  0.126  2019-04-28T110000_AI…
    #> 10 2019-04-28 12:00:00 2019-04-28     61.5   1000   0.126  <NA>
    

    Created on 2025-03-18 with reprex v2.1.1